Ultra-Low Latency
Trading Infrastructure Technology
Hitting Minimal Inter-Thread Latency Costs
Breaking up a trading application into multiple processes allows more of the program's memory to be found in the fastest cache. The cost is sending data between these processes. Understanding how this works from the hardware perspective is crucial to minimising this cost.
(We’ve previously looked into how transforming the use of cache memory helps reduce latency and maximise edge: here)
You may have grown up in an era when a CPU could do one and only one thing at a time, and required an operating system to save everything it was doing, switch to another program, restore everything for that program, and then repeat those steps for the next program, over and over. This gave us the illusion that our Pentium 4 was running multiple programs at once despite it only running each in isolation, one at a time. Intel released the Core 2 Duo in 2006 with 2 cores; this may have been the first consumer multicore processor you encountered. That is, it could execute instructions from 2 separate programs simultaneously.
Core counts have grown from there. Each core is an independent microprocessor with its own dedicated L1 cache. A multi-core CPU has many of these microprocessors that are capable of working together using shared memory. Your phone now probably has 8 cores.
A clever approach that takes advantage of this is to break your data-transforming program up to run simultaneously as 2 (or more) processes, each on its own CPU core, thereby increasing the amount of L1 (and L2) cache available to the program as a whole. When you do this, data must then be transferred from one process to the other. This is the cost you must pay for being able to fit more of your program in L1 and L2 cache. Obviously minimising this inter-process latency cost makes sense, but how low can it go? Each process is assumed to be single-threaded, so we can use the words “process” and “thread” interchangeably in the inter-thread latency analysis. Each process runs with a single hardware thread pinned to its own specific, dedicated core.
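As a rough sketch of what that pinning can look like on Linux (the core number below is a placeholder and error handling is minimal), a process's single thread can be bound to one dedicated core via pthread_setaffinity_np:

```cpp
// Minimal sketch: pin the calling thread to one dedicated core on Linux.
// Compile with: g++ -O2 -pthread pin.cpp
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Returns true if the calling thread was restricted to run only on core_id.
bool pin_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    // pthread_setaffinity_np returns 0 on success, an error number otherwise.
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

int main() {
    const int core_id = 3;  // placeholder: pick a dedicated, ideally isolated core
    if (!pin_to_core(core_id)) {
        std::fprintf(stderr, "failed to pin to core %d\n", core_id);
        return 1;
    }
    std::printf("pinned to core %d\n", core_id);
    // ... the process's single-threaded work loop would run here ...
}
```

In practice the core id comes from deployment configuration, typically alongside kernel settings such as isolcpus so the chosen cores are kept clear of other tasks.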
Core Proximity = Lower Latency
Depending on requirements and our resulting design choices, cores can be close together or far apart. The closer together, the lower the latency. The closest 2 cores can be to each other is on the same CPU and package, where the separate cores actually share L3 cache. (Hyperthreading is ignored here: sharing L1 & L2 between two hardware threads turns out not to be so useful, as it usually increases thrashing, so it is switched off in the BIOS.) Next closest is 2 cores in the same server but on different CPU packages, where L3 cache is not shared between the cores. The furthest apart cores can be is on separate air-gapped servers, requiring physical media to be carried between them to transfer data (someone in sneakers pulling a USB stick out of one and plugging it into the other, known as sneakernet), or maybe on one of the Voyager spacecraft.
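On Linux you can check which cores actually share an L3 before deciding where to pin. A small sketch, assuming the usual sysfs layout under /sys/devices/system/cpu (paths and availability vary between kernels and platforms), built with C++17 for std::filesystem:

```cpp
// Sketch: list which logical CPUs share an L3 cache, using Linux sysfs.
// Treat as illustrative only; the layout can differ between systems.
#include <filesystem>
#include <fstream>
#include <iostream>
#include <set>
#include <string>

namespace fs = std::filesystem;

static std::string read_line(const fs::path& p) {
    std::ifstream in(p);
    std::string s;
    std::getline(in, s);
    return s;
}

int main() {
    std::set<std::string> l3_domains;  // each entry: one group of cores sharing an L3
    for (const auto& cpu : fs::directory_iterator("/sys/devices/system/cpu")) {
        const auto cache_dir = cpu.path() / "cache";
        if (!fs::is_directory(cache_dir)) continue;          // skip non-cpuN entries
        for (const auto& idx : fs::directory_iterator(cache_dir)) {
            if (!fs::exists(idx.path() / "level")) continue;  // skip non-index entries
            if (read_line(idx.path() / "level") == "3")
                l3_domains.insert(read_line(idx.path() / "shared_cpu_list"));
        }
    }
    for (const auto& cores : l3_domains)
        std::cout << "cores sharing one L3: " << cores << '\n';
}
```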
Shared L3 Cache for Optimal Trading
For an optimal trading outcome, our preference is to have cores in the same package sharing L3 cache. If we pin processes to cores on the same package, what has to happen to transfer data between them? One of the processes (the writer) will be sending the data to the other (the reader). To minimise latency, the reader process must poll the shared memory in a busy loop so it can notice and react to any change as quickly as possible after it occurs. This has key consequences for which process's core owns the cache line that is written to and then read as quickly as possible thereafter. A reader process busy-polling the shared memory on its own core requires the core it runs on to own that cache line. The writer process at some point decides to write to that shared memory. This causes a compulsory L1 and L2 cache miss for the writer's core, because the reader's core owns that cache line as a result of the necessary polling.
The first L3 cache reference latency cost is incurred and the core running the writer thread now owns the cache line and can write to it. That line is also in the writer’s L1 & L2 cache and consequently no longer in the reader’s L1 & L2. The reader polls the shared memory again causing an L1 & L2 cache miss for the reader core. A second L3 cache reference latency cost is incurred.
The theoretical lower bound for the minimum inter-thread latency is therefore twice the L3 cache reference latency.
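To make the writer/reader pattern concrete, here is a minimal sketch of the handoff using two threads in one process and a cache-line-aligned, atomically published counter. A real deployment would place the channel in shared memory between two processes (for example via shm_open and mmap) and pin each side to its own core, but the cache-line behaviour being described is the same.

```cpp
// Sketch of the writer/reader handoff described above (single process, two
// threads for brevity). The reader busy-polls a sequence counter; every
// publish forces the cache line to migrate writer core -> reader core via L3.
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>

struct alignas(64) Channel {           // alignas(64): keep the cache line to ourselves
    std::atomic<std::uint64_t> seq{0}; // publication counter, busy-polled by the reader
    std::uint64_t payload{0};          // the data being handed over
};

int main() {
    Channel ch;
    constexpr std::uint64_t kMessages = 1'000'000;

    std::thread reader([&] {
        // Core pinning omitted for brevity (see the pinning sketch above).
        std::uint64_t seen = 0;
        while (seen < kMessages) {
            // Busy-poll: keeps the line in this core's cache until the
            // writer's store invalidates it.
            while (ch.seq.load(std::memory_order_acquire) == seen) { /* spin */ }
            ++seen;
            // A real design would now read ch.payload, re-checking seq
            // afterwards (seqlock style) or using a ring buffer so the writer
            // never overwrites data that has not yet been consumed.
        }
    });

    for (std::uint64_t i = 1; i <= kMessages; ++i) {
        ch.payload = i;                              // write the data first...
        ch.seq.store(i, std::memory_order_release);  // ...then publish it
    }

    reader.join();
    std::printf("handed over %llu messages\n",
                static_cast<unsigned long long>(kMessages));
}
```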
Core-to-core latency benchmarks for the hardware show the limits of what is actually possible between any 2 arbitrary cores on a given CPU, usually laid out as a matrix. (See the link in the comments for benchmark data covering many different CPU models.)
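If you want a rough feel for these numbers on your own hardware, the published matrices are essentially ping-pong measurements: two pinned threads bounce a value back and forth, and half the average round-trip time approximates the one-way handoff. A minimal sketch (the core numbers are placeholders, and there is no warm-up, outlier handling or machine isolation here):

```cpp
// Rough ping-pong estimate of core-to-core latency between two pinned threads.
// Half the mean round-trip time approximates the one-way handoff cost.
// Compile with: g++ -O2 -pthread pingpong.cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <pthread.h>
#include <sched.h>
#include <thread>

static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Separate cache lines so each direction of the ping-pong has its own line.
alignas(64) std::atomic<std::uint64_t> ping{0};
alignas(64) std::atomic<std::uint64_t> pong{0};

int main() {
    constexpr int core_a = 2, core_b = 3;      // placeholders: two cores sharing an L3
    constexpr std::uint64_t iters = 1'000'000;

    std::thread responder([&] {
        pin_to_core(core_b);
        for (std::uint64_t i = 1; i <= iters; ++i) {
            while (ping.load(std::memory_order_acquire) != i) { /* spin */ }
            pong.store(i, std::memory_order_release);
        }
    });

    pin_to_core(core_a);
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 1; i <= iters; ++i) {
        ping.store(i, std::memory_order_release);
        while (pong.load(std::memory_order_acquire) != i) { /* spin */ }
    }
    const auto stop = std::chrono::steady_clock::now();
    responder.join();

    const double total_ns = std::chrono::duration<double, std::nano>(stop - start).count();
    std::printf("approx one-way core-to-core latency: %.1f ns\n", total_ns / iters / 2.0);
}
```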
Optimised Inter-Process Communication Latency
An optimised inter-process communication (IPC) mechanism should therefore have a latency of not much more than double the L3 cache reference latency. If it is much more than this, you've got work to do; if it is any less, alarm bells should ring that something is wrong and maybe you're not measuring what you thought you were.

Next I'll look at some of the details of core pinning.
