Why Cache Misses Matter In Ultra-Low Latency Trading

The edge in ultra-low latency trading comes from getting order messages to market faster: more of your orders get executed, and market-making orders can be pulled out of the way before they are picked off. One way to gain that speed is by carefully managing cache memory.

A computer’s CPU can process data incredibly fast, adding two numbers together in a fifth of a billionth of a second. What slows it down is having to retrieve the values of those numbers from RAM. To avoid this, hardware designers came up with cache memory.

Cache is a small amount of memory built into the CPU that can be read and written very much faster than RAM. The basic idea is that if you need to fetch a value from RAM (or write a value to it), you will probably need that same value again very soon. Storing a copy of it in the cache means that when you do, you can get it fast. To make room for the copy, a value in the cache that has not been used for a while is thrown out (evicted) and that space is used to store the new value.

A cache miss occurs when the program needs to read or write a particular memory location but a copy of the value at that location is not already in the cache. The three reasons this happens are known as the three C’s: Compulsory, Capacity and Conflict. Of these, Capacity misses are the most interesting for optimisation. (Compulsory misses happen the first time a program touches a memory location, since the data cannot already be in the cache. Conflict misses occur because of the way memory locations are mapped onto cache lines, which constrains which of the less recently used lines can be chosen for eviction.)

Capacity Cache Misses

When the CPU accesses a memory location that is not already in the cache, it loads it into the cache and evicts something to make space. If what was evicted is needed again soon, it has to be fetched back into the cache and something else evicted. This repeated fetching and evicting is known as cache thrashing. A memory working set larger than the cache will thrash because of these capacity misses.
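
You can see thrashing for yourself with a few lines of C++. The sketch below is illustrative only (the sizes, pass count and names are assumptions, and the exact numbers will depend on your CPU and compiler flags): it walks two working sets in a shuffled order, so the hardware prefetcher cannot hide the misses. One set is small enough to stay resident in cache, the other is far larger than L3. The small set stays fast; the large one keeps evicting and refetching lines, so every pass stays slow.

// thrash_demo.cpp -- minimal sketch of capacity misses / thrashing.
// Build with optimisation, e.g.: g++ -O2 -std=c++17 thrash_demo.cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

static double ns_per_access(std::size_t bytes) {
    const std::size_t n = bytes / sizeof(std::uint64_t);
    std::vector<std::uint64_t> data(n, 1);

    // Shuffled visit order so the prefetcher cannot hide the misses.
    std::vector<std::uint32_t> order(n);
    std::iota(order.begin(), order.end(), 0u);
    std::shuffle(order.begin(), order.end(), std::mt19937{42});

    volatile std::uint64_t sink = 0;  // keep the loads from being optimised away
    const int passes = 5;

    // Warm-up pass: after this the small set is resident in cache, the large one cannot be.
    for (std::uint32_t i : order) sink += data[i];

    auto t0 = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p)
        for (std::uint32_t i : order) sink += data[i];
    auto t1 = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::nano>(t1 - t0).count() / (double(passes) * n);
}

int main() {
    std::printf("16 KB working set : %.2f ns/access\n", ns_per_access(16u << 10));
    std::printf("128 MB working set: %.2f ns/access\n", ns_per_access(128u << 20));
}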

It is frequently said that the single most important step in optimisation is to minimise cache misses across all the memory an application uses. Breaking out Intel’s Memory Latency Checker (mlc) shows why. The numbers are necessarily hardware specific, but on my desktop (AMD Ryzen 9 3900X 12-Core Processor) the difference between the latency of L1, the fastest cache, and the latency of RAM looks like this:

L1 Cache Latency:

./mlc --idle_latency -b32 -t10 -c1 -i1

1.0 ns

RAM Latency:

./mlc --idle_latency -b32g -t10 -c1 -i1

100.6 ns

On a trading server the numbers will differ, but the order of magnitude of the difference will remain: fetching from RAM takes about 100 times as long as fetching from L1 cache.
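
mlc is the convenient tool, but the same kind of measurement can be approximated by hand with a pointer chase: each load’s address comes from the previous load, so the CPU cannot overlap the misses and the time per step approximates the raw load-to-use latency. A rough sketch follows (single-threaded, no core pinning or huge pages, so expect noisier numbers than mlc reports; the sizes and step count are illustrative):

// chase.cpp -- rough DIY equivalent of an idle-latency measurement.
// Build: g++ -O2 -std=c++17 chase.cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

static double chase_ns(std::size_t bytes, std::size_t steps) {
    const std::size_t n = bytes / sizeof(std::size_t);
    std::vector<std::size_t> next(n);

    // Chain every slot into one random-order cycle so each load's address
    // depends on the value returned by the previous load.
    std::vector<std::size_t> perm(n);
    std::iota(perm.begin(), perm.end(), std::size_t{0});
    std::shuffle(perm.begin(), perm.end(), std::mt19937_64{1});
    for (std::size_t i = 0; i < n; ++i)
        next[perm[i]] = perm[(i + 1) % n];

    std::size_t p = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t s = 0; s < steps; ++s) p = next[p];  // dependent loads
    auto t1 = std::chrono::steady_clock::now();

    volatile std::size_t sink = p;  // keep the chain from being optimised away
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / double(steps);
}

int main() {
    std::printf("32 KB buffer : %.1f ns/load\n", chase_ns(32u << 10, 20'000'000));
    std::printf("256 MB buffer: %.1f ns/load\n", chase_ns(256u << 20, 20'000'000));
}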

Cache Hierarchy

A further improvement hardware designers made to CPUs was to include more than one cache, organised into a hierarchy. The so-called L1 cache is the smallest and fastest. Immediately after a cache line is evicted from L1 it will be found in L2, which is larger but a bit slower. When a line is evicted from L2 it will be found in L3, which is larger again but slower still. L3 is typically about 50 times as slow as L1, and L2 around 5 to 10 times as slow.
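
If you want to know what those cache sizes are on a given Linux box, glibc exposes them through sysconf. The _SC_LEVEL* constants below are glibc extensions and may report 0 (or -1) on some systems, in which case the files under /sys/devices/system/cpu/cpu0/cache/ are the fallback.

// cache_sizes.cpp -- query the cache hierarchy on Linux with glibc.
#include <cstdio>
#include <unistd.h>

int main() {
    // sysconf() returns the size in bytes, or 0 / -1 if it cannot report it.
    std::printf("cache line size : %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    std::printf("L1 data cache   : %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    std::printf("L2 cache        : %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    std::printf("L3 cache        : %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
}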

When executing a very fast instruction such as a simple add, the CPU has to wait if the value to be added is not in the cache, and the fast instruction ends up taking a hundred times as long. Across a program’s entire working set of data and instructions, these delays really do add up.

An obvious response is to make the code and working data all fit into the L1 cache, i.e. make both the code and the data small and tight. This requires careful thought, design and coding, but it is in essence a great idea.
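
As a sketch of what “small and tight” can mean in practice, the hypothetical order book below packs a price level into 16 bytes and checks its total footprint against a 32 KB L1 data cache budget at compile time. The field choices, the depth and the 32 KB figure are assumptions for illustration, not a prescription:

// tight_book.cpp -- illustrative only: keep the hot data small enough
// that the whole working set can sit comfortably in the L1 data cache.
#include <cstddef>
#include <cstdint>

// One price level packed into 16 bytes: four levels per 64-byte cache line.
struct Level {
    std::int64_t price_ticks;   // price in integer ticks, not a double
    std::int32_t quantity;
    std::int32_t order_count;
};
static_assert(sizeof(Level) == 16, "Level should stay at 16 bytes");

constexpr std::size_t kDepth       = 32;          // levels per side we actually touch
constexpr std::size_t kL1DataCache = 32 * 1024;   // assumed 32 KB L1d -- check your CPU

struct Book {
    Level bids[kDepth];
    Level asks[kDepth];
};

// Both sides of the book: 32 * 16 * 2 = 1 KB, a small slice of L1,
// leaving room for the rest of the hot path's data and code.
static_assert(sizeof(Book) <= kL1DataCache / 8, "hot book data outgrew its cache budget");

int main() { Book book{}; (void)book; return 0; }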

Once you try this you start thinking about how great it would be to have a little more space to play with, if somehow the caches were bigger. There is a way they can be: break your application into two processes. Data flows into the first process, is transformed, then flows out into the second. You now have to pay the cost of inter-process latency to push data between the two halves, but you have effectively increased the amount of cache your now dual-process application can use. Is it easier to fit each half of the original program into L1 & L2 than the whole thing? Sure. But at what price? I’ll look at that next time.
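
To make the split concrete, below is a minimal sketch of the kind of single-producer/single-consumer ring that could carry data between the two halves. For brevity it uses two threads in one process as stand-ins; a genuine two-process version would place the ring in shared memory (for example via shm_open and mmap), but the hand-off logic, and the hop you pay latency for, are the same. Everything here is illustrative rather than a recommended design.

// split_pipeline.cpp -- sketch of the two-stage split: stage 1 produces,
// stage 2 transforms and consumes, each on its own core with its own L1/L2.
// Build: g++ -O2 -std=c++17 -pthread split_pipeline.cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <thread>

struct Message { std::uint64_t id; std::uint64_t payload; };

template <std::size_t N>                    // N must be a power of two
class SpscRing {
public:
    bool try_push(const Message& m) {
        const auto head = head_.load(std::memory_order_relaxed);
        if (head - tail_.load(std::memory_order_acquire) == N) return false;  // full
        slots_[head & (N - 1)] = m;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }
    bool try_pop(Message& out) {
        const auto tail = tail_.load(std::memory_order_relaxed);
        if (head_.load(std::memory_order_acquire) == tail) return false;      // empty
        out = slots_[tail & (N - 1)];
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }
private:
    std::array<Message, N> slots_{};
    // Each counter on its own cache line so the two cores do not false-share.
    alignas(64) std::atomic<std::uint64_t> head_{0};   // written by the producer
    alignas(64) std::atomic<std::uint64_t> tail_{0};   // written by the consumer
};

int main() {
    SpscRing<1024> ring;
    constexpr std::uint64_t kCount = 1'000'000;

    std::thread stage2([&] {                 // second half: pop and "transform"
        Message m{};
        for (std::uint64_t received = 0; received < kCount; ) {
            if (ring.try_pop(m)) ++received;
        }
        std::puts("stage 2 done");
    });

    for (std::uint64_t i = 0; i < kCount; ) {  // first half: produce
        if (ring.try_push({i, i * 2})) ++i;
    }
    stage2.join();
}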