Mechanical sympathy: cache, branches, false sharing

TL;DR

Cache hierarchy. L1 ≈ 1 ns, L2 ≈ 4 ns, L3 ≈ 12 ns, RAM ≈ 80 ns. Code that fits in L1 is roughly two orders of magnitude faster than code that stalls on RAM. The unit of movement between levels is a 64-byte cache line.
Branch prediction. A correct prediction is free; a mispredict throws away 15-20 cycles. Branchy code on random data can run 5× slower than the same code on sorted data. Sort, partition, or go branchless on hot paths.
False sharing. Two threads writing different variables that happen to share a cache line will ping-pong the line between cores and run slower than a single-threaded version. Pad hot per-thread state to 64 or 128 bytes.

These three ideas are the foundation of “mechanical sympathy” — the term Martin Thompson borrowed from Jackie Stewart for engineers who design code with the underlying machine in mind. They show up in nearly every HFT systems interview, and getting them right separates production-shaped low-latency code from textbook code.

The cache hierarchy

Every level of memory between the registers and the disk is a cache for the level below it. On a modern x86-64 server CPU:

Level	Size	Latency	Bandwidth
Registers	~1 KB	<1 ns	huge
L1d (per core)	32-48 KB	~1 ns	~1 TB/s
L2 (per core)	256 KB - 1 MB	~4 ns	~500 GB/s
L3 (shared)	tens of MB	~12 ns	~200 GB/s
RAM	tens-hundreds of GB	~80 ns	~50 GB/s
NVMe SSD	TB	~50 µs	~5 GB/s

The numbers are approximate and CPU-specific, but the ratios are stable: each level is roughly an order of magnitude bigger and slower than the one above. Register-to-RAM is about 100× slower than register-to-L1.

The CPU does not move data between levels in single bytes. The unit is a cache line, almost always 64 bytes on x86-64 and AArch64. When you read one byte from RAM, the CPU pulls a 64-byte chunk into L1 and the surrounding bytes are along for the ride.

That has two big consequences:

Sequential access is enormously faster than random access. Walking an array linearly hits the same cache line 64 / size-of-element times before pulling the next one. Walking a linked list typically hits a cold cache line every node.
Locality matters more than algorithmic complexity for moderate n. A linear scan of an array can beat a tree lookup well past n = 10,000 because the tree’s pointer chasing burns cache miss after cache miss.

Practical: keep hot data small and adjacent

The cleanest mechanical-sympathy move is to make sure the data your hot loop touches fits in L1.

A struct of arrays beats an array of structs when the loop only reads a couple of fields. (row[i].price over an array of bloated rows wastes most of every line; prices[i] packs prices densely.)
Pre-allocate. Allocator metadata, scattered allocations, and pointer indirection trash spatial locality.
For tree-shaped data, consider a flat layout (heap-as-array, B-trees with large fanout, succinct data structures).

A real example: an order book’s hot loop is “given a tick, find the level”. The textbook answer is a TreeMap; the production answer is an array indexed by tick. The array version is O(1) and one cache line; the TreeMap is O(log n) and several pointer-chases per lookup. Deep-dive: Order-book data structures.

Branch prediction

The CPU pipeline is dozens of stages deep. By the time it knows whether a branch is taken, the next dozen instructions are already in flight. To avoid stalling, the CPU guesses which way the branch will go and speculates down that path. If the guess is right, no cost. If wrong, the pipeline gets flushed and the wrong-path work is thrown away — typically 15-20 cycles of penalty.

Predictors are good. On a typical server workload, mispredict rates are well under 1%. But “good on average” is not the same as “good on your hot path”.

The classic demo is the famous Stack Overflow question “why is processing a sorted array faster than an unsorted array?” The code is a single branch inside a tight loop:

for (int i = 0; i < N; i++) {
    if (data[i] >= 128) sum += data[i];
}

On random data, the branch is unpredictable — heads/tails. On sorted data, the branch settles into long runs of taken / not-taken and the predictor nails it. The runtime difference can be 5× or more.

Practical moves

Sort or partition data so the branch becomes predictable. Worth it when you process the same data many times.
Go branchless. cmov, bit manipulation tricks, table lookups, or min/max intrinsics replace branches with data-dependent arithmetic. The compiler will often do this for simple cases (x = a < b ? a : b).
Hoist branches out of the loop. If a branch depends on something invariant per iteration, put it outside.
Vectorise. SIMD instructions process 4-16 elements per cycle without a branch per element; the comparison becomes a mask register, the conditional add becomes a masked add.

The general principle: the hot path should have no surprising branches. Either the branch is predictable, or it isn’t there.

This one bites everyone at least once.

Imagine two threads, each with its own counter. They never touch each other’s counter. Surely the increments are independent and scale linearly with cores?

struct Counters { volatile long a; volatile long b; };
Counters counters;
// thread 1: counters.a++
// thread 2: counters.b++

In practice, this can run slower than a single thread doing both increments. The reason: a and b are adjacent in memory, so they sit on the same 64-byte cache line. The cache coherence protocol (MESI on x86) tracks ownership at the cache-line granularity. When thread 1 writes a, the line moves to its core in Modified state, invalidating the copy on thread 2’s core. When thread 2 writes b, the line moves back. Each write becomes a cache-line-bounce — tens of nanoseconds, easily.

This is false sharing. The variables are logically independent; the cache line makes them physically dependent.

Practical moves

Pad hot per-thread state to a full cache line:

struct alignas(64) PaddedCounter {
    volatile long value;
    char padding[64 - sizeof(long)];
};

In Java, the same idea: insert dummy long fields, or use @Contended (since JDK 8, behind -XX:-RestrictContended).

The Disruptor’s ring buffer is the canonical example: head and tail cursors are each padded to 128 bytes (some CPUs prefetch adjacent lines, so 64 isn’t always enough). The result is that a single-producer/single-consumer pair never bounces a cache line between them.

False sharing is invisible in code review. If a piece of “lock-free” code is mysteriously slower than a mutex version, false sharing is the first suspect.

Putting it together

These three ideas compose:

A hot loop that fits in L1 and runs predictable branches over sequentially-laid-out data is roughly the fastest thing a CPU can do. The Disruptor, vectorised SQL engines (DuckDB), and any HFT order book share this DNA.
A loop that fetches scattered data through mispredicted branches and writes back to a falsely shared line is the slowest. Naive hash-map iteration with poor locality, classic textbook linked-list code, and “I added a counter to track stats” patches all qualify.

Mechanical sympathy isn’t an optimisation pass. It’s a way of looking at the data layout first, the algorithm second.

Interview angle. All three come up constantly in HFT systems loops. Be ready to: estimate the cost of a cache miss in nanoseconds; explain why a sorted array can beat an unsorted one through the same code; and spot false sharing as the cause of “lock-free, but slower than a mutex.” Reasoning in concrete numbers is what separates a strong answer from a hand-wave.

TL;DR#

The cache hierarchy#

Practical: keep hot data small and adjacent#

Branch prediction#

Practical moves#

False sharing#

Practical moves#

Putting it together#

Further reading#

TL;DR

The cache hierarchy

Practical: keep hot data small and adjacent

Branch prediction

Practical moves

False sharing

Practical moves

Putting it together

Further reading