[{"content":"TL;DR\nA single-producer/single-consumer (SPSC) ring buffer hands data between two threads with no locks and no allocation — a handful of instructions per operation. It\u0026rsquo;s the queue at the bottom of every feed handler. Correctness rests entirely on two atomic cursors and acquire/release ordering. On ARM that ordering emits real barriers; on x86 it\u0026rsquo;s nearly free — which is exactly why the design deserves care. Naively padding the cursors to \u0026ldquo;avoid false sharing\u0026rdquo; made my benchmark slower — because both threads still read the opposite cursor every operation, so two cache lines ping-pong instead of one. Caching the opposite cursor removes those cross-core reads. Combined with padding it took the same logic from ~32 to ~440 M ops/s on an Apple M4 Pro. The lesson: eliminate true sharing before you fix false sharing. Padding is not a magic word. The one queue you can\u0026rsquo;t avoid Every low-latency system has the same shape at its core. One thread pulls bytes off a socket and decodes them; another consumes those messages and does something useful — updates an order book, runs a strategy, writes to a journal. Between them sits a queue. On the hot path, that queue is almost always a single-producer, single-consumer ring buffer: the cheapest correct way for two threads to communicate.\nIt\u0026rsquo;s also the canonical interview question for these roles, and a good one: it\u0026rsquo;s small enough to write on a whiteboard, yet it forces you to say something precise about the memory model, about cache coherence, and about the difference between code that looks concurrent and code that is. This post builds one, proves it correct, and then measures three versions — because the interesting part isn\u0026rsquo;t the data structure, it\u0026rsquo;s what the hardware does with it.\nA ring is an array and two cursors The data structure is almost insultingly simple: a fixed array plus two indices. 
tail is where the producer writes next; head is where the consumer reads next. The buffer is empty when they\u0026rsquo;re equal and full when advancing tail would collide with head, so one slot always stays unused and a ring of capacity N holds at most N − 1 items. Make the capacity a power of two and wrapping is a single \u0026amp; instead of a branch or a modulo.\n[Figure: a ring buffer is an array and two cursors. Slots 0 to 7; the producer writes at tail, the consumer reads at head; wrap: next = (i + 1) \u0026amp; (N − 1).]\nThe single-producer/single-consumer constraint is what buys the simplicity: because exactly one thread writes tail and exactly one writes head, neither index ever needs a compare-and-swap. Each side owns its own cursor outright. All that remains is making the two threads agree on when a written slot becomes visible — and that\u0026rsquo;s a memory-model question, not a locking one.\nWhy it\u0026rsquo;s correct without a lock\nHere\u0026rsquo;s the whole engine. No mutex, no compare_exchange, just two atomics and the right ordering on each access.\ntemplate \u0026lt;class T, std::size_t Capacity\u0026gt; // Capacity must be a power of two\nclass SpscRing {\n  std::atomic\u0026lt;std::size_t\u0026gt; head_{0}; // written by consumer only\n  std::atomic\u0026lt;std::size_t\u0026gt; tail_{0}; // written by producer only\n  T buf_[Capacity];\n  static constexpr std::size_t kMask = Capacity - 1;\npublic:\n  bool push(const T\u0026amp; v) { // producer thread\n    const std::size_t t = tail_.load(std::memory_order_relaxed);\n    const std::size_t next = (t + 1) \u0026amp; kMask;\n    if (next == head_.load(std::memory_order_acquire)) return false; // full\n    buf_[t] = v; // (1) write the slot\n    tail_.store(next, std::memory_order_release); // (2) publish\n    return true;\n  }\n  bool pop(T\u0026amp; out) { // consumer thread\n    const std::size_t h = head_.load(std::memory_order_relaxed);\n    if (h == tail_.load(std::memory_order_acquire)) return false; // empty\n    out = buf_[h]; // (3) read the slot\n    head_.store((h + 1) \u0026amp; kMask, std::memory_order_release); // (4) free it\n    return true;\n  }\n
}; The correctness argument is two release/acquire pairs:\nProducer → consumer (data). The producer writes the slot (1) then does a release store of tail (2). The consumer does an acquire load of tail before reading the slot (3). The release store synchronises-with the acquire load, so once the consumer sees the new tail, it is guaranteed to see the slot write. That is the entire reason this works. Consumer → producer (free space). Symmetrically, the consumer\u0026rsquo;s release store of head (4) publishes \u0026ldquo;this slot is free\u0026rdquo; to the producer\u0026rsquo;s acquire load of head in the full-check. The producer can\u0026rsquo;t overwrite a slot until the consumer has finished reading it. The loads of your own cursor are relaxed because you are its only writer — there\u0026rsquo;s no one to race with. That precision is the point of the memory-ordering arguments these interviews probe: every relaxed, acquire, and release here is load-bearing, and you should be able to say why each one is the weakest ordering that\u0026rsquo;s still correct.\nThis is also why I benchmarked on ARM. On x86\u0026rsquo;s strongly-ordered model, acquire-loads and release-stores compile to ordinary mov — the ordering is essentially free, and you can get the orderings wrong and still pass your tests. On ARM (Apple Silicon here) they lower to ldar/stlr, real acquire/release instructions. The model is weaker, so the barriers do visible work and mistakes actually surface.\nI ran the consumer-and-producer threads against this under ThreadSanitizer for millions of items: no data races reported, every checksum correct. That\u0026rsquo;s the bar before you trust a lock-free structure — \u0026ldquo;it passed once\u0026rdquo; is not evidence.\nAttempt 1: pad the cursors, and watch it get slower Anyone who\u0026rsquo;s read about false sharing knows the next move. head_ and tail_ sit next to each other, so they likely share one 64- or 128-byte cache line. 
Two cores hammering the same line is the textbook false-sharing pathology. The fix is equally textbook: push each cursor onto its own cache line with alignas.\nSo I did, and benchmarked it: capacity 1024, 200 million uint32 items, producer and consumer on separate cores of an M4 Pro.\nLayout | Throughput | Per op\nnaive (cursors share a line) | ~32 M ops/s | ~31 ns\npadded (cursors on separate lines) | ~21 M ops/s | ~48 ns\nPadding made it a third slower. The textbook fix backfired.\nThe reason is the part the textbook leaves out. False sharing is two cores writing different variables that happen to sit on the same line, and padding fixes that and nothing else. But look again at push and pop: the producer reads head_ on every call, and the consumer reads tail_ on every call. Both cursors are genuinely shared — read by one core, written by the other, every single operation. That\u0026rsquo;s true sharing, not false sharing.\n[Figure: why padding backfired, counting the lines that bounce between producer (P) and consumer (C) on each op. Naive: head and tail on 1 shared line, 32 M ops/s. Padded: head and tail on 2 shared lines, 21 M ops/s. Cached: ≈0 cross-core reads, 440 M ops/s.]\nIn the naive layout, head and tail live on one line. Both cores touch that one line every op, so exactly one line ping-pongs between the two caches. Splitting them onto separate lines means the producer now reads the consumer\u0026rsquo;s line and the consumer reads the producer\u0026rsquo;s line — two lines ping-pong instead of one. Padding doubled the coherence traffic. The fix was real; I\u0026rsquo;d just aimed it at the wrong problem.\nThe real fix: cache the other cursor\nThe traffic comes from reading the opposite cursor every operation. But you mostly don\u0026rsquo;t need a fresh value. The producer only needs head to check for full — and the buffer is rarely full. 
So cache it: keep a private, non-atomic copy of the opposite cursor and only refresh it from the shared atomic when the cached value says you\u0026rsquo;re stuck.\nbool push(const T\u0026amp; v) { // producer thread\n  const std::size_t t = tail_.load(std::memory_order_relaxed);\n  const std::size_t next = (t + 1) \u0026amp; kMask;\n  if (next == head_cache_) { // looks full? refresh once and recheck\n    head_cache_ = head_.load(std::memory_order_acquire);\n    if (next == head_cache_) return false; // genuinely full\n  }\n  buf_[t] = v;\n  tail_.store(next, std::memory_order_release);\n  return true;\n}\nhead_cache_ is an ordinary std::size_t private to the producer; the consumer keeps a tail_cache_ of its own. The cached value is always conservative — a stale head can only make the producer think the buffer is fuller than it is, which costs a wasted refresh, never a correctness bug. With the cursors also padded onto separate lines, each shared line is now written by one core and read by the other only at the rare full/empty boundary. In steady state the cross-core reads nearly vanish.\nLayout | Throughput | Per op\nnaive | ~32 M ops/s | ~31 ns\npadded | ~21 M ops/s | ~48 ns\npadded + cached cursor | ~440 M ops/s | ~2.3 ns\nSame algorithm, same memory orderings, roughly 14× the throughput of the naive version and about 21× the padded one — entirely from controlling which core owns which cache line. This is the whole of mechanical sympathy in one example: the instructions barely changed; the cache-coherence traffic is what moved.\nFix true sharing before false sharing. Padding only pays off once each cache line is owned by one core. Reach for alignas after you\u0026rsquo;ve removed the cross-core reads — do it before, and you can make things slower, as the padded row above shows.\n(Treat the absolute numbers as illustrative — they\u0026rsquo;re one machine, one payload size, threads unpinned on a consumer OS. 
The ratios are the durable result, and they\u0026rsquo;d be even more lopsided on a many-socket server where coherence misses cross a slower interconnect.)\nWhen SPSC is the wrong answer The speed comes entirely from the single-producer, single-consumer assumption, so the honest limits all follow from breaking it:\nMore than one producer or consumer. The moment two threads write the same cursor, you need a CAS loop and you inherit the ABA problem and a memory-reclamation strategy (hazard pointers, RCU). That\u0026rsquo;s a genuinely harder structure — the natural next post. Bursty or oversized consumers. A ring buffer applies backpressure by failing push when full. If your consumer can stall, you need a policy: drop, block, or grow. Silent blocking on the hot path is how a feed handler falls behind the market. You wanted a general queue. This isn\u0026rsquo;t std::queue. It\u0026rsquo;s bounded, single-type, and fastest when you also batch — draining everything available per wake rather than paying the synchronisation per item. If you\u0026rsquo;re reaching for this in anger rather than learning, use a vetted implementation — the SPSC queues in Rigtorp\u0026rsquo;s library, Folly\u0026rsquo;s ProducerConsumerQueue, or moodycamel — all of which apply exactly these tricks and more. The value in writing your own is that afterwards you can read theirs and know precisely why every line is there.\nFurther reading Charles Frasch, \u0026ldquo;SPSC Lock-free FIFO from the Ground Up\u0026rdquo; (CppCon 2023) — the talk that popularised the cached-cursor optimisation, with a companion repo worth following line by line. Herb Sutter, \u0026ldquo;atomic\u0026lt;\u0026gt; Weapons\u0026rdquo; (C++ and Beyond 2012) — still the clearest tour of acquire/release and why relaxed is enough for a single-writer cursor. Jeff Preshing, \u0026ldquo;Acquire and Release Semantics\u0026rdquo; — short, precise, and the mental model that makes the two release/acquire pairs above obvious. 
For the multi-producer sequel — CAS, ABA, and hazard pointers — Anthony Williams, C++ Concurrency in Action (2nd ed.), chapters 5–7. ","permalink":"https://hftengineer.com/posts/spsc-ring-buffer/","summary":"A single-producer/single-consumer ring buffer is the fastest way to move data between two threads — and the canonical low-latency interview question. We build one in C++, prove it correct with acquire/release ordering, and then watch a textbook false-sharing \u0026lsquo;fix\u0026rsquo; make it slower before the real optimisation takes it 14× faster. All numbers measured and ThreadSanitizer-clean.","title":"Lock-free SPSC ring buffer: the queue under every trading system"},{"content":"TL;DR Cache hierarchy. L1 ≈ 1 ns, L2 ≈ 4 ns, L3 ≈ 12 ns, RAM ≈ 80 ns. Code that fits in L1 is roughly two orders of magnitude faster than code that stalls on RAM. The unit of movement between levels is a 64-byte cache line. Branch prediction. A correct prediction is free; a mispredict throws away 15-20 cycles. Branchy code on random data can run 5× slower than the same code on sorted data. Sort, partition, or go branchless on hot paths. False sharing. Two threads writing different variables that happen to share a cache line will ping-pong the line between cores and run slower than a single-threaded version. Pad hot per-thread state to 64 or 128 bytes. These three ideas are the foundation of \u0026ldquo;mechanical sympathy\u0026rdquo; — the term Martin Thompson borrowed from Jackie Stewart for engineers who design code with the underlying machine in mind. They show up in nearly every HFT systems interview, and getting them right separates production-shaped low-latency code from textbook code.\nThe cache hierarchy Every level of memory between the registers and the disk is a cache for the level below it. 
On a modern x86-64 server CPU:\nLevel | Size | Latency | Bandwidth\nRegisters | ~1 KB | \u0026lt;1 ns | huge\nL1d (per core) | 32-48 KB | ~1 ns | ~1 TB/s\nL2 (per core) | 256 KB - 1 MB | ~4 ns | ~500 GB/s\nL3 (shared) | tens of MB | ~12 ns | ~200 GB/s\nRAM | tens-hundreds of GB | ~80 ns | ~50 GB/s\nNVMe SSD | TB | ~50 µs | ~5 GB/s\nThe numbers are approximate and CPU-specific, but the ratios are stable: each level is several times slower, and up to an order of magnitude bigger, than the one above. A RAM access is roughly 80 to 100× slower than an L1 hit.\n[Figure: latency by level, with bar length tracking latency: registers \u0026lt;1 ns, L1d ~1 ns, L2 ~4 ns, L3 (shared) ~12 ns, RAM ~80 ns.]\nThe CPU does not move data between levels in single bytes. The unit is a cache line: 64 bytes on x86-64 and most AArch64 parts, 128 bytes on Apple Silicon. When you read one byte from RAM on a 64-byte-line machine, the CPU pulls a 64-byte chunk into L1 and the surrounding bytes are along for the ride.\nThat has two big consequences:\nSequential access is enormously faster than random access. Walking an array linearly hits the same cache line 64 / size-of-element times before pulling the next one. Walking a linked list typically hits a cold cache line every node.\nLocality matters more than algorithmic complexity for moderate n. A linear scan of an array can beat a tree lookup well past n = 10,000 because the tree\u0026rsquo;s pointer chasing burns cache miss after cache miss.\nPractical: keep hot data small and adjacent\nThe cleanest mechanical-sympathy move is to make sure the data your hot loop touches fits in L1.\nA struct of arrays beats an array of structs when the loop only reads a couple of fields. (row[i].price over an array of bloated rows wastes most of every line; prices[i] packs prices densely.)\nPre-allocate. Allocator metadata, scattered allocations, and pointer indirection trash spatial locality.\nFor tree-shaped data, consider a flat layout (heap-as-array, B-trees with large fanout, succinct data structures). 
A real example: an order book\u0026rsquo;s hot loop is \u0026ldquo;given a tick, find the level\u0026rdquo;. The textbook answer is a TreeMap; the production answer is an array indexed by tick. The array version is O(1) and one cache line; the TreeMap is O(log n) and several pointer-chases per lookup. Deep-dive: Order-book data structures.\nBranch prediction The CPU pipeline is dozens of stages deep. By the time it knows whether a branch is taken, the next dozen instructions are already in flight. To avoid stalling, the CPU guesses which way the branch will go and speculates down that path. If the guess is right, no cost. If wrong, the pipeline gets flushed and the wrong-path work is thrown away — typically 15-20 cycles of penalty.\nPredictors are good. On a typical server workload, mispredict rates are well under 1%. But \u0026ldquo;good on average\u0026rdquo; is not the same as \u0026ldquo;good on your hot path\u0026rdquo;.\nThe classic demo is the famous Stack Overflow question \u0026ldquo;why is processing a sorted array faster than an unsorted array?\u0026rdquo; The code is a single branch inside a tight loop:\nfor (int i = 0; i \u0026lt; N; i++) { if (data[i] \u0026gt;= 128) sum += data[i]; } On random data, the branch is unpredictable — heads/tails. On sorted data, the branch settles into long runs of taken / not-taken and the predictor nails it. The runtime difference can be 5× or more.\nPractical moves Sort or partition data so the branch becomes predictable. Worth it when you process the same data many times. Go branchless. cmov, bit manipulation tricks, table lookups, or min/max intrinsics replace branches with data-dependent arithmetic. The compiler will often do this for simple cases (x = a \u0026lt; b ? a : b). Hoist branches out of the loop. If a branch depends on something invariant per iteration, put it outside. Vectorise. 
SIMD instructions process 4-16 elements per instruction without a branch per element; the comparison becomes a mask register, the conditional add becomes a masked add.\nThe general principle: the hot path should have no surprising branches. Either the branch is predictable, or it isn\u0026rsquo;t there.\nFalse sharing\nThis one bites everyone at least once.\nImagine two threads, each with its own counter. They never touch each other\u0026rsquo;s counter. Surely the increments are independent and scale linearly with cores?\nstruct Counters {\n  volatile long a;\n  volatile long b;\n};\nCounters counters;\n// thread 1: counters.a++\n// thread 2: counters.b++\nIn practice, this can run slower than a single thread doing both increments. The reason: a and b are adjacent in memory, so they sit on the same 64-byte cache line. The cache coherence protocol (MESI and its variants on x86) tracks ownership at cache-line granularity. When thread 1 writes a, the line moves to its core in Modified state, invalidating the copy on thread 2\u0026rsquo;s core. When thread 2 writes b, the line moves back. Each write becomes a cache-line-bounce — tens of nanoseconds, easily.\nThis is false sharing. The variables are logically independent; the cache line makes them physically dependent.\n[Figure: false sharing. Core 1 writes a and core 2 writes b on the same 64-byte line, so the line ping-pongs between cores on every write. Padding each variable to its own line (64 or 128 bytes) removes the bounces.]\nPractical moves\nPad hot per-thread state to a full cache line:\nstruct alignas(64) PaddedCounter {\n  volatile long value;\n  char padding[64 - sizeof(long)];\n};\nIn Java, the same idea: insert dummy long fields, or use @Contended (since JDK 8, behind -XX:-RestrictContended).\nThe Disruptor\u0026rsquo;s ring buffer is the canonical example: head and tail cursors are each padded to 128 bytes (some CPUs prefetch adjacent lines, so 64 isn\u0026rsquo;t always enough). 
The result is that the producer and consumer cursors never falsely share a line. Each cursor is still truly shared, written by one side and read by the other, so padding removes the accidental bounces, not the necessary ones.\nFalse sharing is invisible in code review. If a piece of \u0026ldquo;lock-free\u0026rdquo; code is mysteriously slower than a mutex version, false sharing is the first suspect.\nPutting it together\nThese three ideas compose:\nA hot loop that fits in L1 and runs predictable branches over sequentially-laid-out data is roughly the fastest thing a CPU can do. The Disruptor, vectorised SQL engines (DuckDB), and any HFT order book share this DNA.\nA loop that fetches scattered data through mispredicted branches and writes back to a falsely shared line is the slowest. Naive hash-map iteration with poor locality, classic textbook linked-list code, and \u0026ldquo;I added a counter to track stats\u0026rdquo; patches all qualify.\nMechanical sympathy isn\u0026rsquo;t an optimisation pass. It\u0026rsquo;s a way of looking at the data layout first, the algorithm second.\nInterview angle. All three come up constantly in HFT systems loops. Be ready to: estimate the cost of a cache miss in nanoseconds; explain why a sorted array can beat an unsorted one through the same code; and spot false sharing as the cause of \u0026ldquo;lock-free, but slower than a mutex.\u0026rdquo; Reasoning in concrete numbers is what separates a strong answer from a hand-wave.\nFurther reading Martin Thompson — \u0026ldquo;Mechanical Sympathy\u0026rdquo; — the blog and talks that popularised the term. https://mechanical-sympathy.blogspot.com/ Ulrich Drepper — \u0026ldquo;What Every Programmer Should Know About Memory\u0026rdquo; — long, dense, still the best single reference. https://www.akkadia.org/drepper/cpumemory.pdf Daniel Lemire\u0026rsquo;s blog — branchless tricks and SIMD demonstrations, often with measurable benchmarks. 
https://lemire.me/blog/ Agner Fog — \u0026ldquo;Optimizing software in C++\u0026rdquo; — practical tables of instruction latency, branch behaviour, and pipeline characteristics per microarchitecture. https://www.agner.org/optimize/ LMAX Disruptor whitepaper — false sharing, padding, and the single-writer principle in production. https://lmax-exchange.github.io/disruptor/disruptor.html ","permalink":"https://hftengineer.com/posts/mechanical-sympathy/","summary":"Three hardware ideas that decide whether your low-latency code is fast or pretending to be: how the cache hierarchy works, why branch prediction can change runtime by 5×, and how false sharing makes lock-free code slower than mutexes.","title":"Mechanical sympathy: cache, branches, false sharing"},{"content":"Who\u0026rsquo;s writing I\u0026rsquo;m a software engineer writing Java in production at a high-frequency trading firm. The day job is the kind of work where a 2 ms GC pause is a missed market and the design of a queue can be the difference between making and missing the open. I came to it from regular backend engineering and the interests below reflect that arc.\nWhat this blog is Deep dives on the systems, languages, and tools that make latency-sensitive software work — and the broader ecosystem they sit in. Some of it is JVM-internal (ZGC, JIT, Loom), some of it is data-shaped (DuckDB, ClickHouse, Iceberg), some of it is networking and OS (kernel-bypass, eBPF), some of it is the trading-domain mental models I wish I\u0026rsquo;d had on day one (order books, single-writer designs, market microstructure).\nPosts are deep, opinionated, and skewed toward what\u0026rsquo;s actually useful to know vs. what shows up in marketing material. Where I haven\u0026rsquo;t measured something I\u0026rsquo;ll say so. Where a \u0026ldquo;best practice\u0026rdquo; is mostly cargo-cult I\u0026rsquo;ll say that too.\nHow to navigate The Roadmap is a guided reading order. 
Foundations first, then five branches by domain (JVM, data, networking, trading, distributed systems), each layered from core to advanced. Recommended if you want to follow a path top-to-bottom. The Concepts at a glance page is a wider browse-mode index — short paragraphs on dozens of technologies with deep-dive links where they exist. Tag pages and the archive are linked in the navigation if you\u0026rsquo;d rather browse by topic or chronology. Cadence One or two longer posts per month, plus shorter pieces and the occasional \u0026ldquo;concepts at a glance\u0026rdquo; addition. Drafts often sit for a while before publication; if I\u0026rsquo;m writing about something it\u0026rsquo;s usually because I\u0026rsquo;ve actually been using it or have spent too long thinking about it.\nContact Feedback, corrections, and suggestions for topics are welcome — particularly the \u0026ldquo;I think you got this wrong because…\u0026rdquo; kind, those make the blog better. (A contact method will be added here later.)\n","permalink":"https://hftengineer.com/about/","summary":"\u003ch2 id=\"whos-writing\"\u003eWho\u0026rsquo;s writing\u003c/h2\u003e\n\u003cp\u003eI\u0026rsquo;m a software engineer writing Java in production at a high-frequency trading firm. The day job is the kind of work where a 2 ms GC pause is a missed market and the design of a queue can be the difference between making and missing the open. I came to it from regular backend engineering and the interests below reflect that arc.\u003c/p\u003e\n\u003ch2 id=\"what-this-blog-is\"\u003eWhat this blog is\u003c/h2\u003e\n\u003cp\u003eDeep dives on the systems, languages, and tools that make latency-sensitive software work — and the broader ecosystem they sit in. 
Some of it is JVM-internal (ZGC, JIT, Loom), some of it is data-shaped (DuckDB, ClickHouse, Iceberg), some of it is networking and OS (kernel-bypass, eBPF), some of it is the trading-domain mental models I wish I\u0026rsquo;d had on day one (order books, single-writer designs, market microstructure).\u003c/p\u003e","title":"About"}]