TL;DR
- One thread mutates a piece of state. All other threads send messages to it via a queue. That’s the entire pattern.
- It sidesteps the lock-free / memory-ordering / cache-contention swamp by construction. The mutator has the only writable view of the state, so the state itself needs no coordination; the queue is the one place where threads meet.
- Throughput on a single core is enormous (tens of millions of operations per second is routine), and tail latency is typically far better than with a contended mutex or a multi-writer lock-free design.
- The canonical implementation is a ring buffer fed by a sequencer (LMAX Disruptor) in front of the writer thread. Many production matching engines have this shape.
The premise
Most concurrency bugs and most concurrency cost come from multiple threads writing the same memory. Reads can be cheap, even free; concurrent writes force the cache coherence protocol, the memory model, and your locking discipline all into the hot path.
The single-writer principle takes that as a hint: don’t have multiple writers. Pick one thread, give it sole authority over a piece of state, and route all updates through it.
In the matching-engine case, one thread per symbol owns the order book. The TCP listener, the replication thread, and the strategy threads all send it messages; none of them writes to the book directly.
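To make the ownership rule concrete, here is a minimal sketch in Python. The function name `run_single_writer` and the `(price, qty_delta)` message shape are assumptions for illustration, and `queue.Queue` (which is lock-based internally) merely stands in for the ring buffer discussed later. The point is the structure: producers only enqueue, and exactly one thread ever touches `book`.

```python
import queue
import threading


def run_single_writer(producer_batches):
    """Apply updates from several producer threads through one writer thread."""
    inbox = queue.Queue()  # the only structure shared between threads
    book = {}              # price -> resting quantity; mutated by the writer only
    stop = object()        # sentinel telling the writer to exit

    def writer():
        while True:
            msg = inbox.get()
            if msg is stop:
                return
            price, qty_delta = msg
            book[price] = book.get(price, 0) + qty_delta

    def producer(updates):
        for update in updates:
            inbox.put(update)  # send a message; never touch `book` directly

    writer_thread = threading.Thread(target=writer)
    writer_thread.start()

    producers = [
        threading.Thread(target=producer, args=(batch,))
        for batch in producer_batches
    ]
    for thread in producers:
        thread.start()
    for thread in producers:
        thread.join()

    inbox.put(stop)  # enqueued only after every producer has finished
    writer_thread.join()
    return book
```

Calling `run_single_writer([[(100, 5), (101, 2)], [(100, 3)]])` returns `{100: 8, 101: 2}` however the producer threads interleave, because only the writer ever mutates the dictionary.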
Why one writer is faster than many
One writer beats many, even when the many are lock-free. A single-writer design typically outperforms a “properly” lock-free or wait-free multi-writer design on realistic workloads, for three reasons.
No cache-line bouncing. When two cores write the same cache line, the line ping-pongs between their L1s. Each write becomes a tens-of-nanoseconds round trip. With one writer, the line lives in one core’s L1 and stays there.
No memory-ordering overhead. Lock-free multi-writer code is full of compare_and_swap loops, acquire/release fences, and retry logic. A single writer can use plain stores; the only fence needed is the one between the writer and its consumers, paid once per batch instead of once per operation.
Predictable execution. A single thread mutating its own state has no contention on that state. No retries, no priority inversion, no “lock got expensive under load”. The 99.99th percentile sits far closer to the median than it does under contention.
The cost: serialisation
The obvious downside is that one thread becomes the bottleneck. If the writer saturates a CPU core, no number of additional threads helps.
This sounds bad until you check the numbers. A modern x86 core can do >10M order-book updates per second in a single thread, which works out to under 100 ns per update, provided the data structures are cache-aware and the hot path does not allocate. Most workloads are bounded by network and disk, not by the CPU cost of mutating state.
When the writer is the bottleneck, you partition: one writer per symbol, one writer per shard, one writer per partition key. Writers are independent — no shared mutable state means no shared cost.
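Partitioning can be sketched the same way. The names `shard_for` and `run_sharded` are hypothetical, the CRC32 routing is just one stable hash among many, and `queue.Queue` again stands in for a real ring buffer. Each writer owns one dictionary and nothing else, so writers never coordinate with each other.

```python
import queue
import threading
import zlib


def shard_for(symbol, num_shards):
    """Route a symbol to a shard with a hash that is stable across runs."""
    # Built-in hash() is randomised per process for strings, so use CRC32.
    return zlib.crc32(symbol.encode("utf-8")) % num_shards


def run_sharded(updates, num_shards=4):
    """Apply (symbol, qty) updates using one writer thread per shard."""
    inboxes = [queue.Queue() for _ in range(num_shards)]
    states = [{} for _ in range(num_shards)]  # states[i] is written by writer i only
    stop = object()

    def writer(inbox, state):
        while True:
            msg = inbox.get()
            if msg is stop:
                return
            symbol, qty = msg
            state[symbol] = state.get(symbol, 0) + qty

    writers = [
        threading.Thread(target=writer, args=(inboxes[i], states[i]))
        for i in range(num_shards)
    ]
    for thread in writers:
        thread.start()

    for symbol, qty in updates:
        inboxes[shard_for(symbol, num_shards)].put((symbol, qty))

    for inbox in inboxes:
        inbox.put(stop)
    for thread in writers:
        thread.join()
    return states
```

Because routing depends only on the symbol, every update for a given symbol lands on the same writer, and per-symbol ordering is preserved without any cross-shard synchronisation.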
The Disruptor: making it concrete
The pattern needs three pieces: a queue, a sequencer, and a writer thread.
The queue is a fixed-size ring buffer of pre-allocated entries. Producers don’t allocate — they claim a slot and fill it in place. The writer reads sequentially from the same buffer.
The sequencer assigns a monotonic sequence number to each producer’s claim, ensuring a total order. With a single producer it’s a counter increment; with multiple producers it’s a compare_and_swap. Either way, the writer sees a single ordered stream.
The writer thread drains the buffer in order, applying each entry to the state. It can batch: it processes every available entry before re-checking the producers’ cursor, which amortises any per-batch cost (logging, replication, fence).
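The three pieces fit together roughly as follows. This is a structural sketch of the single-producer case only, not the Disruptor's actual API: `RingBuffer`, `try_publish`, and `drain` are hypothetical names, and the two cursors are plain integers. A real implementation would make them atomics with release/acquire ordering, and a multi-producer sequencer would claim slots with a compare_and_swap.

```python
class RingBuffer:
    """Single-producer, single-consumer ring of pre-allocated slots."""

    def __init__(self, size):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self._mask = size - 1
        # Allocated once up front; producers fill these in place.
        self._slots = [[0, 0] for _ in range(size)]
        self._published = -1  # last sequence the producer has published
        self._consumed = -1   # last sequence the writer has applied

    def try_publish(self, price, qty):
        """Claim the next sequence, fill its slot, publish. False if full."""
        seq = self._published + 1
        if seq - self._consumed > len(self._slots):
            return False      # would overwrite a slot the writer has not read
        slot = self._slots[seq & self._mask]
        slot[0] = price
        slot[1] = qty
        self._published = seq  # publish only after the slot is filled
        return True

    def drain(self, apply):
        """Apply every published entry in sequence order; return batch size."""
        available = self._published  # read the producer cursor once per batch
        batch = available - self._consumed
        for seq in range(self._consumed + 1, available + 1):
            slot = self._slots[seq & self._mask]
            apply(slot[0], slot[1])
        self._consumed = available   # free the whole batch in one step
        return batch
```

The power-of-two size lets the slot index be computed with a mask instead of a modulo, and `drain` touches the producer cursor once per batch rather than once per entry, which is the batching effect described above.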
The original LMAX Disruptor adds two refinements:
- Cache-line padding on the head and tail cursors so they don’t false-share.
- Wait strategies — busy-spin, yielding, sleeping — letting you trade CPU usage for wakeup latency.
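The wait-strategy trade-off is easiest to see side by side. The sketch below is illustrative only: the function names are hypothetical rather than the Disruptor's own classes, and `is_ready` stands for whatever check the writer uses to see whether new entries have been published. Each function returns the number of polls it made, purely so the behaviour is observable.

```python
import time


def wait_until(is_ready, pause):
    """Poll is_ready() until it returns True, calling pause() between polls."""
    polls = 0
    while True:
        polls += 1
        if is_ready():
            return polls
        pause()


def busy_spin(is_ready):
    """Lowest wakeup latency; occupies a full core while waiting."""
    return wait_until(is_ready, lambda: None)


def yielding(is_ready):
    """Lets other threads run between polls at a small latency cost."""
    return wait_until(is_ready, lambda: time.sleep(0))


def sleeping(is_ready, pause_seconds=0.0001):
    """Cheapest in CPU; wakeup latency is at least the pause length."""
    return wait_until(is_ready, lambda: time.sleep(pause_seconds))
```

All three have the same correctness properties; they differ only in what the waiting thread does between polls, which is why the choice can be left as a deployment-time knob.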
In Java, the LMAX library and JCTools’ MPSC / SPSC queues both implement this. In C++, boost::lockfree::spsc_queue covers the single-producer, single-consumer case; production HFT codebases tend to roll their own with platform-specific cache-line-size constants.
When to use it
Reach for single-writer when:
- State has logical ownership. A symbol’s order book belongs to one thread by nature; a session’s state belongs to that session’s thread.
- Throughput per partition is bounded. If one core can carry a partition’s load within your latency budget, single-writer is the simplest correct answer.
- You need predictable tail latency. The pattern eliminates whole classes of contention behaviour.
- Replay-determinism matters. A single writer with an ordered input stream is trivially deterministic — replay the same inputs and get the same state. This is a huge testing and audit win.
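The replay-determinism point can be shown in a few lines. The event tuple `(kind, order_id, price, qty)` and the names `apply_event` and `replay` are assumptions made for this sketch; what matters is that each transition depends only on the current state and the event, never on a clock, a random source, or thread timing.

```python
def apply_event(book, event):
    """One deterministic state transition: no clocks, no randomness, no I/O."""
    kind, order_id, price, qty = event
    if kind == "add":
        book[order_id] = (price, qty)
    elif kind == "cancel":
        book.pop(order_id, None)  # cancelling an unknown order is a no-op
    else:
        raise ValueError(f"unknown event kind: {kind!r}")


def replay(events):
    """Rebuild state from scratch by applying events in their recorded order."""
    book = {}
    for event in events:
        apply_event(book, event)
    return book


log = [
    ("add", 1, 100, 5),
    ("add", 2, 101, 7),
    ("cancel", 1, 0, 0),
]
```

Replaying `log` always yields `{2: (101, 7)}`. Feeding the same events in a different order (the cancel first, say) leaves order 1 on the book, which is why the sequencer's total order matters as much as the single writer does.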
Don’t reach for it when:
- Reads dominate. A read-mostly cache doesn’t need a single writer; copy-on-write or RCU is cheaper.
- State is global and entirely write-shared. A counter incremented by every thread can’t be partitioned by ownership; use atomics and accept the cost.
How it shows up in interviews
The single-writer principle is the textbook answer to a few common HFT systems-design questions:
- “How would you scale this matching engine across cores?” — One writer per symbol, sharded; no cross-symbol consistency requirements means no cross-thread coordination.
- “How do you avoid locks on the hot path?” — You don’t avoid them; you don’t need them. Ownership replaces locking.
- “How do you make this deterministic for replay?” — One ordered input stream, one writer, deterministic state transitions.
Default answer. Internalise the pattern as your reflex: “I’d put a Disruptor in front of a single-threaded writer per shard.” Then justify it with the cache-line and memory-ordering arguments above. That one sentence plus its reasoning covers most of the questions in this section.
Further reading
- LMAX Disruptor — design paper — the canonical writeup, with the cache-line padding diagrams and the original benchmarks. https://lmax-exchange.github.io/disruptor/disruptor.html
- Martin Thompson — “Single Writer Principle” — the named-and-explained version. https://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html
- JCTools — production-quality MPSC / SPSC queues for the JVM. https://github.com/JCTools/JCTools
- Order-book data structures — for the matching-engine application of the same pattern. /posts/order-book-data-structures/
