Lock-free SPSC ring buffer: the queue under every trading system

TL;DR

A single-producer/single-consumer (SPSC) ring buffer hands data between two threads with no locks and no allocation — a handful of instructions per operation. It’s the queue at the bottom of every feed handler.
Correctness rests entirely on two atomic cursors and acquire/release ordering. On ARM that ordering emits real barriers; on x86 it’s nearly free — which is exactly why the design deserves care.
Naively padding the cursors to “avoid false sharing” made my benchmark slower — because both threads still read the opposite cursor every operation, so two cache lines ping-pong instead of one.
Caching the opposite cursor removes those cross-core reads. Combined with padding it took the same logic from ~32 to ~440 M ops/s on an Apple M4 Pro.
The lesson: eliminate true sharing before you fix false sharing. Padding is not a magic word.

The one queue you can’t avoid

Every low-latency system has the same shape at its core. One thread pulls bytes off a socket and decodes them; another consumes those messages and does something useful — updates an order book, runs a strategy, writes to a journal. Between them sits a queue. On the hot path, that queue is almost always a single-producer, single-consumer ring buffer: the cheapest correct way for two threads to communicate.

It’s also the canonical interview question for these roles, and a good one: it’s small enough to write on a whiteboard, yet it forces you to say something precise about the memory model, about cache coherence, and about the difference between code that looks concurrent and code that is. This post builds one, proves it correct, and then measures three versions — because the interesting part isn’t the data structure, it’s what the hardware does with it.

A ring is an array and two cursors

The data structure is almost insultingly simple: a fixed array plus two indices. tail is where the producer writes next; head is where the consumer reads next. The buffer is empty when they’re equal and full when advancing tail would collide with head. Make the capacity a power of two and wrapping is a single & instead of a branch or a modulo.

The single-producer/single-consumer constraint is what buys the simplicity: because exactly one thread writes tail and exactly one writes head, neither index ever needs a compare-and-swap. Each side owns its own cursor outright. All that remains is making the two threads agree on when a written slot becomes visible — and that’s a memory-model question, not a locking one.

Why it’s correct without a lock

Here’s the whole engine. No mutex, no compare_exchange, just two atomics and the right ordering on each access.

template <class T, std::size_t Capacity>   // Capacity must be a power of two
class SpscRing {
  std::atomic<std::size_t> head_{0};       // written by consumer only
  std::atomic<std::size_t> tail_{0};       // written by producer only
  T buf_[Capacity];
  static constexpr std::size_t kMask = Capacity - 1;

 public:
  bool push(const T& v) {                  // producer thread
    const std::size_t t    = tail_.load(std::memory_order_relaxed);
    const std::size_t next = (t + 1) & kMask;
    if (next == head_.load(std::memory_order_acquire)) return false;  // full
    buf_[t] = v;                           // (1) write the slot
    tail_.store(next, std::memory_order_release);                     // (2) publish
    return true;
  }

  bool pop(T& out) {                       // consumer thread
    const std::size_t h = head_.load(std::memory_order_relaxed);
    if (h == tail_.load(std::memory_order_acquire)) return false;     // empty
    out = buf_[h];                         // (3) read the slot
    head_.store((h + 1) & kMask, std::memory_order_release);          // (4) free it
    return true;
  }
};

The correctness argument is two release/acquire pairs:

Producer → consumer (data). The producer writes the slot (1) then does a release store of tail (2). The consumer does an acquire load of tail before reading the slot (3). The release store synchronises-with the acquire load, so once the consumer sees the new tail, it is guaranteed to see the slot write. That is the entire reason this works.
Consumer → producer (free space). Symmetrically, the consumer’s release store of head (4) publishes “this slot is free” to the producer’s acquire load of head in the full-check. The producer can’t overwrite a slot until the consumer has finished reading it.

The loads of your own cursor are relaxed because you are its only writer — there’s no one to race with. That precision is the point of the memory-ordering arguments these interviews probe: every relaxed, acquire, and release here is load-bearing, and you should be able to say why each one is the weakest ordering that’s still correct.

This is also why I benchmarked on ARM. On x86’s strongly-ordered model, acquire-loads and release-stores compile to ordinary mov — the ordering is essentially free, and you can get the orderings wrong and still pass your tests. On ARM (Apple Silicon here) they lower to ldar/stlr, real acquire/release instructions. The model is weaker, so the barriers do visible work and mistakes actually surface.

I ran the consumer-and-producer threads against this under ThreadSanitizer for millions of items: no data races reported, every checksum correct. That’s the bar before you trust a lock-free structure — “it passed once” is not evidence.

Attempt 1: pad the cursors, and watch it get slower

Anyone who’s read about false sharing knows the next move. head_ and tail_ sit next to each other, so they likely share one 64- or 128-byte cache line. Two cores hammering the same line is the textbook false-sharing pathology. The fix is equally textbook: push each cursor onto its own cache line with alignas.

So I did, and benchmarked it: capacity 1024, 200 million uint32 items, producer and consumer on separate cores of an M4 Pro.

Layout	Throughput	Per op
naive (cursors share a line)	~32 M ops/s	~31 ns
padded (cursors on separate lines)	~21 M ops/s	~48 ns

Padding made it a third slower. The textbook fix backfired.

The reason is the part the textbook leaves out. False sharing hurts when two cores write the same line. But look again at push and pop: the producer reads head_ on every call, and the consumer reads tail_ on every call. Both cursors are genuinely shared — read by one core, written by the other, every single operation. That’s true sharing, not false sharing.

In the naive layout, head and tail live on one line. Both cores touch that one line every op, so exactly one line ping-pongs between the two caches. Splitting them onto separate lines means the producer now reads the consumer’s line and the consumer reads the producer’s line — two lines ping-pong instead of one. Padding doubled the coherence traffic. The fix was real; I’d just aimed it at the wrong problem.

The real fix: cache the other cursor

The traffic comes from reading the opposite cursor every operation. But you mostly don’t need a fresh value. The producer only needs head to check for full — and the buffer is rarely full. So cache it: keep a private, non-atomic copy of the opposite cursor and only refresh it from the shared atomic when the cached value says you’re stuck.

bool push(const T& v) {                    // producer thread
  const std::size_t t    = tail_.load(std::memory_order_relaxed);
  const std::size_t next = (t + 1) & kMask;
  if (next == head_cache_) {               // looks full? refresh once and recheck
    head_cache_ = head_.load(std::memory_order_acquire);
    if (next == head_cache_) return false; // genuinely full
  }
  buf_[t] = v;
  tail_.store(next, std::memory_order_release);
  return true;
}

head_cache_ is an ordinary std::size_t private to the producer; the consumer keeps a tail_cache_ of its own. The cached value is always conservative — a stale head can only make the producer think the buffer is fuller than it is, which costs a wasted refresh, never a correctness bug. With the cursors also padded onto separate lines, each shared line is now written by one core and read by the other only at the rare full/empty boundary. In steady state the cross-core reads nearly vanish.

Layout	Throughput	Per op
naive	~32 M ops/s	~31 ns
padded	~21 M ops/s	~48 ns
padded + cached cursor	~440 M ops/s	~2.3 ns

Same algorithm, same memory orderings, roughly 14× the throughput of the naive version and 20× the padded one — entirely from controlling which core owns which cache line. This is the whole of mechanical sympathy in one example: the instructions barely changed; the cache-coherence traffic is what moved.

Fix true sharing before false sharing. Padding only pays off once each cache line is owned by one core. Reach for alignas after you’ve removed the cross-core reads — do it before, and you can make things slower, as the padded row above shows.

(Treat the absolute numbers as illustrative — they’re one machine, one payload size, threads unpinned on a consumer OS. The ratios are the durable result, and they’d be even more lopsided on a many-socket server where coherence misses cross a slower interconnect.)

When SPSC is the wrong answer

The speed comes entirely from the single-producer, single-consumer assumption, so the honest limits all follow from breaking it:

More than one producer or consumer. The moment two threads write the same cursor, you need a CAS loop and you inherit the ABA problem and a memory-reclamation strategy (hazard pointers, RCU). That’s a genuinely harder structure — the natural next post.
Bursty or oversized consumers. A ring buffer applies backpressure by failing push when full. If your consumer can stall, you need a policy: drop, block, or grow. Silent blocking on the hot path is how a feed handler falls behind the market.
You wanted a general queue. This isn’t std::queue. It’s bounded, single-type, and fastest when you also batch — draining everything available per wake rather than paying the synchronisation per item.

If you’re reaching for this in anger rather than learning, use a vetted implementation — the SPSC queues in Rigtorp’s library, Folly’s ProducerConsumerQueue, or moodycamel — all of which apply exactly these tricks and more. The value in writing your own is that afterwards you can read theirs and know precisely why every line is there.

The one queue you can’t avoid#

A ring is an array and two cursors#

Why it’s correct without a lock#

Attempt 1: pad the cursors, and watch it get slower#

The real fix: cache the other cursor#

When SPSC is the wrong answer#

Further reading#

The one queue you can’t avoid

A ring is an array and two cursors

Why it’s correct without a lock

Attempt 1: pad the cursors, and watch it get slower

The real fix: cache the other cursor

When SPSC is the wrong answer

Further reading