TL;DR

  • A single-producer/single-consumer ring buffer is simple because each cursor has exactly one writer. Add a second producer and the load → write → store sequence races: two producers grab the same slot.
  • The fix is to claim a slot atomically with a compare-and-swap (CAS) on a shared position, then publish it through a per-cell sequence number — Vyukov’s bounded multi-producer/multi-consumer (MPMC) queue. The sequence number doubles as the producer/consumer handshake and rules out the ABA problem.
  • It’s correct (ThreadSanitizer-clean under 4 contending producers), but a CAS isn’t free: even 1 producer is ~3× slower than SPSC (~145 vs ~400 M ops/s on an M4 Pro).
  • Worse, adding producers makes it slower: 2 producers ≈ 15 M ops/s, 4 ≈ 6.7 M. They all CAS one counter, so that cache line becomes a serialization point. Lock-free is not the same as scalable.
  • The practical takeaway: on a hot path, prefer one SPSC queue per producer (sharding) or a single-writer design. Reach for a true MPMC queue only when you genuinely can’t avoid it.

Where the easy version stops working

The SPSC ring buffer in the previous post got its speed from one assumption: exactly one thread writes tail, exactly one writes head. No two threads ever fight over a cursor, so no operation needs more than a plain load and a release-store.

That assumption is also its boundary. Feed handlers often have one producer, but plenty of systems don’t: several worker threads writing results into one queue, multiple strategies posting orders to one gateway. The moment two threads call push, the SPSC design doesn’t just slow down — it corrupts.

Two producers and the naive race

Here’s SPSC’s producer side, stripped down:

const std::size_t t = tail_.load(std::memory_order_relaxed);
buf_[t] = v;                                  // write the slot
tail_.store(t + 1, std::memory_order_release);

Run that on two threads. Both load the same t. Both write buf_[t] — one value silently overwrites the other. Both store t + 1. One item is lost, and the count is now wrong forever. There was never a window where this was safe; the single-writer assumption was load-bearing.

To fix it, two producers must never receive the same slot. That means claiming a slot has to be atomic: read the current position and advance it in one indivisible step, so each producer walks away with a unique index. That is exactly what a compare-and-swap (CAS) does.

Claiming a slot with CAS

Give the producers one shared cursor, enqueue_pos, and the consumers another, dequeue_pos. Both are monotonic counters that only move forward. A producer claims a slot by reading enqueue_pos and trying to CAS it from pos to pos + 1: the winner owns position pos, and anyone who loses the race reloads and tries again with the new value. Consumers claim from dequeue_pos the same way. Because each successful CAS hands out a distinct, ever-increasing number, no two producers can ever walk away with the same position.

Those positions index a buffer of N cells (with N a power of two) by pos & (N - 1). Position 0 is cell 0, position N is cell 0 again, position N + 1 is cell 1, and so on: a cursor laps the buffer every N positions, so each cell is owned by exactly one position per lap.

The two shared cursorsEvery producer CASes the one enqueue_pos; every consumer the one dequeue_pos.P1P2two producers,one wins the CASenqueue_pos 1313 mod 8 = 5buffer of N = 8 cells0123456710 mod 8 = 2dequeue_pos 10C1C2two consumers, same raceEach cursor only moves forward, one winner per CAS. A cell serves one position per lap of N,so a freed cell's next owner sits N positions later, which is the +N in seq = pos + N.

Claiming with a CAS is safe, but it opens a gap. A producer that has claimed position pos hasn’t written its cell yet; it is holding a ticket. A bare cursor can’t tell a consumer “claimed but still empty” apart from “claimed and filled”, so a consumer reading the instant enqueue_pos moves would read garbage. We need a per-cell signal that flips only after the data is in place.

The handshake: per-cell sequence numbers

Vyukov’s bounded multi-producer/multi-consumer (MPMC) queue gives every cell its own atomic sequence number, and that one number does all the coordination: no separate “full” flag, no per-cell lock. The rule is the whole trick, so state it plainly: a cell’s seq is the position number it is currently waiting to serve.

Each cell starts with seq set to its own index: cell k has seq = k. For a cell currently serving position pos, two values matter:

  • seq == pos means the cell is empty and waiting for the producer of position pos.
  • seq == pos + 1 means that producer has published, so the cell is full and waiting for the consumer of position pos.

A producer at pos looks at cell pos & (N-1). If seq == pos, the cell is free and it is this producer’s turn: it claims the slot with a CAS that bumps enqueue_pos from pos to pos + 1, writes its data, then publishes by storing seq = pos + 1. The consumer mirrors it exactly: it waits for seq == pos + 1, claims via CAS on dequeue_pos, reads, and frees the cell by storing seq = pos + N.

That last store, seq = pos + N, is where the lapping fact earns its keep. The cell’s next owner is the producer one lap on, at position pos + N, and it will arrive looking for seq == pos + N. So freeing the cell just arms it for whoever comes next: set seq to the position they will present.

The cursor picture used N = 8; drop to a tiny N = 4 and trace one cell’s seq across two laps:

One cell's seq across two laps (N = 4, cell 0)01458writefree +Nwritefree +NAmber = full, plain = empty. Freeing jumps seq by N, so it never repeats a value (no ABA).

That forward-only progression is also what rules out the ABA problem: the classic lock-free hazard where a value goes A → B → A between two reads, so a bare compare-and-swap sees the original A back in place and wrongly concludes nothing changed. Here a cell can never look the way it did a lap ago, so a stalled producer or consumer still holding an old seq just fails its compare and retries. In code, the producer side is:

bool push(const T& v) {
  Cell* cell;
  std::size_t pos = enqueue_pos_.load(std::memory_order_relaxed);
  for (;;) {
    cell = &buf_[pos & kMask];
    const std::size_t seq = cell->seq.load(std::memory_order_acquire);
    const std::intptr_t dif = (std::intptr_t)seq - (std::intptr_t)pos;
    if (dif == 0) {                            // free & our turn — try to claim
      if (enqueue_pos_.compare_exchange_weak(pos, pos + 1,
                                             std::memory_order_relaxed))
        break;                                 // won the ticket
    } else if (dif < 0) {
      return false;                            // full
    } else {
      pos = enqueue_pos_.load(std::memory_order_relaxed);  // lost race — reload
    }
  }
  cell->data = v;
  cell->seq.store(pos + 1, std::memory_order_release);      // publish
  return true;
}

pop is the mirror image. The release-store of seq and the matching acquire-load on the other side give the same synchronises-with relationship the SPSC version relied on: seeing the new sequence number guarantees seeing the data written before it.

Look at the numbers

Correct is not the same as fast. Capacity 1024, single consumer, producers scaled from one to four on an M4 Pro:

QueueProducersThroughputPer op
SPSC (previous post)1~400 M ops/s~2.5 ns
MPMC (Vyukov)1~145 M ops/s~6.9 ns
MPMC (Vyukov)2~15 M ops/s~65 ns
MPMC (Vyukov)4~6.7 M ops/s~150 ns

Two results worth sitting with. First, even with a single producer, the MPMC queue is roughly 3× slower than SPSC. That’s the cost of the CAS and the extra atomic per cell — generality isn’t free, so don’t pay for multi-producer support you don’t use.

Second, and more important: adding producers makes throughput collapse, not climb. Going from one producer to four made it more than 20× slower.

Why it doesn't scale: every producer fights one counterP1P2P3P4enqueue_posone atomic · CAS + retrymeasured (M4 Pro)1 producer145 M/s2 producers15 M/s4 producers6.7 M/smore producers → fewer ops.Lock-free guarantees progress, not throughput — a contended counter serialises everyone.

Every producer CASes the same enqueue_pos. That one counter lives on one cache line, and a successful CAS needs that line in exclusive state — so the line ping-pongs between cores, and under contention most CAS attempts fail and retry. “Lock-free” only promises that some thread always makes progress; it says nothing about throughput. A heavily contended lock-free counter can be slower than a plain mutex, because at least a mutex stops the losers from spinning on the same line.

The real lesson: don’t share the counter

The fastest multi-producer queue is often not a multi-producer queue. “Lock-free” guarantees that some thread makes progress — not that throughput scales. A contended counter serialises everyone, so the win is to remove the contention, not to make the contended op cleverer.

If you control the producers, the better hot-path designs sidestep the shared counter entirely:

  • Shard. Give each producer its own SPSC queue and let the consumer round-robin or drain them. No shared write cursor, so each producer keeps SPSC’s ~2 ns/op — you trade a little consumer complexity for an order of magnitude.
  • Funnel to a single writer. The single-writer principle is exactly this argument generalised: route mutations to one thread and the contention disappears by construction. It’s why the Disruptor and most matching engines are built the way they are.
  • Batch. If you must share a counter, claim a range of slots per CAS instead of one. Amortising the contended operation across many items is the single biggest lever left.

A true MPMC queue is the right tool when producers are genuinely independent, bursty, and not on the critical microsecond path — control-plane work, logging, task distribution. For that, reach for a vetted implementation: Folly’s MPMCQueue and moodycamel::ConcurrentQueue are both descendants of this design with many more refinements. Writing the Vyukov queue once is how you earn the judgement to know when not to use it.

Further reading

  • Dmitry Vyukov, “Bounded MPMC queue” (1024cores.net) — the original writeup of the algorithm above, with the full pop and the rationale for every memory order.
  • Fedor Pikus, “Live Lock-Free or Deadlock” (CppCon 2017) — why lock-free is about progress guarantees, not speed, and how contention erases the difference.
  • Paul McKenney, Is Parallel Programming Hard? — the chapter on counting is the canonical treatment of why shared counters don’t scale and what to do instead.
  • For the single-producer foundation this builds on, the SPSC ring buffer post.