Order-book data structures: the ADT, the array ladder, and the tradeoffs

TL;DR

A limit order book is a tiny, well-defined abstract data type (ADT): add, cancel, modify, match, bestBid/bestAsk. The interesting work is hitting all of them in O(1) without allocating.
A TreeMap<Price, Level> is correct, easy, and slow. Use it for prototypes; do not put it on the order path.
The fast canonical design is an array indexed by tick plus a bitmap of occupied levels, with a doubly-linked FIFO in each cell and a hash map of order-id to node for O(1) cancel.
Threading is boring on purpose: single-writer per book, fed by a sequencer (Disruptor or JCTools), shard by symbol across cores.
Real production books deviate in honest ways: hidden/iceberg orders, peg orders, self-trade prevention, audit/journal, replay determinism. The core data structure doesn’t change; these sit on top of it.

1. Foundations

1.1 The limit order book

A limit order book (LOB) is a per-symbol record of every resting limit order. It has two sides:

Bids: buy orders, sorted by price descending — the highest bid is on top.
Asks (or offers): sell orders, sorted by price ascending — the lowest ask is on top.

The gap between the best bid and best ask is the spread. There is no “the price” of a stock at the exchange level; there is a top-of-book bid, a top-of-book ask, and the last trade price.

Within each side, orders at the same price are ordered by price-time priority (FIFO): the first order to arrive at that price gets filled first. This rule is non-negotiable. Most of the design decisions in this post fall out of taking it seriously.

1.2 L1, L2, L3 market data

This terminology has nothing to do with CPU caches. It describes how much of the book the exchange exposes to you:

L1 (top-of-book): best bid, best ask, last trade. One line per side.
L2 (price-aggregated): every price level on each side, with the aggregate size at that level. No per-order detail.
L3 (full order book): every individual order with its identifier, price, size, and timestamp. You can reconstruct the FIFO queues exactly.

Inside the matching engine you always have L3: it is generating the data, after all. What it publishes is a policy choice. Most US equity venues publish L2 to the public feed and L3 only to certain participants. Crypto exchanges have happily published L3 since the start.

For this post, “the book” means the L3 internal structure inside a matching engine.

1.3 Order types

Restricting to the operations every venue supports:

Limit order: buy/sell at price P or better. If unmatched, it rests on the book.
Market order: buy/sell up to N shares at any price. Walks the opposite side until it is filled or the side is empty.
IOC (Immediate-or-Cancel): like a limit order, but anything that does not match immediately is cancelled, never rested.
FOK (Fill-or-Kill): match the entire quantity immediately or do nothing.
Peg orders: track a reference price (e.g., midpoint, best bid). Conceptually a limit whose price is rewritten on every National Best Bid and Offer (NBBO) change.

Hidden, iceberg, stop, and conditional orders also exist; they are a layer above the core match. We will come back to them in the production-deviations section.

1.4 Operations the book must support

The matching engine has to make all of these work:

Op	Description
`add(order)`	Place a new resting order at its price.
`cancel(orderId)`	Remove a resting order by id.
`modify(orderId, ...)`	Change price/size. Rule of thumb: size-down keeps queue priority, size-up or price-change loses it (cancel/replace).
`match(incoming)`	An aggressive (crossing) order arrives; consume the opposite side level-by-level under price-time priority.
`bestBid()` / `bestAsk()`	Top of book.
`snapshot(depth)`	Top N levels per side, for quoting and market data.

Everything else is a flavour of these.

2. The ADT, precisely

Before we discuss any data structure, fix the abstract operations and target complexities. This is the single most important framing move.

LimitOrderBook:
  add(orderId, side, price, size)            -> Result   // O(1) target
  cancel(orderId)                            -> Result   // O(1) target
  modify(orderId, newPrice, newSize)         -> Result   // O(1) on size-down; cancel/replace otherwise
  match(side, size, limitPrice)              -> Fills    // O(F) where F = number of fills
  bestBid()                                  -> Price    // O(1)
  bestAsk()                                  -> Price    // O(1)
  snapshot(depth)                            -> Levels   // O(depth)

Two things to notice:

match is O(F) — proportional to the number of fills it generates. You cannot do better; you have to actually pop each filled order. The bar is “no extra log factor on top of F.”
cancel is keyed by orderId, not by (price, position). This is what kills the obvious “just keep a sorted list per side” design. You need an external id index that maps an order id to an internal node in O(1).

The complexity targets above are achievable. The hard ones to get to O(1) are add and bestBid/bestAsk after a top-of-book consumption. Those are where the data-structure choice actually matters.

3. Approach 1 — sorted map (`TreeMap` / `std::map`)

The dictionary-correct design. Two ordered maps, one per side, keyed by price; each value is the price level.

NavigableMap<Long, PriceLevel> bids = new TreeMap<>(Comparator.reverseOrder());
NavigableMap<Long, PriceLevel> asks = new TreeMap<>();

PriceLevel holds the FIFO queue (a doubly-linked list of orders). Add a separate HashMap<Long, OrderNode> for cancel-by-id.

Complexities

add: O(log P) to find or insert the price-level entry, then O(1) to append to the level’s tail. P = number of distinct price levels.
cancel: O(1) via the id index; if the level becomes empty, O(log P) to remove it.
bestBid / bestAsk: firstEntry() — TreeMap exposes this in O(log n) worst case but the JDK keeps a cached pointer to the first/last entry, so it is effectively O(1). Do not rely on that detail without naming the assumption.
match: walk the first entry’s queue, fill, remove the level when empty.

Why it is correct but slow

Cache misses everywhere. A red-black tree is a graph of small node objects scattered across the heap. Every comparison is a load that misses L1 and probably L2.
Log factor is real. With ~1000 active levels, log2(P) ~ 10 comparisons per add/remove. Each is a price compare that touches a different cache line. An L1 miss is ~4 ns, an L2 miss ~12 ns; you can blow your entire latency budget chasing pointers.
GC churn. Every empty level you remove is garbage. Every new level you create is an allocation. In Java, steady-state allocations on the order path are forbidden; your latency tail explodes the moment a young-gen collection runs.
Pointer chasing on match. Each fill is a load, a compare, a write, a possible remove.

When it is the right choice

A TreeMap book is fine if your throughput is “a few thousand events per second” and you do not care about microsecond tails. A risk system that consumes a TBT feed and just needs a current top-of-book? Sure. A reference simulator? Sure. The actual matching engine on the order path? No.

4. Approach 2 — array-indexed price ladder (the canonical fast design)

This is the one. Conceptually it is a flattened TreeMap where you have replaced the search tree with a direct address calculation.

4.1 Prices are integers on a tick grid

Exchanges do not trade at arbitrary real numbers. They trade in discrete ticks: $0.01 for US equities above $1, $0.0001 for many crypto pairs, etc. So a price is tick = (price - minPrice) / tickSize, an integer, and the integer range is bounded.

For US equities trading $50–$200 at a 1-cent tick, that is 15,000 ticks per symbol. Most symbols use a much narrower active band; the matching engine might keep an array sized for the full band (or expand on demand).

4.2 The structure

levels[]      : array of PriceLevel, indexed by tick
occupancy[]   : long[] bitmap, bit `tick` set iff levels[tick] is non-empty
bestBidTick   : cursor pointing at the highest occupied bid tick
bestAskTick   : cursor pointing at the lowest occupied ask tick
orders        : HashMap<Long, OrderNode> for cancel-by-id

Each PriceLevel is a struct with head, tail and aggregate size. Each OrderNode is an intrusive doubly-linked list node.

4.3 Operations and complexities

add(order): compute tick = (price - basePrice) / tickSize. levels[tick].appendTail(order). Set occupancy[tick / 64] |= (1L << (tick % 64)). Update bestBidTick/bestAskTick if this beats the current top. O(1).
cancel(orderId): lookup the node in the id index, unlink from its level’s list. If the level became empty, clear its bit; if it was the top of book, scan the bitmap for the next occupied tick. O(1) in the steady case; the bitmap scan is amortised O(1) (see next section).
match(side, qty): starting from bestAskTick (or bestBidTick), pop orders from the head of the level until the quantity is exhausted or the level is empty. If the level empties, advance the cursor by scanning the bitmap. O(F) in the number of fills, plus O(1) per emptied level for the bitmap update.
bestBid / bestAsk: read the cursor. O(1).

4.4 The bitmap trick

After a match consumes a level, the cursor needs to advance to the next occupied tick on the same side. A linear scan over levels[] is fine when the price ladder is dense; it is awful when it is sparse (think after-hours, illiquid options, a stop run that hollows out the book).

A bitmap of occupancy solves this. Pack one bit per tick into a long[]. Now “next occupied tick at or below K” reduces to:

int idx = K >>> 6;
int bit = K & 63;
long word = occupancy[idx] & ((1L << (bit + 1)) - 1); // mask off bits above K
while (word == 0L) {
    if (--idx < 0) return NONE;
    word = occupancy[idx];
}
int leadingZeros = Long.numberOfLeadingZeros(word);
return (idx << 6) + (63 - leadingZeros);

Long.numberOfLeadingZeros is a single hardware lzcnt (or clz) instruction on any modern x86-64 / ARM. Each iteration of the loop tests 64 ticks. With a 16,384-tick ladder, the worst-case is 256 lzcnts — and the worst case is wildly atypical. In steady state the cursor moves at most a handful of bits per match.

This is what people mean when they call this design “mechanically sympathetic.” You have replaced graph traversal with linear memory access and a single-cycle CPU instruction.

The whole trick is turning a search into an instruction. A balanced tree finds the next price level in O(log n) pointer-chasing cache misses; the occupancy bitmap finds it in one lzcnt over a word that covers 64 levels. Same answer, but one is graph traversal and the other is a single cycle on data that’s already in cache.

4.5 Memory cost

The cost is the array. For 16,384 ticks per symbol, with a 64-byte PriceLevel struct, that is 1 MB per book — fine. For ~10,000 symbols on one box, you are talking 10 GB of pre-allocated levels, most of them empty. That is also fine; the empty cells are just zeroes in resident memory (and only the touched pages get demand-paged). What kills you is if your tick range is unbounded: long-dated options, perpetuals with extreme price ranges, or any pricing where the level count is genuinely unbounded.

When the array gets too big you fall back to Approach 3.

5. Approach 3 — hash map by price + linked list per level

HashMap<Long, PriceLevel> bids;   // key = tick
HashMap<Long, PriceLevel> asks;   // key = tick

Drop the array; key the levels by tick into a hash map. Tracking best-bid / best-ask is now harder: you cannot just walk a bitmap. Two flavours:

Pair the hash with a sorted structure (ordered set, skiplist, or RB-tree) of occupied prices. bestBid is prices.first(). Cancel that empties a level removes from the set. This is TreeMap-with-extra-steps, with add/remove cost O(log P).
Maintain only a cursor and accept that finding a new top-of-book after the current one empties is a linear hash scan. Awful.

This design’s only real argument over Approach 2 is when the price range is unbounded. You pay:

Hash collisions on every operation. Even with open addressing and a good hash, you get cache misses on the probe chain.
No spatial locality. The next level after 100.05 is somewhere else in memory.
O(log P) on add/cancel if you maintain a sorted price set, defeating the design’s point.

You see it in the wild for crypto perp books that span 4+ orders of magnitude in price, or for fixed-income where there is no single tick. For everything else, the array+bitmap wins.

6. Approach 4 — hash map + heap of price levels

Replace the sorted set with a binary heap keyed by price. add is O(log P) (push if the level is new), cancel is O(1) (lazy deletion), bestBid is O(1) (heap top), match is O(log P) per consumed level (heap pop).

This is essentially what you get if you naively reach for PriorityQueue in Java. It is rarely the right answer for an order book:

Heap nodes are pointer-chased. Same problem as TreeMap.
Lazy deletion creates garbage at the top of the heap that you have to skip over.
match does O(log P) per level, not per fill. Compared to Approach 2’s true O(F), this is strictly worse on heavy crosses.

The case for it is conceptual: a heap is the “I want the min and I do not care about the order of everything else” structure. If you only ever query top-of-book and never sweep deep, it’s defensible. In a real matching engine the deep sweeps do happen — opening auctions, stop runs, large institutional fills — so this design is brittle exactly when you need it most.

7. Per-order data and the order-id index

All four approaches need an orderId -> internalNode index, so cancellations are O(1).

In Java the straight choice is HashMap<Long, OrderNode>. The autoboxing on every lookup is a sin on the hot path, so you reach for Eclipse Collections LongObjectHashMap or Agrona Long2ObjectHashMap. Both are open-addressing primitive hash tables with linear probing — they keep the keys in a flat long[] and the values in a flat Object[], so a lookup is one cache miss on the probe word and one on the value, not 3–4 like HashMap.Node.

A subtle point: order ids are usually monotonic per-session, not globally hash-friendly. Without a fast scrambler the modulo distribution can be terrible. Apply a Wang or Splittable64 mixer on insert. Eclipse Collections does this for you.

// Sketch
LongObjectHashMap<OrderNode> orders = new LongObjectHashMap<>(initialCapacity);

OrderNode node = orders.get(orderId);  // ~25 ns warm
node.unlinkFrom(level);                // O(1) intrusive list op
orders.removeKey(orderId);

Sizing matters. Pre-allocate to 2× peak active orders so you never rehash on the order path. Rehashing is allowed during warmup, never afterwards.

8. Cache and memory layout

The array+bitmap design is fast because of how it interacts with the memory hierarchy. If you do not get the layout right, none of the algorithmic wins materialise.

8.1 Object layout for `Order` and `PriceLevel`

In Java pre-Valhalla, every object has a 12-byte (compressed-oops) or 16-byte (no compressed oops) header — mark word + class pointer — followed by your fields, padded to 8 bytes. Fields are laid out by the JVM in size-descending order to fill gaps.

A practical OrderNode looks like:

final class OrderNode {
    long orderId;       // 8
    long sizeRemaining; // 8
    int  tick;          // 4 (price as integer tick)
    byte side;          // 1 (BUY/SELL)
    // 3 bytes pad
    OrderNode prev;     // 4 (compressed oop) or 8
    OrderNode next;     // 4 or 8
    PriceLevel level;   // 4 or 8 (parent pointer for fast unlink)
}

Total ~48 bytes with compressed oops. That is less than one cache line (64 bytes), so traversing the FIFO is a single cache miss per node in the worst case and zero misses if the prefetcher has caught on. Order matters: keep prev/next adjacent; the linked-list walk is the hot path.

PriceLevel is similar — headOrder, tailOrder, aggregate size, count, level metadata — and you want it dense, so walking a level touches as few cache lines as possible.

8.2 Allocation discipline

Rule: do not allocate on the order path. Any new that runs at >1k/s puts pressure on TLABs and eventually triggers young-gen GC. Even with ZGC’s sub-ms pauses, the throughput hit on the matching thread shows up in your tail latency.

You meet the rule with object pooling. Pre-allocate OrderNode instances at startup, into a free-list. When an order arrives, pool.acquire(); on cancel/fill, pool.release(node). The intrusive list nodes never become garbage. Same trick for PriceLevel if you grow them dynamically (you usually don’t — the array is sized once).

final class OrderPool {
    private OrderNode head;
    OrderNode acquire() {
        OrderNode n = head;
        if (n == null) return new OrderNode();   // only at warmup
        head = n.next;
        n.next = null; n.prev = null;
        return n;
    }
    void release(OrderNode n) {
        n.next = head;
        head = n;
    }
}

8.3 Off-heap, `Unsafe`, Foreign Function & Memory (FFM)

For the hottest of hot fields — best-bid/ask cursors, the occupancy bitmap — some shops go off-heap with sun.misc.Unsafe (deprecated but still everywhere) or, in modern Java, the Foreign Function and Memory API (FFM). The wins are:

Predictable layout. MemorySegment gives you byte[]-like control without the Java object header.
No card marks. Off-heap writes do not touch GC card tables.
NUMA pinning. With FFM you can MemorySegment.allocate(...) against a specific arena bound to a CPU socket.

The cost is Java ergonomics — no helpful Objects, no equals, no toString. Most engines keep the book on-heap (the live data the matching loop walks) and put cross-thread mailboxes (Disruptor ring buffers, journaling buffers) off-heap. That is the cleanest split.

9. Threading model

The textbook answer here is short and correct: single-writer per book.

9.1 Why a single writer wins

A book’s hot path is a tight loop: read an event from a queue, mutate the FIFO at one or two price levels, possibly emit a fill, possibly emit an L2 update. The instructions are cheap. The contention is what kills you.

If two threads can mutate the book, you need locks. Locks mean cache-line bouncing between cores, kernel-mediated wait/wake when contended, and unbounded latency tails. Even lock-free CAS loops still cause cache-coherency traffic and unpredictable retries.

Martin Thompson’s “single-writer principle” puts numbers on it: a contended lock around a single counter ran 393× slower than a single thread doing the same updates. The order book is not a counter — it is hundreds of fields per event — and the multiplier is worse.

9.2 Sequencer + queue

The standard topology:

gateways (N producers) ---> sequencer ---> book thread ---> outbound publisher
                                |
                                v
                            journaler

A sequencer assigns a monotonically-increasing sequence number to every input event and writes it into a ring buffer (Disruptor) or multi-producer/single-consumer (MPSC) queue (JCTools MpscArrayQueue, Aeron). The book thread consumes events one at a time and processes them sequentially. The journaler reads the same buffer position-by-position and writes to disk for replay.

The Disruptor’s specific contribution is that it gives you a single ring buffer with multiple consumers ordered by dependency (e.g., journal → replicate → process → publish), and it does so without locks by handing each consumer its own position cursor on its own cache line. exchange-core, the open-source matching engine, is structured exactly this way.

sequenceDiagram
  participant G as Gateway (N)
  participant S as Sequencer (Disruptor RB)
  participant J as Journaler
  participant R as Replicator
  participant B as Book thread
  participant P as Publisher
  G->>S: Encoded order event
  S->>J: seq N
  S->>R: seq N
  S->>B: seq N
  J-->>S: durable
  R-->>S: replicated
  B->>B: mutate book
  B->>P: fills + L2 deltas
  P->>G: ack

9.3 Multi-symbol partitioning

One book, one thread is great. The exchange has thousands of symbols. You partition: hash symbol -> shard, each shard is a thread, each shard owns its own books and its own ring buffer. Cross-symbol orders are rare; when they exist (multi-leg, basket orders), they ride on a special protocol that sequences across shards, usually by routing through a coordinator.

This is what exchange-core does: pipelined multi-core processing where each core owns either a stage of the pipeline (risk → match → settle) or a shard of the symbol space. 5M ops/sec on commodity hardware comes from getting both the per-thread design and the cross-thread topology right.

flowchart LR
  G1[Gateway 1] --> SQ[Sequencer]
  G2[Gateway 2] --> SQ
  G3[Gateway 3] --> SQ
  SQ -->|hash by symbol| BK1[Book thread
shard 0]
  SQ --> BK2[Book thread
shard 1]
  SQ --> BK3[Book thread
shard 2]
  BK1 --> PUB[Publisher]
  BK2 --> PUB
  BK3 --> PUB
  SQ --> JNL[(Journal)]
  SQ --> REP[Replicator]

10. Reference Java implementation

What follows is a stripped reference implementation of the array+bitmap design. It compiles and runs as the spine of a matching engine — no risk checks, no journaling, no fancy order types — just enough to show the data structure and the hot loop.

package book;

import java.util.ArrayDeque;
import java.util.Deque;

public final class ArrayLadderBook {

    static final byte BUY = 1, SELL = -1;

    // Pre-sized ladder. Tick 0 = basePrice; tick TICKS-1 = basePrice + (TICKS-1)*tickSize.
    private static final int TICKS = 1 << 14;       // 16,384 levels
    private static final int WORDS = TICKS >>> 6;   // 256 longs

    // Ladder
    private final PriceLevel[] levels = new PriceLevel[TICKS];
    private final long[] occBids = new long[WORDS]; // bit set => occupied bid level
    private final long[] occAsks = new long[WORDS];

    // Cursors
    private int bestBidTick = -1;
    private int bestAskTick = TICKS;

    // Order id -> node
    private final OrderNode[] table;
    private final int tableMask;

    // Pool
    private OrderNode pool;

    public ArrayLadderBook(int orderTableCapacityPow2) {
        for (int i = 0; i < TICKS; i++) levels[i] = new PriceLevel(i);
        this.table = new OrderNode[orderTableCapacityPow2];
        this.tableMask = orderTableCapacityPow2 - 1;
    }

    // ---- public API ----------------------------------------------------------

    public int bestBidTick() { return bestBidTick; }
    public int bestAskTick() { return bestAskTick == TICKS ? -1 : bestAskTick; }

    public void add(long orderId, byte side, int tick, long size) {
        // Cross check: aggressive order? Match before resting.
        long remaining = match(side, tick, size);
        if (remaining == 0) return;
        rest(orderId, side, tick, remaining);
    }

    public boolean cancel(long orderId) {
        OrderNode n = lookup(orderId);
        if (n == null) return false;
        unlinkFromLevel(n);
        evict(orderId);
        release(n);
        return true;
    }

    // Match `incomingSide` against the opposite side, up to `limitTick`, for `qty`. Returns leftover.
    public long match(byte incomingSide, int limitTick, long qty) {
        if (incomingSide == BUY) {
            while (qty > 0 && bestAskTick < TICKS && bestAskTick <= limitTick) {
                qty = consume(levels[bestAskTick], qty);
                if (levels[bestAskTick].head == null) clearAskBit(bestAskTick);
            }
        } else {
            while (qty > 0 && bestBidTick >= 0 && bestBidTick >= limitTick) {
                qty = consume(levels[bestBidTick], qty);
                if (levels[bestBidTick].head == null) clearBidBit(bestBidTick);
            }
        }
        return qty;
    }

    // ---- internals -----------------------------------------------------------

    private void rest(long orderId, byte side, int tick, long size) {
        OrderNode n = acquire();
        n.orderId = orderId; n.side = side; n.tick = tick; n.sizeRemaining = size;
        levels[tick].appendTail(n);
        if (side == BUY) {
            setBidBit(tick);
            if (tick > bestBidTick) bestBidTick = tick;
        } else {
            setAskBit(tick);
            if (tick < bestAskTick) bestAskTick = tick;
        }
        index(orderId, n);
    }

    private long consume(PriceLevel level, long qty) {
        OrderNode n = level.head;
        while (n != null && qty > 0) {
            long take = Math.min(qty, n.sizeRemaining);
            n.sizeRemaining -= take;
            level.aggSize -= take;
            qty -= take;
            // emit fill (orderId, take, tick) -- omitted
            if (n.sizeRemaining == 0) {
                OrderNode next = n.next;
                level.head = next;
                if (next == null) level.tail = null; else next.prev = null;
                evict(n.orderId);
                release(n);
                n = next;
            }
        }
        return qty;
    }

    private void unlinkFromLevel(OrderNode n) {
        PriceLevel level = levels[n.tick];
        if (n.prev != null) n.prev.next = n.next; else level.head = n.next;
        if (n.next != null) n.next.prev = n.prev; else level.tail = n.prev;
        level.aggSize -= n.sizeRemaining;
        if (level.head == null) {
            if (n.side == BUY) clearBidBit(n.tick); else clearAskBit(n.tick);
        }
    }

    // ---- bitmap ops ----------------------------------------------------------

    private void setBidBit(int tick) { occBids[tick >>> 6] |= 1L << (tick & 63); }
    private void setAskBit(int tick) { occAsks[tick >>> 6] |= 1L << (tick & 63); }

    private void clearBidBit(int tick) {
        occBids[tick >>> 6] &= ~(1L << (tick & 63));
        if (tick == bestBidTick) bestBidTick = scanDown(occBids, tick - 1);
    }

    private void clearAskBit(int tick) {
        occAsks[tick >>> 6] &= ~(1L << (tick & 63));
        if (tick == bestAskTick) bestAskTick = scanUp(occAsks, tick + 1);
    }

    private static int scanDown(long[] bits, int from) {
        if (from < 0) return -1;
        int idx = from >>> 6;
        int bit = from & 63;
        long word = bits[idx] & ((bit == 63) ? -1L : (1L << (bit + 1)) - 1);
        while (word == 0L) {
            if (--idx < 0) return -1;
            word = bits[idx];
        }
        return (idx << 6) + (63 - Long.numberOfLeadingZeros(word));
    }

    private static int scanUp(long[] bits, int from) {
        if (from >= TICKS) return TICKS;
        int idx = from >>> 6;
        int bit = from & 63;
        long mask = -1L << bit;
        long word = bits[idx] & mask;
        while (word == 0L) {
            if (++idx >= bits.length) return TICKS;
            word = bits[idx];
        }
        return (idx << 6) + Long.numberOfTrailingZeros(word);
    }

    // ---- order id index (open addressing) ------------------------------------

    private OrderNode lookup(long id) {
        int h = mix(id) & tableMask;
        for (;;) {
            OrderNode n = table[h];
            if (n == null) return null;
            if (n.orderId == id) return n;
            h = (h + 1) & tableMask;
        }
    }
    private void index(long id, OrderNode n) {
        int h = mix(id) & tableMask;
        while (table[h] != null) h = (h + 1) & tableMask;
        table[h] = n;
    }
    private void evict(long id) {
        int h = mix(id) & tableMask;
        while (table[h] != null && table[h].orderId != id) h = (h + 1) & tableMask;
        if (table[h] == null) return;
        table[h] = null;
        // Robin-hood backshift omitted for brevity.
    }
    private static int mix(long id) {
        long x = id;
        x ^= x >>> 33; x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33; x *= 0xc4ceb9fe1a85ec53L;
        x ^= x >>> 33;
        return (int) x;
    }

    // ---- pool ----------------------------------------------------------------

    private OrderNode acquire() {
        OrderNode n = pool;
        if (n == null) return new OrderNode();
        pool = n.next;
        n.prev = null; n.next = null;
        return n;
    }
    private void release(OrderNode n) {
        n.prev = null; n.next = null; n.sizeRemaining = 0;
        n.next = pool; pool = n;
    }

    // ---- types ---------------------------------------------------------------

    static final class PriceLevel {
        final int tick;
        OrderNode head, tail;
        long aggSize;
        PriceLevel(int tick) { this.tick = tick; }
        void appendTail(OrderNode n) {
            n.prev = tail; n.next = null;
            if (tail == null) head = n; else tail.next = n;
            tail = n;
            aggSize += n.sizeRemaining;
        }
    }

    static final class OrderNode {
        long orderId;
        long sizeRemaining;
        int  tick;
        byte side;
        OrderNode prev, next;
    }
}

Things to notice:

The matching loop never allocates. Fills emit through a callback (omitted) that writes into a pre-allocated outbound ring buffer.
clearBidBit / clearAskBit cost one bitmap write plus, in the rare case the cursor moved off the top, one scanDown / scanUp. That is the only place numberOfTrailingZeros/numberOfLeadingZeros runs.
The order-id table is open-addressed. Production code would Robin-hood backshift on delete, left out here for clarity.
Fills, audit writes, replication, risk checks, and price-band checks are all missing. They go upstream of the book thread.

sequenceDiagram
  participant U as Aggressive BUY (limit 105, qty 4500)
  participant B as Book
  participant L4 as Asks[104]
  participant L5 as Asks[105]
  U->>B: add(orderId=42, side=BUY, tick=105, qty=4500)
  B->>L4: consume head O7 (1500)
  L4-->>B: filled 1500, level still has O12, O19
  B->>L4: consume head O12 (2000)
  L4-->>B: filled 2000, level empty
  B->>L5: consume head Oxx (1000)
  L5-->>B: filled 1000, qty exhausted
  B->>U: 3 fills, residue 0
  Note right of B: If residue > 0 it would post at tick 105 on the bid ladder.

11. Production deviations

Real engines diverge from this core design in honest, well-documented ways. Here are the ones worth knowing.

11.1 Hidden and iceberg orders

A hidden order is invisible on the public market data feed but live in the book. Internally it lives in the same FIFO as visible orders, with a flag. Visible-first matching rules apply at some venues; that means inside a price level you do not have one FIFO, you have two — one visible, one hidden — and the matching loop walks the visible one first.

An iceberg publishes only its displaySize; whenever that slice is consumed, it refreshes from the hidden remainder. Implementation: the order’s queue position gets reset on each refresh (you go to the back of the visible FIFO), so the iceberg behaves like a stream of small orders from a queue-priority perspective. Engineering: two size fields per node and a refresh() hook in consume().

11.2 Peg orders

Peg-to-midpoint, peg-to-best-bid, etc. The price is not a constant; it follows the NBBO. The implementation is a side index of pegged orders; on every NBBO change you re-tick them (which means cancel from old level, insert at new level). High-volatility tickers can saturate this; production engines limit peg counts and apply offsets discretely.

11.3 Self-trade prevention

When the same firm has both sides of the cross, exchanges must apply self-trade prevention (STP) rules: cancel-newest, cancel-oldest, decrement-and-cancel, etc. Implementation: every order carries a participantId; the matching loop checks before each fill. A brief but mandatory branch in the hottest loop in the system.

11.4 Audit, journaling, replay determinism

Every input event must be journaled before being applied so you can rebuild state on a crash. The exact-replay requirement means the matching loop must be deterministic. No HashMap iteration order leakage, no non-deterministic scheduling, no time-based decisions that depend on wall-clock. exchange-core’s snapshot+journal model is a textbook implementation: input events are written sequentially to disk (compressed with LZ4), and a snapshot serializer can write the entire book state for fast recovery.

11.5 Price bands, throttles, kill switches

Layered above the book, but they are part of the engine. Every order is checked against a band relative to last trade or reference price. Aggressive orders that would punch through the band are rejected pre-match. These checks have to be fast: they are O(1) lookups on a per-symbol record.

11.6 The “naive vs direct” implementation split

Real engines often ship two book implementations: a simple TreeMap-style one used for tests and slow-path verification, and a fast array+bitmap one used in production. Both implement the same interface. exchange-core does this explicitly with OrderBookNaive and OrderBookDirect. The cross-validation between them under load is one of the cheapest production safety nets you will ever build.

12. Complexity comparison

Operation	TreeMap (sorted map)	Array + bitmap	Hash + sorted set	Hash + heap
`add`	O(log P)	O(1)	O(log P)	O(log P)
`cancel` (by id)	O(1) typical, O(log P) if level becomes empty	O(1) amortised	O(1) typical, O(log P) if level becomes empty	O(1) lazy
`match` per fill	O(log P) per emptied level	O(1) per fill	O(log P) per emptied level	O(log P) per emptied level
`bestBid/bestAsk`	O(1) (cached)	O(1) (cursor)	O(1) (sorted-set head)	O(1) (heap top)
`snapshot(D)`	O(D)	O(D)	O(D)	O(D), unordered → needs sort
Memory cost	nodes, GC churn	array proportional to tick range	hash table + tree	hash table + heap
Cache behaviour	bad (scattered nodes)	excellent (linear scan + bitmap)	poor (probe chains, scattered levels)	poor (heap sift)
Pre-allocation friendly	hard	trivial	medium	medium
Best for	prototypes, low rate	HFT matching engine on bounded tick range	unbounded price ranges	rare; conceptual

The table traces the arc of the whole post: reach for the array-and-bitmap ladder when the tick range is bounded and the latency tail matters (the matching-engine case), and for the sorted-map or hash designs when the price range is open-ended or the event rate doesn’t justify the machinery.

TL;DR#

1. Foundations#

1.1 The limit order book#

1.2 L1, L2, L3 market data#

1.3 Order types#

1.4 Operations the book must support#

2. The ADT, precisely#

3. Approach 1 — sorted map (TreeMap / std::map)#

Complexities#

Why it is correct but slow#

When it is the right choice#

4. Approach 2 — array-indexed price ladder (the canonical fast design)#

4.1 Prices are integers on a tick grid#

4.2 The structure#

4.3 Operations and complexities#

4.4 The bitmap trick#

4.5 Memory cost#

5. Approach 3 — hash map by price + linked list per level#

6. Approach 4 — hash map + heap of price levels#

7. Per-order data and the order-id index#

8. Cache and memory layout#

8.1 Object layout for Order and PriceLevel#

8.2 Allocation discipline#

8.3 Off-heap, Unsafe, Foreign Function & Memory (FFM)#

9. Threading model#

9.1 Why a single writer wins#

9.2 Sequencer + queue#

9.3 Multi-symbol partitioning#

10. Reference Java implementation#

11. Production deviations#

11.1 Hidden and iceberg orders#

11.2 Peg orders#

11.3 Self-trade prevention#

11.4 Audit, journaling, replay determinism#

11.5 Price bands, throttles, kill switches#

11.6 The “naive vs direct” implementation split#

12. Complexity comparison#

Further reading#

TL;DR

1. Foundations

1.1 The limit order book

1.2 L1, L2, L3 market data

1.3 Order types

1.4 Operations the book must support

2. The ADT, precisely

3. Approach 1 — sorted map (`TreeMap` / `std::map`)

Complexities

Why it is correct but slow

When it is the right choice

4. Approach 2 — array-indexed price ladder (the canonical fast design)

4.1 Prices are integers on a tick grid

4.2 The structure

4.3 Operations and complexities

4.4 The bitmap trick

4.5 Memory cost

5. Approach 3 — hash map by price + linked list per level

6. Approach 4 — hash map + heap of price levels

7. Per-order data and the order-id index

8. Cache and memory layout

8.1 Object layout for `Order` and `PriceLevel`

8.2 Allocation discipline

8.3 Off-heap, `Unsafe`, Foreign Function & Memory (FFM)

9. Threading model

9.1 Why a single writer wins

9.2 Sequencer + queue

9.3 Multi-symbol partitioning

10. Reference Java implementation

11. Production deviations

11.1 Hidden and iceberg orders

11.2 Peg orders

11.3 Self-trade prevention

11.4 Audit, journaling, replay determinism

11.5 Price bands, throttles, kill switches

11.6 The “naive vs direct” implementation split

12. Complexity comparison

Further reading