Skip to main content

Lock-Free Data Structures and Memory Ordering Models

What This Concept Is

A lock-free algorithm guarantees that some thread always makes progress, even if individual threads can be delayed indefinitely. It is built from atomic primitives (CAS, fetch-add, load/store with explicit ordering) instead of locks. A stronger property, wait-free, guarantees every thread makes progress in a bounded number of steps. Lock-free is common in production; wait-free is rarer and usually more expensive.

The memory model governs which reorderings of loads and stores are allowed. Modern C/C++/Rust atomics expose ordering levels:

  • memory_order_relaxed -- atomic but no inter-thread ordering. Good for counters where order does not matter.
  • memory_order_acquire -- no reads or writes in the current thread can be reordered before this load.
  • memory_order_release -- no reads or writes in the current thread can be reordered after this store.
  • memory_order_acq_rel -- both, for read-modify-write.
  • memory_order_seq_cst -- sequential consistency: a single total order over all seq_cst operations.

Acquire/release pairs form a happens-before relationship: a reader using acquire sees everything a writer did before its release store, without needing a lock.

Why It Matters Here

Lock-free is where single-threaded intuition breaks most violently. On a relaxed memory model, two stores can be reordered, two loads can be reordered, a store can be reordered past a later load. Without explicit ordering, "obviously correct" code is often wrong.

It is also the setting where the ABA problem lives, where cache-line layout becomes part of correctness, and where you must think carefully about memory reclamation: you cannot free a node another thread might still be about to dereference.

Concrete Example

A classic lock-free stack using CAS:

typedef struct node { int val; struct node *next; } node_t;
_Atomic(node_t *) top;

void push(int v) {
node_t *n = malloc(sizeof *n);
n->val = v;
node_t *old;
do {
old = atomic_load_explicit(&top, memory_order_relaxed);
n->next = old;
} while (!atomic_compare_exchange_weak_explicit(
&top, &old, n,
memory_order_release, memory_order_relaxed));
}

int pop(int *out) {
node_t *old;
do {
old = atomic_load_explicit(&top, memory_order_acquire);
if (!old) return 0;
} while (!atomic_compare_exchange_weak_explicit(
&top, &old, old->next,
memory_order_acquire, memory_order_acquire));
*out = old->val;
free(old); // WARNING: unsafe; see ABA and hazard pointers
return 1;
}

Two hazards:

  • ABA. Thread 1 reads top = A. Thread 2 pops A, pops B, pushes A back (reusing the memory). Thread 1's CAS now succeeds, but top->next is no longer what it was. Fix: tagged pointers (counter + pointer in a double-word CAS), or never reuse memory immediately (hazard pointers, epoch-based reclamation, RCU).
  • Reclamation. free(old) can race with other threads still dereferencing old. Fix: hazard pointers or epoch-based reclamation.

A mutex-protected stack has neither problem; the CAS version trades simplicity for non-blocking progress.

Common Confusion / Misconception

"Lock-free is always faster than locked." Under low contention it is usually comparable or marginally faster; under high contention a well-tuned mutex can beat a lock-free structure because the mutex blocks contenders instead of spinning them through CAS failures. Always measure.

"atomic<int> with default ordering is as fast as int." No. std::atomic<T> defaults to seq_cst, which on x86 inserts mfence-equivalent barriers on stores. For counters where order does not matter, memory_order_relaxed can be 5-10x faster.

"x86 is strongly ordered, so I can skip memory orderings." C/C++ memory ordering is a source-level contract that also constrains the compiler. Even on x86, a missing release can let the compiler reorder stores. Do not rely on architectural strength to fix source-level hazards.

"CAS avoids the ABA problem." CAS creates the ABA problem. Avoiding it requires additional machinery.

How To Use It

Before reaching for lock-free:

  1. Confirm contention is real and your mutex version is the bottleneck.
  2. Confirm you understand the memory model of your language and target CPU.
  3. Plan memory reclamation (hazard pointers, epoch-based, or arena allocation) before writing the first CAS.
  4. Use published, peer-reviewed algorithms (Treiber stack, Michael-Scott queue, Harris-Michael list) rather than inventing your own.
  5. Test with a thread-sanitizer or formal model checker when possible.

Check Yourself

  1. What does a release store guarantee to a subsequent acquire load of the same variable?
  2. Explain the ABA problem and one mitigation.
  3. Why is memory_order_relaxed sometimes the correct choice?

Mini Drill or Application

Implement the Treiber lock-free stack above in C using <stdatomic.h>. Write a stress test with 8 threads pushing/popping. Use ThreadSanitizer (-fsanitize=thread). Observe that it reports no data race. Then intentionally change memory_order_release to memory_order_relaxed on the push CAS and note what TSan flags.

Read This Only If Stuck