Lock-Free Data Structures and Memory Ordering Models
What This Concept Is
A lock-free algorithm guarantees that some thread always makes progress, even if individual threads can be delayed indefinitely. It is built from atomic primitives (CAS, fetch-add, load/store with explicit ordering) instead of locks. A stronger property, wait-free, guarantees every thread makes progress in a bounded number of steps. Lock-free is common in production; wait-free is rarer and usually more expensive.
The memory model governs which reorderings of loads and stores are allowed. Modern C/C++/Rust atomics expose ordering levels:
memory_order_relaxed-- atomic but no inter-thread ordering. Good for counters where order does not matter.memory_order_acquire-- no reads or writes in the current thread can be reordered before this load.memory_order_release-- no reads or writes in the current thread can be reordered after this store.memory_order_acq_rel-- both, for read-modify-write.memory_order_seq_cst-- sequential consistency: a single total order over allseq_cstoperations.
Acquire/release pairs form a happens-before relationship: a reader using acquire sees everything a writer did before its release store, without needing a lock.
Why It Matters Here
Lock-free is where single-threaded intuition breaks most violently. On a relaxed memory model, two stores can be reordered, two loads can be reordered, a store can be reordered past a later load. Without explicit ordering, "obviously correct" code is often wrong.
It is also the setting where the ABA problem lives, where cache-line layout becomes part of correctness, and where you must think carefully about memory reclamation: you cannot free a node another thread might still be about to dereference.
Concrete Example
A classic lock-free stack using CAS:
typedef struct node { int val; struct node *next; } node_t;
_Atomic(node_t *) top;
void push(int v) {
node_t *n = malloc(sizeof *n);
n->val = v;
node_t *old;
do {
old = atomic_load_explicit(&top, memory_order_relaxed);
n->next = old;
} while (!atomic_compare_exchange_weak_explicit(
&top, &old, n,
memory_order_release, memory_order_relaxed));
}
int pop(int *out) {
node_t *old;
do {
old = atomic_load_explicit(&top, memory_order_acquire);
if (!old) return 0;
} while (!atomic_compare_exchange_weak_explicit(
&top, &old, old->next,
memory_order_acquire, memory_order_acquire));
*out = old->val;
free(old); // WARNING: unsafe; see ABA and hazard pointers
return 1;
}
Two hazards:
- ABA. Thread 1 reads
top = A. Thread 2 popsA, popsB, pushesAback (reusing the memory). Thread 1's CAS now succeeds, buttop->nextis no longer what it was. Fix: tagged pointers (counter + pointer in a double-word CAS), or never reuse memory immediately (hazard pointers, epoch-based reclamation, RCU). - Reclamation.
free(old)can race with other threads still dereferencingold. Fix: hazard pointers or epoch-based reclamation.
A mutex-protected stack has neither problem; the CAS version trades simplicity for non-blocking progress.
Common Confusion / Misconception
"Lock-free is always faster than locked." Under low contention it is usually comparable or marginally faster; under high contention a well-tuned mutex can beat a lock-free structure because the mutex blocks contenders instead of spinning them through CAS failures. Always measure.
"atomic<int> with default ordering is as fast as int." No. std::atomic<T> defaults to seq_cst, which on x86 inserts mfence-equivalent barriers on stores. For counters where order does not matter, memory_order_relaxed can be 5-10x faster.
"x86 is strongly ordered, so I can skip memory orderings." C/C++ memory ordering is a source-level contract that also constrains the compiler. Even on x86, a missing release can let the compiler reorder stores. Do not rely on architectural strength to fix source-level hazards.
"CAS avoids the ABA problem." CAS creates the ABA problem. Avoiding it requires additional machinery.
How To Use It
Before reaching for lock-free:
- Confirm contention is real and your mutex version is the bottleneck.
- Confirm you understand the memory model of your language and target CPU.
- Plan memory reclamation (hazard pointers, epoch-based, or arena allocation) before writing the first CAS.
- Use published, peer-reviewed algorithms (Treiber stack, Michael-Scott queue, Harris-Michael list) rather than inventing your own.
- Test with a thread-sanitizer or formal model checker when possible.
Check Yourself
- What does a
releasestore guarantee to a subsequentacquireload of the same variable? - Explain the ABA problem and one mitigation.
- Why is
memory_order_relaxedsometimes the correct choice?
Mini Drill or Application
Implement the Treiber lock-free stack above in C using <stdatomic.h>. Write a stress test with 8 threads pushing/popping. Use ThreadSanitizer (-fsanitize=thread). Observe that it reports no data race. Then intentionally change memory_order_release to memory_order_relaxed on the push CAS and note what TSan flags.