Lock-Free Data Structures and Memory Ordering Models

What This Concept Is

A lock-free algorithm guarantees that some thread always makes progress, even if individual threads can be delayed indefinitely. It is built from atomic primitives (CAS, fetch-add, load/store with explicit ordering) instead of locks. A stronger property, wait-free, guarantees every thread makes progress in a bounded number of steps. Lock-free is common in production; wait-free is rarer and usually more expensive.

The memory model governs which reorderings of loads and stores are allowed. Modern C/C++/Rust atomics expose ordering levels:

memory_order_relaxed -- atomic but no inter-thread ordering. Good for counters where order does not matter.
memory_order_acquire -- no reads or writes in the current thread can be reordered before this load.
memory_order_release -- no reads or writes in the current thread can be reordered after this store.
memory_order_acq_rel -- both, for read-modify-write.
memory_order_seq_cst -- sequential consistency: a single total order over all seq_cst operations.

Acquire/release pairs form a happens-before relationship: a reader using acquire sees everything a writer did before its release store, without needing a lock.

Why It Matters Here

Lock-free is where single-threaded intuition breaks most violently. On a relaxed memory model, two stores can be reordered, two loads can be reordered, a store can be reordered past a later load. Without explicit ordering, "obviously correct" code is often wrong.

It is also the setting where the ABA problem lives, where cache-line layout becomes part of correctness, and where you must think carefully about memory reclamation: you cannot free a node another thread might still be about to dereference.

Concrete Example

A classic lock-free stack using CAS:

typedef struct node { int val; struct node *next; } node_t;
_Atomic(node_t *) top;

void push(int v) {
    node_t *n = malloc(sizeof *n);
    n->val = v;
    node_t *old;
    do {
        old = atomic_load_explicit(&top, memory_order_relaxed);
        n->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
        &top, &old, n,
        memory_order_release, memory_order_relaxed));
}

int pop(int *out) {
    node_t *old;
    do {
        old = atomic_load_explicit(&top, memory_order_acquire);
        if (!old) return 0;
    } while (!atomic_compare_exchange_weak_explicit(
        &top, &old, old->next,
        memory_order_acquire, memory_order_acquire));
    *out = old->val;
    free(old);    // WARNING: unsafe; see ABA and hazard pointers
    return 1;
}

Two hazards:

ABA. Thread 1 reads top = A. Thread 2 pops A, pops B, pushes A back (reusing the memory). Thread 1's CAS now succeeds, but top->next is no longer what it was. Fix: tagged pointers (counter + pointer in a double-word CAS), or never reuse memory immediately (hazard pointers, epoch-based reclamation, RCU).
Reclamation. free(old) can race with other threads still dereferencing old. Fix: hazard pointers or epoch-based reclamation.

A mutex-protected stack has neither problem; the CAS version trades simplicity for non-blocking progress.

Common Confusion / Misconception

"Lock-free is always faster than locked." Under low contention it is usually comparable or marginally faster; under high contention a well-tuned mutex can beat a lock-free structure because the mutex blocks contenders instead of spinning them through CAS failures. Always measure.

"atomic<int> with default ordering is as fast as int." No. std::atomic<T> defaults to seq_cst, which on x86 inserts mfence-equivalent barriers on stores. For counters where order does not matter, memory_order_relaxed can be 5-10x faster.

"x86 is strongly ordered, so I can skip memory orderings." C/C++ memory ordering is a source-level contract that also constrains the compiler. Even on x86, a missing release can let the compiler reorder stores. Do not rely on architectural strength to fix source-level hazards.

"CAS avoids the ABA problem." CAS creates the ABA problem. Avoiding it requires additional machinery.

How To Use It

Before reaching for lock-free:

Confirm contention is real and your mutex version is the bottleneck.
Confirm you understand the memory model of your language and target CPU.
Plan memory reclamation (hazard pointers, epoch-based, or arena allocation) before writing the first CAS.
Use published, peer-reviewed algorithms (Treiber stack, Michael-Scott queue, Harris-Michael list) rather than inventing your own.
Test with a thread-sanitizer or formal model checker when possible.

Check Yourself

What does a release store guarantee to a subsequent acquire load of the same variable?
Explain the ABA problem and one mitigation.
Why is memory_order_relaxed sometimes the correct choice?

Mini Drill or Application

Implement the Treiber lock-free stack above in C using <stdatomic.h>. Write a stress test with 8 threads pushing/popping. Use ThreadSanitizer (-fsanitize=thread). Observe that it reports no data race. Then intentionally change memory_order_release to memory_order_relaxed on the push CAS and note what TSan flags.

What This Concept Is​

Why It Matters Here​

Concrete Example​

Common Confusion / Misconception​

How To Use It​

Check Yourself​

Mini Drill or Application​

Read This Only If Stuck​