Atomics and Memory Ordering at the C Level

What This Concept Is

C11 standardized atomics in <stdatomic.h>. An atomic operation is one that either happens entirely or not at all -- no other thread can observe it half-done.

Two orthogonal ideas:

Atomicity. The read-modify-write of a shared counter (*p += 1) is not atomic in general: it is a separate load, add, and store. atomic_fetch_add(p, 1) is atomic: the whole load-add-store is one indivisible operation, which the hardware provides through a LOCK-prefixed instruction on x86, an ldxr/stxr retry loop on ARM, or similar.
Memory ordering. Modern CPUs and compilers reorder instructions aggressively for speed. Memory ordering specifies which reorderings are forbidden as observed by other threads.

The orderings, roughly from strongest to weakest (acquire and release are peers: one applies to loads, the other to stores):

Order	What it guarantees
`memory_order_seq_cst` (default)	Global total order; a single consistent history across threads.
`memory_order_acq_rel`	For read-modify-write operations: acquire on the load half, release on the store half. (A lock uses acquire to lock and release to unlock.)
`memory_order_acquire`	Subsequent memory accesses cannot be reordered before this load.
`memory_order_release`	Prior memory accesses cannot be reordered after this store.
`memory_order_relaxed`	Atomic value only. No ordering guarantees relative to other memory.

The default (seq_cst) is the easy one to reason about; the weaker ones give measurable speedups on ARM and PowerPC but require careful invariant reasoning.
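As a minimal sketch of how an order is selected in code: the plain functions use seq_cst implicitly, while the _explicit variants take the order as a final argument. The names hits, bump_default, bump_relaxed, and current_hits are invented for this illustration.

```c
#include <stdatomic.h>

static atomic_int hits = 0;   /* hypothetical shared counter */

/* No order argument means memory_order_seq_cst. Returns the old value. */
int bump_default(void) {
    return atomic_fetch_add(&hits, 1);
}

/* Same atomic increment, but with no ordering relative to other memory. */
int bump_relaxed(void) {
    return atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
}

/* Reads the current value with the default seq_cst ordering. */
int current_hits(void) {
    return atomic_load(&hits);
}
```

Both increments are equally atomic; the only difference is what they promise about surrounding non-atomic memory.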

Why It Matters Here

Atomics let you build lock-free data structures (queues, counters, flags), and they are the mechanism underneath mutexes and condition variables themselves. Equally important: they are often not what you need, so it pays to know exactly when you do need them. A while (!done) { ... } loop where done is a plain variable can spin forever even after another thread writes done = 1, because the unsynchronized access is a data race and the compiler is allowed to hoist the read out of the loop. The fix is atomic_bool done, not "just a volatile" -- volatile is for memory-mapped hardware registers, not threads.
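A minimal sketch of the atomic_bool fix, assuming POSIX threads (build with -pthread). The names spinner and run_done_flag_demo are hypothetical. Because every iteration performs a real atomic load, the compiler cannot hoist the read out of the loop.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool done = false;

/* Spins until another thread sets done. Each iteration is a fresh atomic load. */
static void *spinner(void *arg) {
    (void)arg;
    while (!atomic_load(&done)) {
        /* busy-wait; real code would do work or yield here */
    }
    return NULL;
}

/* Returns 0 once the spinner has observed the flag and exited, -1 on thread errors. */
int run_done_flag_demo(void) {
    pthread_t t;
    if (pthread_create(&t, NULL, spinner, NULL) != 0)
        return -1;
    atomic_store(&done, true);          /* seq_cst store, visible to the spinner */
    if (pthread_join(t, NULL) != 0)
        return -1;
    return 0;
}
```

With a plain bool in place of atomic_bool, the same loop would be a data race and could legally never terminate.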

Concrete Example

A shared counter, two ways:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_long counter_atomic = 0;
long        counter_plain  = 0;
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

static void *worker_atomic(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        atomic_fetch_add_explicit(&counter_atomic, 1, memory_order_relaxed);
    return NULL;
}

static void *worker_plain(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) counter_plain++;
    return NULL;
}

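Two threads running worker_atomic end up with counter_atomic == 2000000 every time. Two threads running worker_plain typically end up with an unpredictable number below 2000000, because counter_plain++ is a separate load, add, and store, and the threads' steps interleave so that increments are lost. Formally this is a data race, which is undefined behavior in C11, so no particular result is guaranteed (an optimizing compiler may even collapse the loop into a single addition).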
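To actually run the atomic version, the worker needs a driver that starts and joins the threads. The following is one possible sketch, assuming POSIX threads (build with -pthread); total, add_worker, and run_atomic_counter are names chosen for this illustration, not part of the listing above.

```c
#include <pthread.h>
#include <stdatomic.h>

#define N_THREADS 2
#define N_ITERS   1000000

static atomic_long total = 0;

/* Each thread adds N_ITERS to the shared counter, one atomic increment at a time. */
static void *add_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < N_ITERS; i++)
        atomic_fetch_add_explicit(&total, 1, memory_order_relaxed);
    return NULL;
}

/* Returns the final count, or -1 if a thread could not be started. */
long run_atomic_counter(void) {
    pthread_t t[N_THREADS];
    int started = 0;

    atomic_store(&total, 0);
    for (; started < N_THREADS; started++)
        if (pthread_create(&t[started], NULL, add_worker, NULL) != 0)
            break;
    /* pthread_join synchronizes with thread exit, so the read below sees every add. */
    for (int i = 0; i < started; i++)
        pthread_join(t[i], NULL);

    return started == N_THREADS ? atomic_load(&total) : -1;
}
```

Relaxed ordering is enough here because nothing else is published through the counter; only the count itself matters.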

A release/acquire handshake -- the pattern that replaces a mutex for a single flag:

atomic_int ready = 0;
int payload;

/* producer */
payload = 42;
atomic_store_explicit(&ready, 1, memory_order_release);

/* consumer */
while (atomic_load_explicit(&ready, memory_order_acquire) == 0) {}
printf("%d\n", payload);   /* guaranteed to be 42 */

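release on the store guarantees that the write payload = 42 is visible to any thread whose acquire load observes ready == 1. The guarantee needs both halves: the release store and the acquire load must be on the same atomic variable. Without the ordering, the consumer could see ready == 1 but still read a stale payload.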
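The handshake fragments can be wrapped in two threads to make them runnable. This is a sketch under the assumption of POSIX threads (build with -pthread); the function names producer, consumer, and run_handshake are hypothetical.

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int ready = 0;
static int payload = 0;          /* plain int, published through ready */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;                                            /* 1. write the data */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* 2. publish it     */
    return NULL;
}

static void *consumer(void *arg) {
    int *out = arg;
    /* Spin until the release store is observed by this acquire load. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0) { }
    *out = payload;              /* ordered after the acquire load, so it sees 42 */
    return NULL;
}

/* Returns the value the consumer read, or -1 if a thread could not be started. */
int run_handshake(void) {
    pthread_t p, c;
    int seen = -1;

    if (pthread_create(&p, NULL, producer, NULL) != 0)
        return -1;
    if (pthread_create(&c, NULL, consumer, &seen) != 0) {
        pthread_join(p, NULL);
        return -1;
    }
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

The payload itself stays a plain int: the acquire/release pair on ready is what makes reading it safe.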

Common Confusion / Misconception

"Atomics replace mutexes." They replace mutexes for simple shared state: counters, flags, single-pointer handoffs. As soon as you need to update two fields consistently (e.g., head and size of a queue), a mutex is almost always clearer and faster than any lock-free design you will write without specialized training.
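As an illustration of the two-field case, here is a small sketch of a fixed-capacity ring in which head and size must change together. The struct ring type, CAP, and the ring_* functions are invented for this example and assume POSIX threads.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define CAP 8

struct ring {
    pthread_mutex_t lock;
    size_t head;        /* index of the oldest element */
    size_t size;        /* number of stored elements   */
    int    items[CAP];
};

void ring_init(struct ring *r) {
    pthread_mutex_init(&r->lock, NULL);
    r->head = 0;
    r->size = 0;
}

/* Returns 1 on success, 0 if the ring is full. */
int ring_push(struct ring *r, int v) {
    int ok = 0;
    pthread_mutex_lock(&r->lock);
    if (r->size < CAP) {
        r->items[(r->head + r->size) % CAP] = v;
        r->size++;
        ok = 1;
    }
    assert(r->size <= CAP);          /* invariant holds whenever the lock is held */
    pthread_mutex_unlock(&r->lock);
    return ok;
}

/* Returns 1 and stores the oldest element in *out, or 0 if the ring is empty. */
int ring_pop(struct ring *r, int *out) {
    int ok = 0;
    pthread_mutex_lock(&r->lock);
    if (r->size > 0) {
        *out = r->items[r->head];
        r->head = (r->head + 1) % CAP;   /* head and size are updated as one unit */
        r->size--;
        ok = 1;
    }
    pthread_mutex_unlock(&r->lock);
    return ok;
}
```

No other thread can ever see head advanced without size decremented, which is exactly the guarantee that is hard to get from two independent atomics.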

"volatile makes my variable thread-safe." It does not. volatile tells the compiler not to cache the value in a register across reads or reorder accesses with other volatile accesses. It does not make a read-modify-write such as counter++ indivisible, it says nothing about ordering relative to non-volatile memory, and it does not emit the memory fences required on multicore CPUs. Use _Atomic in C (std::atomic in C++) for thread communication.

"memory_order_relaxed is faster so I should use it." Only if you understand you are giving up all ordering relative to other memory. A relaxed store of a done flag can become visible to another thread before the payload the flag was supposed to signal. The payload read is then a data race and can return a stale or partially written value. Default to seq_cst; weaken only with a specific reason and a microbenchmark.

How To Use It

Practical guidance:

Default to mutexes for anything non-trivial. They are cheap on uncontended paths and correct on complex invariants.
Use atomics for (a) counters, (b) single-writer/many-reader flags, (c) single-producer/single-consumer handoffs.
Leave the ordering at seq_cst unless a profiler shows contention and you can name the invariant the weaker order preserves.
Never use volatile for thread communication.
Prefer atomic_flag with atomic_flag_test_and_set/atomic_flag_clear for the absolute minimum primitive (a spinlock).
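A rough sketch of the atomic_flag spinlock mentioned in the last item, assuming POSIX threads (build with -pthread). The names guard, spin_lock, spin_unlock, and run_spinlock_demo are hypothetical, and a production lock would add backoff or yielding.

```c
#include <pthread.h>
#include <stdatomic.h>

#define SPIN_THREADS 2
#define SPIN_ITERS   100000

static atomic_flag guard = ATOMIC_FLAG_INIT;
static long shared_total = 0;    /* plain variable, protected by guard */

/* test_and_set returns the previous state: loop while someone else holds it. */
static void spin_lock(void) {
    while (atomic_flag_test_and_set_explicit(&guard, memory_order_acquire)) { }
}

/* Release ordering publishes the writes made inside the critical section. */
static void spin_unlock(void) {
    atomic_flag_clear_explicit(&guard, memory_order_release);
}

static void *spin_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < SPIN_ITERS; i++) {
        spin_lock();
        shared_total++;          /* safe: only one thread is in here at a time */
        spin_unlock();
    }
    return NULL;
}

/* Returns the final total, or -1 if a thread could not be started. */
long run_spinlock_demo(void) {
    pthread_t t[SPIN_THREADS];
    int started = 0;

    shared_total = 0;
    for (; started < SPIN_THREADS; started++)
        if (pthread_create(&t[started], NULL, spin_worker, NULL) != 0)
            break;
    for (int i = 0; i < started; i++)
        pthread_join(t[i], NULL);

    return started == SPIN_THREADS ? shared_total : -1;
}
```

Acquire on lock and release on unlock is the same pairing as the ready/payload handshake, applied to a critical section instead of a single value.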

Check Yourself

Why is counter++ in two threads sometimes wrong even when counter is volatile?
What does memory_order_release on a store guarantee about earlier writes?
In which situations is "just use a mutex" the right answer?

Mini Drill or Application

Do all four:

Compile both worker functions above. Run them 20 times each. Record the distribution of counter_plain.
Add a mutex-protected version of the plain worker. Compare runtimes vs the atomic version at high contention (e.g., 8 threads × 1,000,000 iterations).
Implement the ready/payload handshake and measure the latency in nanoseconds with clock_gettime(CLOCK_MONOTONIC, ...).
In one sentence, explain why memory_order_relaxed on a counter is safe but on a handoff flag is not.
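For the timing parts of the drill, a small helper around clock_gettime may save some arithmetic. This is only a sketch: elapsed_ns and time_call_ns are hypothetical names, and the feature-test macro is an assumption for strict -std=c11 builds on POSIX systems.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* exposes clock_gettime under strict C modes */
#endif
#include <stdint.h>
#include <time.h>

/* Difference end - start in nanoseconds. */
int64_t elapsed_ns(const struct timespec *start, const struct timespec *end) {
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL
         + (int64_t)(end->tv_nsec - start->tv_nsec);
}

/* Times one call of fn with the monotonic clock. Returns -1 if the clock fails. */
int64_t time_call_ns(void (*fn)(void)) {
    struct timespec a, b;
    if (clock_gettime(CLOCK_MONOTONIC, &a) != 0)
        return -1;
    fn();
    if (clock_gettime(CLOCK_MONOTONIC, &b) != 0)
        return -1;
    return elapsed_ns(&a, &b);
}
```

A single measurement is noisy; repeating the call many times and looking at the distribution gives a more trustworthy number.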

Read This Only If Stuck

COD 2.11: Parallelism and Synchronization (Part 2)
COD 5.8: Parallelism and Memory Hierarchies -- Cache Coherence
Man page: man 7 stdatomic (note: atomics are a C11 language feature; see the standard)
cppreference: <stdatomic.h> -- comprehensive reference, C view
Preshing on Programming: Acquire and Release Semantics
