Deadlock: Necessary Conditions, Prevention, Detection, Recovery

What This Concept Is

A deadlock is a state in which a set of threads is permanently blocked, each waiting for a resource held by another in the same set. Coffman (1971) proved that four conditions are simultaneously necessary for deadlock:

Mutual exclusion. At least one resource is held in a non-sharable mode.
Hold and wait. A thread holds at least one resource while waiting for another.
No preemption. A resource cannot be forcibly taken away from the thread holding it.
Circular wait. There exists a cycle T1 -> T2 -> ... -> Tn -> T1 in the "waits-for" graph.

Eliminating any one of them prevents deadlock. That gives four families of countermeasures, plus two strategies that handle deadlock after the fact:

Prevention. Design the system so one Coffman condition never holds.
Avoidance. Dynamically refuse resource requests that would lead to an unsafe state (Banker's algorithm). Rarely practical in general OSes; used in some schedulers.
Detection + recovery. Let deadlock happen; periodically search the waits-for graph for cycles; break them by killing a thread or rolling back a transaction.
Ignoring. The ostrich approach: common for infrequent deadlocks in desktop OSes. Not acceptable for databases or long-running services.

Why It Matters Here

Every practical concurrent system chooses among these strategies explicitly. Databases combine prevention (lock ordering where possible) with detection (cycle search in the wait-for graph, kill victim transaction). Filesystems typically prevent (strict inode lock ordering). The Linux kernel prevents (documented lock ordering and lockdep verification at debug time).

Knowing the four conditions lets you read any deadlock report and answer the diagnostic question: which condition must we break, and what does that cost?

Concrete Example

Prevention by ordering. A bank transfer must lock two accounts. If some transfer paths acquire in one order and others in the reverse, deadlock. Fix: every caller acquires by account id, smallest first.

void transfer(account_t *a, account_t *b, long amt) {
    account_t *first  = (a->id < b->id) ? a : b;
    account_t *second = (a->id < b->id) ? b : a;
    pthread_mutex_lock(&first->m);
    pthread_mutex_lock(&second->m);
    ...
    pthread_mutex_unlock(&second->m);
    pthread_mutex_unlock(&first->m);
}

Now the waits-for graph can never have a back edge.

Detection in databases. Postgres runs a deadlock detector after a configurable timeout on any blocked lock acquisition. It builds the waits-for graph, searches for a cycle, picks a victim, aborts it with ERROR: deadlock detected. The application retries.

Recovery by rollback. Transactional systems can preempt a lock by rolling the transaction back to a checkpoint. The four Coffman conditions become three because "no preemption" is negated.

Common Confusion / Misconception

"Deadlock is a programming bug that testing will find." Deadlocks depend on timing. They can sit dormant for years and surface once a workload pattern shifts. Static analysis, lockdep-style runtime verification, and documented lock orders catch them; typical test suites usually do not.

"Avoidance (Banker's algorithm) is the modern approach." No. Banker's requires knowing the maximum resource demand of each thread in advance, which is rarely true. Textbook prominent, production rare.

"I broke one Coffman condition, so I am safe." Only if you broke it everywhere in the system for the resources involved. A single code path that still has hold-and-wait reintroduces the risk.

"Livelock and starvation are the same as deadlock." Deadlock means permanent blocking; livelock means threads change state but never make progress; starvation means some thread never makes progress while others do. Different diagnoses, different fixes.

How To Use It

For every multi-lock protocol you design or audit:

List the resources (locks, semaphores, buffers).
For each code path that acquires more than one, write down the acquisition order.
Check for cycles in the resulting lock-order graph. If any exists, deadlock is possible.
Choose a strategy: prevent (define a global order), detect (add a timer + cycle check), or recover (use transactional rollback).
Document the chosen strategy and the Coffman condition you are breaking.

Check Yourself

State the four Coffman conditions and give a one-line example of how each could be broken.
Why is circular-wait prevention the most commonly used strategy in practice?
How does database deadlock detection differ from OS deadlock prevention?

Mini Drill or Application

Take your dining-philosophers code from concept 12. Add logging that prints, every 100 ms, the waits-for graph (which philosopher holds which fork, which is waiting for which). In the buggy version, capture the cycle. Apply the ordering fix. Capture the graph again and show that it is acyclic.

Detection and Recovery in Practice

Operating systems rarely run a deadlock detector on kernel locks; the cost is too high and the usual answer is "don't create cycles in the first place." Databases, on the other hand, detect aggressively. PostgreSQL runs a waits-for cycle check on every lock wait that exceeds deadlock_timeout (default 1 second), picks a victim transaction, and aborts it. The application sees a serialization-failure error and is expected to retry. This is why ordinary application code built on a transactional database can get away with ignoring lock order: the database turns any deadlock into a retryable exception. At the kernel and systems-programming level you do not have that luxury -- you must prevent.

What This Concept Is​

Why It Matters Here​

Concrete Example​

Common Confusion / Misconception​

How To Use It​

Check Yourself​

Mini Drill or Application​

Detection and Recovery in Practice​

Read This Only If Stuck​