Module 4: Transactions & Consistency: Case Studies
These case studies turn transaction vocabulary into design judgment. Work through them by drawing histories, naming anomalies, choosing isolation levels, and writing retry or recovery rules. A transaction design is not complete until it says what can interleave, what must abort, what must be retried, and what happens when a participant or service fails halfway through.
How To Use These Case Studies
- Draw the concurrent history or distributed workflow.
- Name the exact guarantee involved: atomicity, isolation, durability, serializability, linearizability, or eventual consistency.
- Mark the smallest point where the wrong design becomes unsafe.
- Produce the required artifact.
- Connect the case to a project transaction, checkout, booking, payment, inventory, or workflow path.
Case Study 1: Write Skew In An On-Call Scheduler
Scenario: A hospital scheduling app requires at least one doctor to remain on call. Two doctors are on call. Each doctor opens the app, sees the other doctor is still on call, and turns themselves off. Both transactions commit under snapshot-style isolation, leaving zero doctors on call.
-- T1 and T2 both run this shape
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM shifts WHERE day = '2026-05-08' AND on_call = true;
UPDATE shifts SET on_call = false WHERE doctor_id = $current_doctor;
COMMIT;
Source anchor: PostgreSQL documents that REPEATABLE READ sees a stable snapshot and can still allow serialization anomalies, while SERIALIZABLE monitors for conditions that would make concurrent execution inconsistent with any serial order and may abort a transaction. The PostgreSQL SSI notes explain how serializable snapshot isolation detects dangerous read/write conflict structures. See PostgreSQL transaction isolation, PostgreSQL SSI, and PostgreSQL serialization failure handling.
Module concepts:
- snapshot isolation
- write skew
- predicate dependency
- serializable isolation
- invariant preservation
- transaction retry
Wrong Approach
"Each transaction checks the invariant before it writes, so the invariant is protected."
The check is performed against each transaction's own snapshot. Neither transaction directly overwrites the other's row, so row-level write conflict detection is not enough.
Better Approach
Protect the invariant, not only the row:
Option A:
Run the operation at SERIALIZABLE and retry serialization failures.
Option B:
Materialize the invariant into a row that both transactions update.
Option C:
Take an explicit lock around the schedule/day before changing on-call state.
Serializable isolation is often the cleanest learning answer: if the transaction is correct when run alone, the database either gives a serializable outcome or aborts one transaction.
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| repeatable read / snapshot isolation | high concurrency, stable reads | write skew can violate cross-row invariants |
| serializable isolation | prevents serialization anomalies | transactions can abort and must be retried |
| materialized counter/guard row | simple conflict point | extra schema and update discipline |
| explicit schedule lock | clear critical section | lower concurrency and deadlock risk if inconsistent |
Failure Mode
The app passes all single-user tests, then violates a real business invariant under concurrent use.
Required Artifact
Draw the history:
T1: read count=2
T2: read count=2
T1: set doctor A off
T2: set doctor B off
T1: commit
T2: commit
Invariant after commit:
Serializable order possible? yes/no:
Fix:
Project / Capstone Connection
Any capstone that enforces "at least one", "at most N", "cannot double-book", or "balance must remain nonnegative" needs an invariant-level transaction design.
Case Study 2: Lost Update In Inventory Reservation
Scenario: A commerce service has one unit of a rare item left. Two checkout workers both read stock = 1, both subtract one in application memory, and both write stock = 0. Two orders are accepted even though only one item exists.
BEGIN;
SELECT stock FROM inventory WHERE sku = 'GPU-1';
-- app decides stock > 0
UPDATE inventory SET stock = 0 WHERE sku = 'GPU-1';
INSERT INTO orders (...);
COMMIT;
Source anchor: PostgreSQL documents Read Committed as the default isolation level and explains that each statement sees a fresh snapshot. PostgreSQL explicit locking docs describe row-level locks such as SELECT ... FOR UPDATE, and the transaction isolation docs describe how stronger levels affect concurrent updates. See PostgreSQL transaction isolation and PostgreSQL explicit locking.
Module concepts:
- lost update
- read-modify-write
- row locks
- optimistic concurrency
- atomic conditional update
- idempotent order creation
Wrong Approach
"The transaction makes the checkout atomic."
Atomicity only says each transaction commits all-or-nothing. It does not automatically make a stale read-modify-write sequence safe.
Better Approach
Make the conflict visible to the database:
UPDATE inventory
SET stock = stock - 1
WHERE sku = 'GPU-1'
AND stock > 0;
Then accept the order only if exactly one row was updated. Alternative designs include SELECT ... FOR UPDATE, an optimistic version column, or serializable isolation with retry.
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| read then write literal value | easy to code | stale write can lose another transaction's effect |
| atomic conditional update | compact and safe for one row | less expressive for multi-row invariants |
SELECT ... FOR UPDATE | clear lock on the row | waits under contention |
| optimistic version check | no blocking until write | caller must handle conflict result |
| serializable transaction | broad correctness model | retry path required |
Failure Mode
The database has no negative stock and every transaction is atomic, but the business still oversells because two transactions wrote the same derived value.
Required Artifact
Write two safe checkout variants:
Variant 1: atomic conditional update
Variant 2: explicit lock or optimistic version
Conflict signal:
User-facing behavior:
Idempotency key:
Project / Capstone Connection
Checkout, booking, quota, seat assignment, and invite-code features should avoid application-memory read-modify-write unless they also define a conflict check.
Case Study 3: Serializable Retry Treated As A Random Error
Scenario: A team moves a high-value transfer workflow to serializable isolation. Under load, some transactions fail with SQLSTATE 40001. The service logs them as database errors and returns HTTP 500 to users.
Source anchor: PostgreSQL serialization failure docs say applications using Repeatable Read or Serializable must be prepared to retry complete transactions after serialization failures, and that SQLSTATE 40001 identifies serialization_failure. CockroachDB's transaction retry docs give the same operational lesson for serializable systems: the client may need to restart the whole transaction when automatic retry is not possible. See PostgreSQL serialization failure handling and CockroachDB transaction retry errors.
Module concepts:
- serializable isolation
- serialization failure
- transaction retry
- idempotency
- contention
- user-visible reliability
Wrong Approach
"A serialization failure means the database failed."
In a serializable system, aborting one transaction can be the mechanism that preserves correctness. The application must treat it as a retryable outcome, not a corruption event.
Better Approach
Retry the entire transaction, including the application logic that chose values and SQL statements:
retry loop:
begin transaction
read current state
validate business rule
write effects
commit
if SQLSTATE 40001:
discard all results
backoff
retry from the beginning
The retry must be idempotent at the API boundary. For transfers, the request should carry an idempotency key so a user retry cannot create duplicate money movement.
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
return 500 on 40001 | simple implementation | correctness mechanism becomes user-facing failure |
| retry only failed statement | less work | can reuse stale decisions incorrectly |
| retry whole transaction | correct under serializable aborts | needs idempotency and backoff |
| reduce contention | fewer retries | may require schema/query redesign |
Failure Mode
The system is theoretically safer after moving to serializable isolation, but practically less reliable because the app does not implement the contract that serializable isolation requires.
Required Artifact
Write a retry policy:
Retryable SQLSTATEs:
Maximum attempts:
Backoff:
Idempotency key:
Telemetry fields:
User-visible final failure:
Contention-reduction follow-up:
Project / Capstone Connection
Any capstone claiming serializable transactions must include retry handling in the service layer, not only SET TRANSACTION ISOLATION LEVEL SERIALIZABLE.
Case Study 4: Deadlock From Inconsistent Lock Ordering
Scenario: A wallet service transfers funds between accounts. Worker A transfers from account 1 to account 2. Worker B transfers from account 2 to account 1. Each locks the source account first, then the destination account. Under load, deadlocks appear.
T1 locks account 1
T2 locks account 2
T1 waits for account 2
T2 waits for account 1
Source anchor: PostgreSQL explicit-locking docs describe deadlocks and recommend acquiring locks in a consistent order to prevent them. PostgreSQL serialization failure handling also notes SQLSTATE 40P01 for deadlock_detected and says retry may be advisable. See PostgreSQL explicit locking and PostgreSQL serialization failure handling.
Module concepts:
- row locks
- lock ordering
- deadlock
- wait-for graph
- transfer invariant
- retry after abort
Wrong Approach
"Deadlocks are database bugs; increase the timeout."
Deadlocks are often design bugs in lock acquisition order. A timeout only controls when the database gives up.
Better Approach
Use a deterministic lock order:
SELECT *
FROM accounts
WHERE id IN ($from, $to)
ORDER BY id
FOR UPDATE;
Then apply debit/credit logic after both account rows are locked. If the database still reports a deadlock, retry the whole transaction.
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| lock source then destination | intuitive business order | cycles when transfers run opposite directions |
| lock by sorted account ID | prevents this deadlock class | business logic must separate lock order from transfer direction |
| broad table lock | simple correctness | destroys concurrency |
| retry deadlocks only | handles rare cases | does not fix frequent design-level deadlocks |
Failure Mode
The service works in sequential tests and fails only when traffic contains symmetric operations.
Required Artifact
Draw a wait-for graph and rewrite the transfer transaction:
Nodes:
Edges:
Cycle?:
Deterministic lock order:
Retry behavior:
Metric to alert on:
Project / Capstone Connection
Any transfer, swap, match, or booking workflow that touches two resources needs a lock-order rule.
Case Study 5: Prepared Transaction Left In Doubt
Scenario: A legacy order system uses two-phase commit across an orders database and a billing database. The coordinator sends PREPARE to both participants and crashes before sending the final decision. One participant remains prepared for hours.
Source anchor: PostgreSQL two-phase transaction docs describe PREPARE TRANSACTION, COMMIT PREPARED, and ROLLBACK PREPARED, and warn that prepared transactions are intended to be short-lived even though external availability issues can leave them in the prepared state for an extended interval. Prepared transactions can be inspected through pg_prepared_xacts. See PostgreSQL two-phase transactions and PostgreSQL PREPARE TRANSACTION.
Module concepts:
- atomic commit
- two-phase commit
- prepared transaction
- coordinator failure
- blocking
- operational recovery
Wrong Approach
"Two-phase commit means distributed transactions are solved."
2PC gives atomic commit if the protocol can finish, but a coordinator failure can leave participants in doubt. The operational runbook is part of the design.
Better Approach
Limit and monitor prepared transactions:
Before using 2PC:
confirm both participants support prepared state safely
define coordinator durable decision log
monitor in-doubt/prepared age
define manual commit/rollback authority
test crash points
If the workflow can tolerate eventual consistency, a saga or outbox pattern may be a better operational fit.
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| 2PC | atomic commit across participants | blocking and recovery complexity |
| local transaction + outbox | reliable event publication | downstream state is eventually consistent |
| saga | service autonomy and availability | compensation logic is application-specific |
| manual reconciliation | handles rare edge cases | operational burden and delayed correctness |
Failure Mode
Prepared transactions hold resources, block cleanup, and require human judgment because the final decision was not durably recoverable by the coordinator.
Required Artifact
Write a 2PC failure matrix:
Crash before prepare:
Crash after one prepare:
Crash after all prepare:
Crash after one commit prepared:
Participant timeout behavior:
Decision log location:
Manual recovery command:
Alert threshold:
Project / Capstone Connection
If a capstone proposes distributed transactions, it must include the in-doubt recovery path, not only the happy path.
Case Study 6: Saga Compensation That Is Not Idempotent
Scenario: A travel booking workflow reserves a flight, reserves a hotel, and charges a payment method. The hotel reservation fails after the flight is reserved. The service calls cancelFlight, but a network timeout hides whether the cancellation succeeded. A retry calls cancelFlight again and creates a second refund.
Source anchor: Azure Architecture Center's Compensating Transaction pattern explains that eventually consistent multi-step operations need application-specific undo actions, that compensation can itself fail, and that compensation steps should be idempotent and resumable. The same guidance connects compensating transactions to saga-style workflows. See Azure Compensating Transaction pattern.
Module concepts:
- saga
- eventual consistency
- compensation
- idempotency
- workflow state
- timeout ambiguity
Wrong Approach
"A compensation is just the inverse API call."
The inverse call can fail, time out, or be applied twice. Compensation is a workflow with its own state and correctness rules.
Better Approach
Make every step and compensation idempotent:
Forward step:
reserveFlight(booking_id, flight_id)
Compensation:
cancelFlight(booking_id, flight_id, compensation_id)
State:
pending -> reserved -> cancel_requested -> cancelled
Retries must use stable identifiers. The workflow must be resumable from persisted state, and some failures must route to manual review instead of looping forever.
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| 2PC across services | strong atomic outcome | tight coupling and availability cost |
| saga with compensation | service autonomy | temporary inconsistency and custom undo logic |
| retry until success | simple mental model | duplicate side effects without idempotency |
| workflow state machine | resumable and observable | more design and storage work |
Failure Mode
The system avoids distributed transactions but creates worse financial inconsistency because compensation is neither idempotent nor observable.
Required Artifact
Design a saga state machine:
Forward steps:
Compensation steps:
Idempotency keys:
Timeout handling:
Retry policy:
Manual review states:
Audit events:
User-facing status:
Project / Capstone Connection
Any capstone using microservices for orders, payments, reservations, or provisioning should include a saga or outbox design with explicit compensation semantics.
Case Study 7: Global Consistency With A Real Clock Cost
Scenario: A globally distributed product wants users in Karachi, London, and New York to see account changes in a single externally consistent order. The team says "just use timestamps" and stores updated_at from each app server.
Source anchor: Google Cloud Spanner documentation explains that TrueTime exposes a bounded clock uncertainty interval and enables Spanner to assign transaction timestamps with external consistency. Spanner's default serializable behavior gives strict transaction ordering, but the system pays coordination and commit-wait costs to make real-time order meaningful. See Spanner TrueTime and external consistency.
Module concepts:
- linearizability
- external consistency
- serializable transactions
- clock uncertainty
- commit timestamp
- distributed coordination cost
Wrong Approach
"If every server writes a timestamp, later timestamps define the global truth."
Ordinary app-server clocks do not give a safe real-time ordering guarantee across machines and regions. Clock skew can make a later operation appear earlier.
Better Approach
Use a system that explicitly defines the consistency contract:
Need:
externally consistent transactions across regions
Mechanism:
bounded clock uncertainty
commit timestamp assignment
commit wait / coordination
Design question:
Which workflows need this guarantee?
Which workflows can use regional primary, causal/session consistency, or async replication?
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| app-server timestamps | simple and cheap | no true global ordering guarantee |
| regional primary | clear ordering in one region | cross-region latency/failover tradeoff |
| externally consistent database | strong global transaction semantics | coordination, latency, and platform cost |
| eventual consistency + conflict policy | high availability and locality | app must handle conflicts and stale reads |
Failure Mode
Audit logs, balances, or permissions appear to move backward across regions because the timestamp was treated as a consistency protocol.
Required Artifact
Write a consistency contract:
Operation:
Required ordering:
User-visible freshness:
Allowed stale window:
Conflict policy:
Chosen consistency model:
Latency budget:
Why weaker model is insufficient:
Project / Capstone Connection
If a capstone uses multi-region replication, do not say "consistent globally" unless the design names the mechanism and the latency/availability cost.
Source Map
| Source | Use it for |
|---|---|
| PostgreSQL transaction isolation | isolation levels, snapshots, serializable behavior, anomalies |
| PostgreSQL SSI | serializable snapshot isolation and write-skew prevention |
| PostgreSQL serialization failure handling | retrying 40001, deadlock retry considerations, whole-transaction retry |
| PostgreSQL explicit locking | row/table locks, deadlocks, lock ordering |
| PostgreSQL two-phase transactions | 2PC support, prepared transactions, in-doubt state |
| PostgreSQL PREPARE TRANSACTION | prepared transaction lifecycle and recovery commands |
| CockroachDB transaction retry errors | serializable retry behavior in a distributed SQL system |
| Azure Compensating Transaction pattern | sagas, compensation, idempotency, resumable workflows |
| Spanner TrueTime and external consistency | external consistency, bounded clock uncertainty, global transaction ordering |
Completion Standard
- At least three case-study artifacts are completed.
- At least one artifact draws a concurrent transaction history and names the anomaly.
- At least one artifact includes a retry policy for
40001, deadlock, or unique-conflict behavior. - At least one artifact includes a lock-order or wait-for graph.
- At least one artifact includes a distributed workflow failure matrix or saga state machine.
- At least one case connects isolation choices to Module 3 replication/partitioning.
- At least one case connects atomic commit or consistency models to Module 5 distributed-systems fundamentals.