Module 4: Transactions & Consistency: Case Studies

These case studies turn transaction vocabulary into design judgment. Work through them by drawing histories, naming anomalies, choosing isolation levels, and writing retry or recovery rules. A transaction design is not complete until it says what can interleave, what must abort, what must be retried, and what happens when a participant or service fails halfway through.

How To Use These Case Studies

Draw the concurrent history or distributed workflow.
Name the exact guarantee involved: atomicity, isolation, durability, serializability, linearizability, or eventual consistency.
Mark the smallest point where the wrong design becomes unsafe.
Produce the required artifact.
Connect the case to a project transaction, checkout, booking, payment, inventory, or workflow path.

Case Study 1: Write Skew In An On-Call Scheduler

Scenario: A hospital scheduling app requires at least one doctor to remain on call. Two doctors are on call. Each doctor opens the app, sees the other doctor is still on call, and turns themselves off. Both transactions commit under snapshot-style isolation, leaving zero doctors on call.

-- T1 and T2 both run this shape
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM shifts WHERE day = '2026-05-08' AND on_call = true;
UPDATE shifts SET on_call = false WHERE doctor_id = $current_doctor;
COMMIT;

Source anchor: PostgreSQL documents that REPEATABLE READ sees a stable snapshot and can still allow serialization anomalies, while SERIALIZABLE monitors for conditions that would make concurrent execution inconsistent with any serial order and may abort a transaction. The PostgreSQL SSI notes explain how serializable snapshot isolation detects dangerous read/write conflict structures. See PostgreSQL transaction isolation, PostgreSQL SSI, and PostgreSQL serialization failure handling.

Module concepts:

snapshot isolation
write skew
predicate dependency
serializable isolation
invariant preservation
transaction retry

Wrong Approach

"Each transaction checks the invariant before it writes, so the invariant is protected."

The check is performed against each transaction's own snapshot. Neither transaction directly overwrites the other's row, so row-level write conflict detection is not enough.

Better Approach

Protect the invariant, not only the row:

Option A:
  Run the operation at SERIALIZABLE and retry serialization failures.

Option B:
  Materialize the invariant into a row that both transactions update.

Option C:
  Take an explicit lock around the schedule/day before changing on-call state.

Serializable isolation is often the cleanest learning answer: if the transaction is correct when run alone, the database either gives a serializable outcome or aborts one transaction.

Tradeoff Table

Choice	Gain	Cost
repeatable read / snapshot isolation	high concurrency, stable reads	write skew can violate cross-row invariants
serializable isolation	prevents serialization anomalies	transactions can abort and must be retried
materialized counter/guard row	simple conflict point	extra schema and update discipline
explicit schedule lock	clear critical section	lower concurrency and deadlock risk if inconsistent

Failure Mode

The app passes all single-user tests, then violates a real business invariant under concurrent use.

Required Artifact

Draw the history:

T1: read count=2
T2: read count=2
T1: set doctor A off
T2: set doctor B off
T1: commit
T2: commit
Invariant after commit:
Serializable order possible? yes/no:
Fix:

Project / Capstone Connection

Any capstone that enforces "at least one", "at most N", "cannot double-book", or "balance must remain nonnegative" needs an invariant-level transaction design.

Case Study 2: Lost Update In Inventory Reservation

Scenario: A commerce service has one unit of a rare item left. Two checkout workers both read stock = 1, both subtract one in application memory, and both write stock = 0. Two orders are accepted even though only one item exists.

BEGIN;
SELECT stock FROM inventory WHERE sku = 'GPU-1';
-- app decides stock > 0
UPDATE inventory SET stock = 0 WHERE sku = 'GPU-1';
INSERT INTO orders (...);
COMMIT;

Source anchor: PostgreSQL documents Read Committed as the default isolation level and explains that each statement sees a fresh snapshot. PostgreSQL explicit locking docs describe row-level locks such as SELECT ... FOR UPDATE, and the transaction isolation docs describe how stronger levels affect concurrent updates. See PostgreSQL transaction isolation and PostgreSQL explicit locking.

Module concepts:

lost update
read-modify-write
row locks
optimistic concurrency
atomic conditional update
idempotent order creation

Wrong Approach

"The transaction makes the checkout atomic."

Atomicity only says each transaction commits all-or-nothing. It does not automatically make a stale read-modify-write sequence safe.

Better Approach

Make the conflict visible to the database:

UPDATE inventory
SET stock = stock - 1
WHERE sku = 'GPU-1'
  AND stock > 0;

Then accept the order only if exactly one row was updated. Alternative designs include SELECT ... FOR UPDATE, an optimistic version column, or serializable isolation with retry.

Tradeoff Table

Choice	Gain	Cost
read then write literal value	easy to code	stale write can lose another transaction's effect
atomic conditional update	compact and safe for one row	less expressive for multi-row invariants
`SELECT ... FOR UPDATE`	clear lock on the row	waits under contention
optimistic version check	no blocking until write	caller must handle conflict result
serializable transaction	broad correctness model	retry path required

Failure Mode

The database has no negative stock and every transaction is atomic, but the business still oversells because two transactions wrote the same derived value.

Required Artifact

Write two safe checkout variants:

Variant 1: atomic conditional update
Variant 2: explicit lock or optimistic version
Conflict signal:
User-facing behavior:
Idempotency key:

Project / Capstone Connection

Checkout, booking, quota, seat assignment, and invite-code features should avoid application-memory read-modify-write unless they also define a conflict check.

Case Study 3: Serializable Retry Treated As A Random Error

Scenario: A team moves a high-value transfer workflow to serializable isolation. Under load, some transactions fail with SQLSTATE 40001. The service logs them as database errors and returns HTTP 500 to users.

Source anchor: PostgreSQL serialization failure docs say applications using Repeatable Read or Serializable must be prepared to retry complete transactions after serialization failures, and that SQLSTATE 40001 identifies serialization_failure. CockroachDB's transaction retry docs give the same operational lesson for serializable systems: the client may need to restart the whole transaction when automatic retry is not possible. See PostgreSQL serialization failure handling and CockroachDB transaction retry errors.

Module concepts:

serializable isolation
serialization failure
transaction retry
idempotency
contention
user-visible reliability

Wrong Approach

"A serialization failure means the database failed."

In a serializable system, aborting one transaction can be the mechanism that preserves correctness. The application must treat it as a retryable outcome, not a corruption event.

Better Approach

Retry the entire transaction, including the application logic that chose values and SQL statements:

retry loop:
  begin transaction
  read current state
  validate business rule
  write effects
  commit
  if SQLSTATE 40001:
    discard all results
    backoff
    retry from the beginning

The retry must be idempotent at the API boundary. For transfers, the request should carry an idempotency key so a user retry cannot create duplicate money movement.

Tradeoff Table

Choice	Gain	Cost
return 500 on `40001`	simple implementation	correctness mechanism becomes user-facing failure
retry only failed statement	less work	can reuse stale decisions incorrectly
retry whole transaction	correct under serializable aborts	needs idempotency and backoff
reduce contention	fewer retries	may require schema/query redesign

Failure Mode

The system is theoretically safer after moving to serializable isolation, but practically less reliable because the app does not implement the contract that serializable isolation requires.

Required Artifact

Write a retry policy:

Retryable SQLSTATEs:
Maximum attempts:
Backoff:
Idempotency key:
Telemetry fields:
User-visible final failure:
Contention-reduction follow-up:

Project / Capstone Connection

Any capstone claiming serializable transactions must include retry handling in the service layer, not only SET TRANSACTION ISOLATION LEVEL SERIALIZABLE.

Case Study 4: Deadlock From Inconsistent Lock Ordering

Scenario: A wallet service transfers funds between accounts. Worker A transfers from account 1 to account 2. Worker B transfers from account 2 to account 1. Each locks the source account first, then the destination account. Under load, deadlocks appear.

T1 locks account 1
T2 locks account 2
T1 waits for account 2
T2 waits for account 1

Source anchor: PostgreSQL explicit-locking docs describe deadlocks and recommend acquiring locks in a consistent order to prevent them. PostgreSQL serialization failure handling also notes SQLSTATE 40P01 for deadlock_detected and says retry may be advisable. See PostgreSQL explicit locking and PostgreSQL serialization failure handling.

Module concepts:

row locks
lock ordering
deadlock
wait-for graph
transfer invariant
retry after abort

Wrong Approach

"Deadlocks are database bugs; increase the timeout."

Deadlocks are often design bugs in lock acquisition order. A timeout only controls when the database gives up.

Better Approach

Use a deterministic lock order:

SELECT *
FROM accounts
WHERE id IN ($from, $to)
ORDER BY id
FOR UPDATE;

Then apply debit/credit logic after both account rows are locked. If the database still reports a deadlock, retry the whole transaction.

Tradeoff Table

Choice	Gain	Cost
lock source then destination	intuitive business order	cycles when transfers run opposite directions
lock by sorted account ID	prevents this deadlock class	business logic must separate lock order from transfer direction
broad table lock	simple correctness	destroys concurrency
retry deadlocks only	handles rare cases	does not fix frequent design-level deadlocks

Failure Mode

The service works in sequential tests and fails only when traffic contains symmetric operations.

Required Artifact

Draw a wait-for graph and rewrite the transfer transaction:

Nodes:
Edges:
Cycle?:
Deterministic lock order:
Retry behavior:
Metric to alert on:

Project / Capstone Connection

Any transfer, swap, match, or booking workflow that touches two resources needs a lock-order rule.

Case Study 5: Prepared Transaction Left In Doubt

Scenario: A legacy order system uses two-phase commit across an orders database and a billing database. The coordinator sends PREPARE to both participants and crashes before sending the final decision. One participant remains prepared for hours.

Source anchor: PostgreSQL two-phase transaction docs describe PREPARE TRANSACTION, COMMIT PREPARED, and ROLLBACK PREPARED, and warn that prepared transactions are intended to be short-lived even though external availability issues can leave them in the prepared state for an extended interval. Prepared transactions can be inspected through pg_prepared_xacts. See PostgreSQL two-phase transactions and PostgreSQL PREPARE TRANSACTION.

Module concepts:

atomic commit
two-phase commit
prepared transaction
coordinator failure
blocking
operational recovery

Wrong Approach

"Two-phase commit means distributed transactions are solved."

2PC gives atomic commit if the protocol can finish, but a coordinator failure can leave participants in doubt. The operational runbook is part of the design.

Better Approach

Limit and monitor prepared transactions:

Before using 2PC:
  confirm both participants support prepared state safely
  define coordinator durable decision log
  monitor in-doubt/prepared age
  define manual commit/rollback authority
  test crash points

If the workflow can tolerate eventual consistency, a saga or outbox pattern may be a better operational fit.

Tradeoff Table

Choice	Gain	Cost
2PC	atomic commit across participants	blocking and recovery complexity
local transaction + outbox	reliable event publication	downstream state is eventually consistent
saga	service autonomy and availability	compensation logic is application-specific
manual reconciliation	handles rare edge cases	operational burden and delayed correctness

Failure Mode

Prepared transactions hold resources, block cleanup, and require human judgment because the final decision was not durably recoverable by the coordinator.

Required Artifact

Write a 2PC failure matrix:

Crash before prepare:
Crash after one prepare:
Crash after all prepare:
Crash after one commit prepared:
Participant timeout behavior:
Decision log location:
Manual recovery command:
Alert threshold:

Project / Capstone Connection

If a capstone proposes distributed transactions, it must include the in-doubt recovery path, not only the happy path.

Case Study 6: Saga Compensation That Is Not Idempotent

Scenario: A travel booking workflow reserves a flight, reserves a hotel, and charges a payment method. The hotel reservation fails after the flight is reserved. The service calls cancelFlight, but a network timeout hides whether the cancellation succeeded. A retry calls cancelFlight again and creates a second refund.

Source anchor: Azure Architecture Center's Compensating Transaction pattern explains that eventually consistent multi-step operations need application-specific undo actions, that compensation can itself fail, and that compensation steps should be idempotent and resumable. The same guidance connects compensating transactions to saga-style workflows. See Azure Compensating Transaction pattern.

Module concepts:

saga
eventual consistency
compensation
idempotency
workflow state
timeout ambiguity

Wrong Approach

"A compensation is just the inverse API call."

The inverse call can fail, time out, or be applied twice. Compensation is a workflow with its own state and correctness rules.

Better Approach

Make every step and compensation idempotent:

Forward step:
  reserveFlight(booking_id, flight_id)

Compensation:
  cancelFlight(booking_id, flight_id, compensation_id)

State:
  pending -> reserved -> cancel_requested -> cancelled

Retries must use stable identifiers. The workflow must be resumable from persisted state, and some failures must route to manual review instead of looping forever.

Tradeoff Table

Choice	Gain	Cost
2PC across services	strong atomic outcome	tight coupling and availability cost
saga with compensation	service autonomy	temporary inconsistency and custom undo logic
retry until success	simple mental model	duplicate side effects without idempotency
workflow state machine	resumable and observable	more design and storage work

Failure Mode

The system avoids distributed transactions but creates worse financial inconsistency because compensation is neither idempotent nor observable.

Required Artifact

Design a saga state machine:

Forward steps:
Compensation steps:
Idempotency keys:
Timeout handling:
Retry policy:
Manual review states:
Audit events:
User-facing status:

Project / Capstone Connection

Any capstone using microservices for orders, payments, reservations, or provisioning should include a saga or outbox design with explicit compensation semantics.

Case Study 7: Global Consistency With A Real Clock Cost

Scenario: A globally distributed product wants users in Karachi, London, and New York to see account changes in a single externally consistent order. The team says "just use timestamps" and stores updated_at from each app server.

Source anchor: Google Cloud Spanner documentation explains that TrueTime exposes a bounded clock uncertainty interval and enables Spanner to assign transaction timestamps with external consistency. Spanner's default serializable behavior gives strict transaction ordering, but the system pays coordination and commit-wait costs to make real-time order meaningful. See Spanner TrueTime and external consistency.

Module concepts:

linearizability
external consistency
serializable transactions
clock uncertainty
commit timestamp
distributed coordination cost

Wrong Approach

"If every server writes a timestamp, later timestamps define the global truth."

Ordinary app-server clocks do not give a safe real-time ordering guarantee across machines and regions. Clock skew can make a later operation appear earlier.

Better Approach

Use a system that explicitly defines the consistency contract:

Need:
  externally consistent transactions across regions

Mechanism:
  bounded clock uncertainty
  commit timestamp assignment
  commit wait / coordination

Design question:
  Which workflows need this guarantee?
  Which workflows can use regional primary, causal/session consistency, or async replication?

Tradeoff Table

Choice	Gain	Cost
app-server timestamps	simple and cheap	no true global ordering guarantee
regional primary	clear ordering in one region	cross-region latency/failover tradeoff
externally consistent database	strong global transaction semantics	coordination, latency, and platform cost
eventual consistency + conflict policy	high availability and locality	app must handle conflicts and stale reads

Failure Mode

Audit logs, balances, or permissions appear to move backward across regions because the timestamp was treated as a consistency protocol.

Required Artifact

Write a consistency contract:

Operation:
Required ordering:
User-visible freshness:
Allowed stale window:
Conflict policy:
Chosen consistency model:
Latency budget:
Why weaker model is insufficient:

Project / Capstone Connection

If a capstone uses multi-region replication, do not say "consistent globally" unless the design names the mechanism and the latency/availability cost.

Source Map

Source	Use it for
PostgreSQL transaction isolation	isolation levels, snapshots, serializable behavior, anomalies
PostgreSQL SSI	serializable snapshot isolation and write-skew prevention
PostgreSQL serialization failure handling	retrying `40001`, deadlock retry considerations, whole-transaction retry
PostgreSQL explicit locking	row/table locks, deadlocks, lock ordering
PostgreSQL two-phase transactions	2PC support, prepared transactions, in-doubt state
PostgreSQL PREPARE TRANSACTION	prepared transaction lifecycle and recovery commands
CockroachDB transaction retry errors	serializable retry behavior in a distributed SQL system
Azure Compensating Transaction pattern	sagas, compensation, idempotency, resumable workflows
Spanner TrueTime and external consistency	external consistency, bounded clock uncertainty, global transaction ordering

Completion Standard

At least three case-study artifacts are completed.
At least one artifact draws a concurrent transaction history and names the anomaly.
At least one artifact includes a retry policy for 40001, deadlock, or unique-conflict behavior.
At least one artifact includes a lock-order or wait-for graph.
At least one artifact includes a distributed workflow failure matrix or saga state machine.
At least one case connects isolation choices to Module 3 replication/partitioning.
At least one case connects atomic commit or consistency models to Module 5 distributed-systems fundamentals.

How To Use These Case Studies​

Case Study 1: Write Skew In An On-Call Scheduler​

Wrong Approach​

Better Approach​

Tradeoff Table​

Failure Mode​

Required Artifact​

Project / Capstone Connection​

Case Study 2: Lost Update In Inventory Reservation​

Wrong Approach​

Better Approach​

Tradeoff Table​

Failure Mode​

Required Artifact​

Project / Capstone Connection​

Case Study 3: Serializable Retry Treated As A Random Error​

Wrong Approach​

Better Approach​

Tradeoff Table​

Failure Mode​

Required Artifact​

Project / Capstone Connection​

Case Study 4: Deadlock From Inconsistent Lock Ordering​

Wrong Approach​

Better Approach​

Tradeoff Table​

Failure Mode​

Required Artifact​

Project / Capstone Connection​

Case Study 5: Prepared Transaction Left In Doubt​

Wrong Approach​

Better Approach​

Tradeoff Table​

Failure Mode​

Required Artifact​

Project / Capstone Connection​

Case Study 6: Saga Compensation That Is Not Idempotent​

Wrong Approach​

Better Approach​

Tradeoff Table​

Failure Mode​

Required Artifact​

Project / Capstone Connection​

Case Study 7: Global Consistency With A Real Clock Cost​

Wrong Approach​

Better Approach​

Tradeoff Table​

Failure Mode​

Required Artifact​

Project / Capstone Connection​

Source Map​

Completion Standard​

How To Use These Case Studies

Case Study 1: Write Skew In An On-Call Scheduler

Wrong Approach

Better Approach

Tradeoff Table

Failure Mode

Required Artifact

Project / Capstone Connection

Case Study 2: Lost Update In Inventory Reservation

Wrong Approach

Better Approach

Tradeoff Table

Failure Mode

Required Artifact

Project / Capstone Connection

Case Study 3: Serializable Retry Treated As A Random Error

Wrong Approach

Better Approach

Tradeoff Table

Failure Mode

Required Artifact

Project / Capstone Connection

Case Study 4: Deadlock From Inconsistent Lock Ordering

Wrong Approach

Better Approach

Tradeoff Table

Failure Mode

Required Artifact

Project / Capstone Connection

Case Study 5: Prepared Transaction Left In Doubt

Wrong Approach

Better Approach

Tradeoff Table

Failure Mode

Required Artifact

Project / Capstone Connection

Case Study 6: Saga Compensation That Is Not Idempotent

Wrong Approach

Better Approach

Tradeoff Table

Failure Mode

Required Artifact

Project / Capstone Connection

Case Study 7: Global Consistency With A Real Clock Cost

Wrong Approach

Better Approach

Tradeoff Table

Failure Mode

Required Artifact

Project / Capstone Connection

Source Map

Completion Standard