Skip to main content

Module 4: Transactions & Consistency: Case Studies

These case studies turn transaction vocabulary into design judgment. Work through them by drawing histories, naming anomalies, choosing isolation levels, and writing retry or recovery rules. A transaction design is not complete until it says what can interleave, what must abort, what must be retried, and what happens when a participant or service fails halfway through.


How To Use These Case Studies

  1. Draw the concurrent history or distributed workflow.
  2. Name the exact guarantee involved: atomicity, isolation, durability, serializability, linearizability, or eventual consistency.
  3. Mark the smallest point where the wrong design becomes unsafe.
  4. Produce the required artifact.
  5. Connect the case to a project transaction, checkout, booking, payment, inventory, or workflow path.

Case Study 1: Write Skew In An On-Call Scheduler

Scenario: A hospital scheduling app requires at least one doctor to remain on call. Two doctors are on call. Each doctor opens the app, sees the other doctor is still on call, and turns themselves off. Both transactions commit under snapshot-style isolation, leaving zero doctors on call.

-- T1 and T2 both run this shape
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM shifts WHERE day = '2026-05-08' AND on_call = true;
UPDATE shifts SET on_call = false WHERE doctor_id = $current_doctor;
COMMIT;

Source anchor: PostgreSQL documents that REPEATABLE READ sees a stable snapshot and can still allow serialization anomalies, while SERIALIZABLE monitors for conditions that would make concurrent execution inconsistent with any serial order and may abort a transaction. The PostgreSQL SSI notes explain how serializable snapshot isolation detects dangerous read/write conflict structures. See PostgreSQL transaction isolation, PostgreSQL SSI, and PostgreSQL serialization failure handling.

Module concepts:

  • snapshot isolation
  • write skew
  • predicate dependency
  • serializable isolation
  • invariant preservation
  • transaction retry

Wrong Approach

"Each transaction checks the invariant before it writes, so the invariant is protected."

The check is performed against each transaction's own snapshot. Neither transaction directly overwrites the other's row, so row-level write conflict detection is not enough.

Better Approach

Protect the invariant, not only the row:

Option A:
Run the operation at SERIALIZABLE and retry serialization failures.

Option B:
Materialize the invariant into a row that both transactions update.

Option C:
Take an explicit lock around the schedule/day before changing on-call state.

Serializable isolation is often the cleanest learning answer: if the transaction is correct when run alone, the database either gives a serializable outcome or aborts one transaction.

Tradeoff Table

ChoiceGainCost
repeatable read / snapshot isolationhigh concurrency, stable readswrite skew can violate cross-row invariants
serializable isolationprevents serialization anomaliestransactions can abort and must be retried
materialized counter/guard rowsimple conflict pointextra schema and update discipline
explicit schedule lockclear critical sectionlower concurrency and deadlock risk if inconsistent

Failure Mode

The app passes all single-user tests, then violates a real business invariant under concurrent use.

Required Artifact

Draw the history:

T1: read count=2
T2: read count=2
T1: set doctor A off
T2: set doctor B off
T1: commit
T2: commit
Invariant after commit:
Serializable order possible? yes/no:
Fix:

Project / Capstone Connection

Any capstone that enforces "at least one", "at most N", "cannot double-book", or "balance must remain nonnegative" needs an invariant-level transaction design.


Case Study 2: Lost Update In Inventory Reservation

Scenario: A commerce service has one unit of a rare item left. Two checkout workers both read stock = 1, both subtract one in application memory, and both write stock = 0. Two orders are accepted even though only one item exists.

BEGIN;
SELECT stock FROM inventory WHERE sku = 'GPU-1';
-- app decides stock > 0
UPDATE inventory SET stock = 0 WHERE sku = 'GPU-1';
INSERT INTO orders (...);
COMMIT;

Source anchor: PostgreSQL documents Read Committed as the default isolation level and explains that each statement sees a fresh snapshot. PostgreSQL explicit locking docs describe row-level locks such as SELECT ... FOR UPDATE, and the transaction isolation docs describe how stronger levels affect concurrent updates. See PostgreSQL transaction isolation and PostgreSQL explicit locking.

Module concepts:

  • lost update
  • read-modify-write
  • row locks
  • optimistic concurrency
  • atomic conditional update
  • idempotent order creation

Wrong Approach

"The transaction makes the checkout atomic."

Atomicity only says each transaction commits all-or-nothing. It does not automatically make a stale read-modify-write sequence safe.

Better Approach

Make the conflict visible to the database:

UPDATE inventory
SET stock = stock - 1
WHERE sku = 'GPU-1'
AND stock > 0;

Then accept the order only if exactly one row was updated. Alternative designs include SELECT ... FOR UPDATE, an optimistic version column, or serializable isolation with retry.

Tradeoff Table

ChoiceGainCost
read then write literal valueeasy to codestale write can lose another transaction's effect
atomic conditional updatecompact and safe for one rowless expressive for multi-row invariants
SELECT ... FOR UPDATEclear lock on the rowwaits under contention
optimistic version checkno blocking until writecaller must handle conflict result
serializable transactionbroad correctness modelretry path required

Failure Mode

The database has no negative stock and every transaction is atomic, but the business still oversells because two transactions wrote the same derived value.

Required Artifact

Write two safe checkout variants:

Variant 1: atomic conditional update
Variant 2: explicit lock or optimistic version
Conflict signal:
User-facing behavior:
Idempotency key:

Project / Capstone Connection

Checkout, booking, quota, seat assignment, and invite-code features should avoid application-memory read-modify-write unless they also define a conflict check.


Case Study 3: Serializable Retry Treated As A Random Error

Scenario: A team moves a high-value transfer workflow to serializable isolation. Under load, some transactions fail with SQLSTATE 40001. The service logs them as database errors and returns HTTP 500 to users.

Source anchor: PostgreSQL serialization failure docs say applications using Repeatable Read or Serializable must be prepared to retry complete transactions after serialization failures, and that SQLSTATE 40001 identifies serialization_failure. CockroachDB's transaction retry docs give the same operational lesson for serializable systems: the client may need to restart the whole transaction when automatic retry is not possible. See PostgreSQL serialization failure handling and CockroachDB transaction retry errors.

Module concepts:

  • serializable isolation
  • serialization failure
  • transaction retry
  • idempotency
  • contention
  • user-visible reliability

Wrong Approach

"A serialization failure means the database failed."

In a serializable system, aborting one transaction can be the mechanism that preserves correctness. The application must treat it as a retryable outcome, not a corruption event.

Better Approach

Retry the entire transaction, including the application logic that chose values and SQL statements:

retry loop:
begin transaction
read current state
validate business rule
write effects
commit
if SQLSTATE 40001:
discard all results
backoff
retry from the beginning

The retry must be idempotent at the API boundary. For transfers, the request should carry an idempotency key so a user retry cannot create duplicate money movement.

Tradeoff Table

ChoiceGainCost
return 500 on 40001simple implementationcorrectness mechanism becomes user-facing failure
retry only failed statementless workcan reuse stale decisions incorrectly
retry whole transactioncorrect under serializable abortsneeds idempotency and backoff
reduce contentionfewer retriesmay require schema/query redesign

Failure Mode

The system is theoretically safer after moving to serializable isolation, but practically less reliable because the app does not implement the contract that serializable isolation requires.

Required Artifact

Write a retry policy:

Retryable SQLSTATEs:
Maximum attempts:
Backoff:
Idempotency key:
Telemetry fields:
User-visible final failure:
Contention-reduction follow-up:

Project / Capstone Connection

Any capstone claiming serializable transactions must include retry handling in the service layer, not only SET TRANSACTION ISOLATION LEVEL SERIALIZABLE.


Case Study 4: Deadlock From Inconsistent Lock Ordering

Scenario: A wallet service transfers funds between accounts. Worker A transfers from account 1 to account 2. Worker B transfers from account 2 to account 1. Each locks the source account first, then the destination account. Under load, deadlocks appear.

T1 locks account 1
T2 locks account 2
T1 waits for account 2
T2 waits for account 1

Source anchor: PostgreSQL explicit-locking docs describe deadlocks and recommend acquiring locks in a consistent order to prevent them. PostgreSQL serialization failure handling also notes SQLSTATE 40P01 for deadlock_detected and says retry may be advisable. See PostgreSQL explicit locking and PostgreSQL serialization failure handling.

Module concepts:

  • row locks
  • lock ordering
  • deadlock
  • wait-for graph
  • transfer invariant
  • retry after abort

Wrong Approach

"Deadlocks are database bugs; increase the timeout."

Deadlocks are often design bugs in lock acquisition order. A timeout only controls when the database gives up.

Better Approach

Use a deterministic lock order:

SELECT *
FROM accounts
WHERE id IN ($from, $to)
ORDER BY id
FOR UPDATE;

Then apply debit/credit logic after both account rows are locked. If the database still reports a deadlock, retry the whole transaction.

Tradeoff Table

ChoiceGainCost
lock source then destinationintuitive business ordercycles when transfers run opposite directions
lock by sorted account IDprevents this deadlock classbusiness logic must separate lock order from transfer direction
broad table locksimple correctnessdestroys concurrency
retry deadlocks onlyhandles rare casesdoes not fix frequent design-level deadlocks

Failure Mode

The service works in sequential tests and fails only when traffic contains symmetric operations.

Required Artifact

Draw a wait-for graph and rewrite the transfer transaction:

Nodes:
Edges:
Cycle?:
Deterministic lock order:
Retry behavior:
Metric to alert on:

Project / Capstone Connection

Any transfer, swap, match, or booking workflow that touches two resources needs a lock-order rule.


Case Study 5: Prepared Transaction Left In Doubt

Scenario: A legacy order system uses two-phase commit across an orders database and a billing database. The coordinator sends PREPARE to both participants and crashes before sending the final decision. One participant remains prepared for hours.

Source anchor: PostgreSQL two-phase transaction docs describe PREPARE TRANSACTION, COMMIT PREPARED, and ROLLBACK PREPARED, and warn that prepared transactions are intended to be short-lived even though external availability issues can leave them in the prepared state for an extended interval. Prepared transactions can be inspected through pg_prepared_xacts. See PostgreSQL two-phase transactions and PostgreSQL PREPARE TRANSACTION.

Module concepts:

  • atomic commit
  • two-phase commit
  • prepared transaction
  • coordinator failure
  • blocking
  • operational recovery

Wrong Approach

"Two-phase commit means distributed transactions are solved."

2PC gives atomic commit if the protocol can finish, but a coordinator failure can leave participants in doubt. The operational runbook is part of the design.

Better Approach

Limit and monitor prepared transactions:

Before using 2PC:
confirm both participants support prepared state safely
define coordinator durable decision log
monitor in-doubt/prepared age
define manual commit/rollback authority
test crash points

If the workflow can tolerate eventual consistency, a saga or outbox pattern may be a better operational fit.

Tradeoff Table

ChoiceGainCost
2PCatomic commit across participantsblocking and recovery complexity
local transaction + outboxreliable event publicationdownstream state is eventually consistent
sagaservice autonomy and availabilitycompensation logic is application-specific
manual reconciliationhandles rare edge casesoperational burden and delayed correctness

Failure Mode

Prepared transactions hold resources, block cleanup, and require human judgment because the final decision was not durably recoverable by the coordinator.

Required Artifact

Write a 2PC failure matrix:

Crash before prepare:
Crash after one prepare:
Crash after all prepare:
Crash after one commit prepared:
Participant timeout behavior:
Decision log location:
Manual recovery command:
Alert threshold:

Project / Capstone Connection

If a capstone proposes distributed transactions, it must include the in-doubt recovery path, not only the happy path.


Case Study 6: Saga Compensation That Is Not Idempotent

Scenario: A travel booking workflow reserves a flight, reserves a hotel, and charges a payment method. The hotel reservation fails after the flight is reserved. The service calls cancelFlight, but a network timeout hides whether the cancellation succeeded. A retry calls cancelFlight again and creates a second refund.

Source anchor: Azure Architecture Center's Compensating Transaction pattern explains that eventually consistent multi-step operations need application-specific undo actions, that compensation can itself fail, and that compensation steps should be idempotent and resumable. The same guidance connects compensating transactions to saga-style workflows. See Azure Compensating Transaction pattern.

Module concepts:

  • saga
  • eventual consistency
  • compensation
  • idempotency
  • workflow state
  • timeout ambiguity

Wrong Approach

"A compensation is just the inverse API call."

The inverse call can fail, time out, or be applied twice. Compensation is a workflow with its own state and correctness rules.

Better Approach

Make every step and compensation idempotent:

Forward step:
reserveFlight(booking_id, flight_id)

Compensation:
cancelFlight(booking_id, flight_id, compensation_id)

State:
pending -> reserved -> cancel_requested -> cancelled

Retries must use stable identifiers. The workflow must be resumable from persisted state, and some failures must route to manual review instead of looping forever.

Tradeoff Table

ChoiceGainCost
2PC across servicesstrong atomic outcometight coupling and availability cost
saga with compensationservice autonomytemporary inconsistency and custom undo logic
retry until successsimple mental modelduplicate side effects without idempotency
workflow state machineresumable and observablemore design and storage work

Failure Mode

The system avoids distributed transactions but creates worse financial inconsistency because compensation is neither idempotent nor observable.

Required Artifact

Design a saga state machine:

Forward steps:
Compensation steps:
Idempotency keys:
Timeout handling:
Retry policy:
Manual review states:
Audit events:
User-facing status:

Project / Capstone Connection

Any capstone using microservices for orders, payments, reservations, or provisioning should include a saga or outbox design with explicit compensation semantics.


Case Study 7: Global Consistency With A Real Clock Cost

Scenario: A globally distributed product wants users in Karachi, London, and New York to see account changes in a single externally consistent order. The team says "just use timestamps" and stores updated_at from each app server.

Source anchor: Google Cloud Spanner documentation explains that TrueTime exposes a bounded clock uncertainty interval and enables Spanner to assign transaction timestamps with external consistency. Spanner's default serializable behavior gives strict transaction ordering, but the system pays coordination and commit-wait costs to make real-time order meaningful. See Spanner TrueTime and external consistency.

Module concepts:

  • linearizability
  • external consistency
  • serializable transactions
  • clock uncertainty
  • commit timestamp
  • distributed coordination cost

Wrong Approach

"If every server writes a timestamp, later timestamps define the global truth."

Ordinary app-server clocks do not give a safe real-time ordering guarantee across machines and regions. Clock skew can make a later operation appear earlier.

Better Approach

Use a system that explicitly defines the consistency contract:

Need:
externally consistent transactions across regions

Mechanism:
bounded clock uncertainty
commit timestamp assignment
commit wait / coordination

Design question:
Which workflows need this guarantee?
Which workflows can use regional primary, causal/session consistency, or async replication?

Tradeoff Table

ChoiceGainCost
app-server timestampssimple and cheapno true global ordering guarantee
regional primaryclear ordering in one regioncross-region latency/failover tradeoff
externally consistent databasestrong global transaction semanticscoordination, latency, and platform cost
eventual consistency + conflict policyhigh availability and localityapp must handle conflicts and stale reads

Failure Mode

Audit logs, balances, or permissions appear to move backward across regions because the timestamp was treated as a consistency protocol.

Required Artifact

Write a consistency contract:

Operation:
Required ordering:
User-visible freshness:
Allowed stale window:
Conflict policy:
Chosen consistency model:
Latency budget:
Why weaker model is insufficient:

Project / Capstone Connection

If a capstone uses multi-region replication, do not say "consistent globally" unless the design names the mechanism and the latency/availability cost.


Source Map

SourceUse it for
PostgreSQL transaction isolationisolation levels, snapshots, serializable behavior, anomalies
PostgreSQL SSIserializable snapshot isolation and write-skew prevention
PostgreSQL serialization failure handlingretrying 40001, deadlock retry considerations, whole-transaction retry
PostgreSQL explicit lockingrow/table locks, deadlocks, lock ordering
PostgreSQL two-phase transactions2PC support, prepared transactions, in-doubt state
PostgreSQL PREPARE TRANSACTIONprepared transaction lifecycle and recovery commands
CockroachDB transaction retry errorsserializable retry behavior in a distributed SQL system
Azure Compensating Transaction patternsagas, compensation, idempotency, resumable workflows
Spanner TrueTime and external consistencyexternal consistency, bounded clock uncertainty, global transaction ordering

Completion Standard

  • At least three case-study artifacts are completed.
  • At least one artifact draws a concurrent transaction history and names the anomaly.
  • At least one artifact includes a retry policy for 40001, deadlock, or unique-conflict behavior.
  • At least one artifact includes a lock-order or wait-for graph.
  • At least one artifact includes a distributed workflow failure matrix or saga state machine.
  • At least one case connects isolation choices to Module 3 replication/partitioning.
  • At least one case connects atomic commit or consistency models to Module 5 distributed-systems fundamentals.