Saga Design Workshop

Pick one workflow and design a production-plausible saga. The deliverable is a table of forward steps and compensations, plus a written failure runbook.

Retrieval Prompts

  1. Define a saga. Why is it not atomic?
  2. Name the two saga coordination styles. What is the observability cost of each?
  3. What property must every compensation have?
  4. Name a step whose effect cannot be compensated and describe the workaround.
  5. What is a semantic lock?

Choose a Workflow

Pick one that is concrete enough to feel uncomfortable:

  • A: E-commerce checkout: reserve inventory, charge payment, create shipment, send receipt email.
  • B: Travel booking: reserve flight, reserve hotel, charge card, email itinerary.
  • C: User signup: create user record, create auth credentials, provision default workspace, send welcome email.
  • D: Cross-bank transfer: debit source, credit destination, post to clearing ledger, notify both accounts.

Step 1: Forward Path Table

Make a table with columns: Step, Service, Forward action, Idempotency key, Durable preconditions, Observable intermediate state.

Fill it in for every forward step Ti. Example for workflow A:

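| Step | Service   | Forward action         | Idempotency key                  | Preconditions          | Intermediate state                   |
|------|-----------|------------------------|----------------------------------|------------------------|--------------------------------------|
| T1   | Inventory | Reserve N units of SKU | (order_id, sku)                  | SKU exists; stock >= N | inventory reserved but not committed |
| T2   | Payments  | Charge card for $X     | order_id                         | card token valid       | charge authorized                    |
| T3   | Shipping  | Create shipment record | order_id                         | T1, T2 succeeded       | shipment pending                     |
| T4   | Email     | Send receipt           | (order_id, email-kind=receipt)   | T3 success             | email enqueued                       |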

Step 2: Compensation Table

Make a parallel table with Step, Compensation action, Is it a pure inverse?, Observable side effects that cannot be undone.

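| Step | Compensation            | Pure inverse?         | Irreversible side effects            |
|------|-------------------------|-----------------------|--------------------------------------|
| T1   | Release reserved units  | yes                   | none                                 |
| T2   | Issue refund            | no (new ledger entry) | payment processor may charge fees    |
| T3   | Cancel shipment record  | yes if pre-dispatch   | if already shipped, must recall / RMA |
| T4   | Send cancellation email | no (can't un-send)    | user received both emails            |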
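
The T2 row is a useful case to work through in code, because a refund does not erase the charge. The sketch below assumes a hypothetical append-only ledger held in a plain list; the names `charge`, `refund`, and `balance` are illustrative and not taken from any payment API. It shows a compensation that adds a new entry and stays safe to repeat.

```python
from typing import List, Tuple

Entry = Tuple[str, str, int]  # (order_id, kind, amount_cents); hypothetical shape


def _total(ledger: List[Entry], order_id: str, kind: str) -> int:
    return sum(a for (o, k, a) in ledger if o == order_id and k == kind)


def charge(ledger: List[Entry], order_id: str, cents: int) -> None:
    """Forward step T2: record one charge per order (order_id is the idempotency key)."""
    if _total(ledger, order_id, "charge") == 0:
        ledger.append((order_id, "charge", cents))


def refund(ledger: List[Entry], order_id: str) -> None:
    """Compensation for T2: append a refund for whatever is still outstanding.

    The charge entry is never deleted, so the audit trail survives.
    Calling this again after a full refund appends nothing.
    """
    outstanding = _total(ledger, order_id, "charge") - _total(ledger, order_id, "refund")
    if outstanding > 0:
        ledger.append((order_id, "refund", outstanding))


def balance(ledger: List[Entry], order_id: str) -> int:
    """Net amount the customer has paid for this order."""
    return _total(ledger, order_id, "charge") - _total(ledger, order_id, "refund")
```

After a charge and a refund the ledger holds two entries and the balance is zero, which is the semantic outcome the compensation promises even though the history is not the same as if the charge had never happened.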

Step 3: Ordering and Idempotency

  • Document the order in which compensations run (reverse of forward steps).
  • For every forward step and every compensation, note the idempotency key and how duplicate deliveries are detected.
  • For every compensation, verify it can be safely retried N times. Write this down; do not assume.
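
The ordering rule above can be sketched as a tiny in-memory orchestrator. This is a minimal illustration, not a production design: `Step` and `run_saga` are hypothetical names, the log lives in memory rather than durable storage, and compensations are assumed not to fail.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    forward: Callable[[], None]
    compensate: Callable[[], None]


def run_saga(steps: List[Step]) -> List[str]:
    """Run forward steps in order; on failure, compensate completed steps in reverse."""
    log: List[str] = []
    done: List[Step] = []
    for step in steps:
        try:
            step.forward()
        except Exception:
            log.append(f"{step.name}:failed")
            # Only steps that completed are compensated, newest first.
            for prev in reversed(done):
                prev.compensate()
                log.append(f"{prev.name}:compensated")
            break
        done.append(step)
        log.append(f"{step.name}:ok")
    return log
```

If T1 and T2 succeed and T3 raises, the log reads T1:ok, T2:ok, T3:failed, T2:compensated, T1:compensated. The failed step itself is not compensated here; whether a partially applied step needs cleanup is a design decision to record in the table.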

Step 4: Concurrency and Semantic Locks

  • Can two sagas run on the same resource (e.g., the same SKU, the same card) concurrently? If yes, model the race.
  • Decide whether to hold a semantic lock (e.g., mark inventory as "reserved") so other sagas see the pending state and do not double-commit.
  • Decide what happens when a saga times out: does the next read see the intermediate state, and if so, what does it look like?
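
One way to model the semantic lock from the bullets above is to keep reserved units separate from available units, keyed by the saga that holds them. The `Inventory` class below is a hypothetical single-process sketch; a real service would need the same checks inside a database transaction.

```python
class Inventory:
    """Tracks free stock and per-saga reservations (the semantic lock)."""

    def __init__(self, stock):
        self.available = dict(stock)  # sku -> units not held by any saga
        self.reserved = {}            # (order_id, sku) -> units held by that saga

    def reserve(self, order_id, sku, n):
        """Forward step: hold n units. A duplicate call for the same key is a no-op."""
        key = (order_id, sku)
        if key in self.reserved:
            return True
        if self.available.get(sku, 0) < n:
            return False  # another saga's pending reservation is visible here
        self.available[sku] -= n
        self.reserved[key] = n
        return True

    def release(self, order_id, sku):
        """Compensation: return held units. Safe to call repeatedly."""
        n = self.reserved.pop((order_id, sku), 0)
        self.available[sku] = self.available.get(sku, 0) + n

    def commit(self, order_id, sku):
        """Saga finished: the units leave stock for good."""
        self.reserved.pop((order_id, sku), None)
```

With five units in stock, a second saga asking for three units is refused while the first saga holds three, and succeeds once the first saga releases them. The `reserved` map is also the intermediate state a timed-out saga leaves behind, so it is the thing a cleanup job or a reader has to interpret.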

Step 5: Failure Runbook

Write one runbook entry per failure mode you can enumerate:

  • forward step fails -> run compensations for the completed steps in reverse order
  • compensation fails (transient) -> retry with backoff; log to saga state
  • compensation fails (permanent) -> escalate to human queue; describe the human's tools and information needs
  • saga orchestrator crashes mid-flight -> describe how state is recovered on restart
  • duplicate forward step triggered by at-least-once delivery -> demonstrate idempotency
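
The two compensation-failure entries above can be sketched as a single retry helper. Everything here is an assumption for illustration: the `TransientError` and `PermanentError` classes, the `compensate_with_retry` name, and the `escalate` callback standing in for the human queue. The `sleep` parameter is injectable so the backoff can be tested without waiting.

```python
import time


class TransientError(Exception):
    """Failure worth retrying, such as a timeout."""


class PermanentError(Exception):
    """Failure that retrying will not fix."""


def compensate_with_retry(action, max_attempts=5, base_delay=0.1,
                          sleep=time.sleep, escalate=print):
    """Retry a compensation with exponential backoff; escalate if it cannot complete."""
    for attempt in range(1, max_attempts + 1):
        try:
            action()
            return "compensated"
        except TransientError:
            if attempt == max_attempts:
                break
            # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
        except PermanentError as exc:
            escalate(f"permanent failure: {exc}")
            return "escalated"
    escalate(f"gave up after {max_attempts} attempts")
    return "escalated"
```

The helper only makes sense if `action` is idempotent, since a timeout may hide a compensation that actually succeeded. A runbook entry would also say what the escalation message must contain for the human to act on it.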

Step 6: 2PC Comparison

For the same workflow, sketch what 2PC would look like. Identify:

  • which participants are XA-capable and which are not (e.g., a third-party email provider is not)
  • how long locks would be held during the coordinator's protocol
  • how a coordinator failure would strand participants in doubt
  • why, in practice, sagas are still the pragmatic choice

Compare and Distinguish

  • orchestration vs choreography: which is better for this workflow and why?
  • semantic compensation vs syntactic inverse: where do they diverge for this workflow?
  • saga vs retries with idempotency vs outbox pattern: when is each sufficient?

Common Mistake Check

  1. "My compensation just calls DELETE on the record the forward step created." This ignores observability, the audit trail, and events already emitted.
  2. "If compensation fails, we log and move on." This leaves split state unaddressed; document the escape hatch.
  3. "We'll use eventually consistent events to drive the saga." Fine, but name the ordering guarantee (causal? session? none?).
  4. "Each step is idempotent because we use unique IDs." Show where the idempotency is actually enforced: DB constraint, application check, or broker deduplication?
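
For item 4, one concrete answer is a database constraint. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the `charges` table and the `charge_once` function are hypothetical names chosen for the example, with `order_id` as the idempotency key from the forward path table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE charges (order_id TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL)"
)


def charge_once(conn, order_id, amount_cents):
    """Return True if this call recorded the charge, False if it was a duplicate.

    The PRIMARY KEY on order_id is what enforces idempotency; the
    application only interprets the constraint violation.
    """
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO charges (order_id, amount_cents) VALUES (?, ?)",
                (order_id, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

A second delivery of the same message hits the constraint and returns False without writing a second row. An application-level "check then insert" would need the same constraint underneath to be safe against two concurrent deliveries.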

Evidence Check

Complete only if:

  • Step-by-step forward table and compensation table are written down.
  • Every compensation has an explicit idempotency and retry story.
  • At least one failure mode has a written runbook entry more than one line long.
  • You can defend, in one paragraph, why saga was chosen over 2PC for this workflow.
  • You have named at least one observable intermediate state that the application UI or downstream consumers must handle.