Saga Design Workshop
Pick one workflow and design a production-plausible saga. The deliverable is a table of forward steps and compensations, plus a written failure runbook.
Retrieval Prompts
- Define a saga. Why is it not atomic?
- State the two saga coordination styles. What is the observability cost of each?
- What property must every compensation have?
- Name a step whose effect cannot be compensated and describe the workaround.
- What is a semantic lock?
Choose a Workflow
Pick one that is concrete enough to feel uncomfortable:
- A: E-commerce checkout: reserve inventory, charge payment, create shipment, send receipt email.
- B: Travel booking: reserve flight, reserve hotel, charge card, email itinerary.
- C: User signup: create user record, create auth credentials, provision default workspace, send welcome email.
- D: Cross-bank transfer: debit source, credit destination, post to clearing ledger, notify both accounts.
Step 1: Forward Path Table
Make a table with columns: Step, Service, Forward action, Idempotency key, Durable preconditions, Observable intermediate state.
Fill it in for every forward step Ti. Example for workflow A:
| Step | Service | Forward action | Idempotency key | Preconditions | Intermediate state |
|---|---|---|---|---|---|
| T1 | Inventory | Reserve N units of SKU | (order_id, sku) | SKU exists; stock >= N | inventory reserved but not committed |
| T2 | Payments | Charge card for $X | order_id | card token valid | charge authorized |
| T3 | Shipping | Create shipment record | order_id | T1, T2 succeeded | shipment pending |
| T4 | Email | Send receipt | (order_id, email-kind=receipt) | T3 succeeded | email enqueued |
Step 2: Compensation Table
Make a parallel table with columns: Step, Compensation action, Is it a pure inverse?, Observable side effects that cannot be undone. Example for workflow A:
| Step | Compensation | Pure inverse? | Irreversible side effects |
|---|---|---|---|
| T1 | Release reserved units | yes | none |
| T2 | Issue refund | no (new ledger entry) | payment processor may charge fees |
| T3 | Cancel shipment record | yes if pre-dispatch | if already shipped, must recall / RMA |
| T4 | Send cancellation email | no (can't un-send) | user received both emails |
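The T2 row says the refund is a new ledger entry and not a pure inverse. A minimal sketch of that difference, assuming a hypothetical append-only `Ledger` class (the class, its methods, and the cent amounts are illustrative and not taken from any real payments API):

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entry:
    order_id: str
    kind: str          # "charge" or "refund"
    amount_cents: int


class Ledger:
    """Hypothetical append-only ledger: entries are added, never deleted."""

    def __init__(self) -> None:
        self.entries: List[Entry] = []

    def _total(self, order_id: str, kind: str) -> int:
        return sum(e.amount_cents for e in self.entries
                   if e.order_id == order_id and e.kind == kind)

    def charge(self, order_id: str, amount_cents: int) -> None:
        self.entries.append(Entry(order_id, "charge", amount_cents))

    def refund(self, order_id: str) -> None:
        # Semantic compensation: offset whatever is still outstanding with a
        # new entry. Calling it again finds nothing outstanding and is a no-op.
        outstanding = (self._total(order_id, "charge")
                       - self._total(order_id, "refund"))
        if outstanding > 0:
            self.entries.append(Entry(order_id, "refund", outstanding))

    def balance(self, order_id: str) -> int:
        return (self._total(order_id, "charge")
                - self._total(order_id, "refund"))
```

After a charge and a refund the balance is back to zero, but the ledger holds two entries. That residue is what the "Irreversible side effects" column is meant to capture.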
Step 3: Ordering and Idempotency
- Document the order in which compensations run (the reverse of the forward order, starting from the last step that completed).
- For every forward step and every compensation, note the idempotency key and how duplicate deliveries are detected.
- For every compensation, verify it can be safely retried N times. Write this down; do not assume.
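The ordering and retry requirements above can be sketched as a small in-process orchestrator. This is an illustration under assumptions, not a production design: `Step`, `SagaLog`, `run_saga`, and `compensate` are hypothetical names, and the log is an in-memory stand-in for a durable store.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Step:
    name: str
    forward: Callable[[], None]
    compensate: Callable[[], None]


@dataclass
class SagaLog:
    """In-memory stand-in for a durable saga log (assumption)."""
    completed: List[str] = field(default_factory=list)
    compensated: Set[str] = field(default_factory=set)


def compensate(steps_by_name: Dict[str, Step], log: SagaLog) -> None:
    # Walk completed steps in reverse. Steps already compensated are skipped,
    # so this function can be re-run after a crash or a duplicate trigger.
    for name in reversed(log.completed):
        if name in log.compensated:
            continue
        steps_by_name[name].compensate()
        log.compensated.add(name)


def run_saga(steps: List[Step], log: SagaLog) -> bool:
    """Run forward steps in order; on any failure, compensate and return False."""
    steps_by_name = {s.name: s for s in steps}
    try:
        for step in steps:
            step.forward()
            log.completed.append(step.name)  # record only after success
    except Exception:
        # A failing compensation propagates to the caller in this sketch;
        # a real system would retry or escalate it.
        compensate(steps_by_name, log)
        return False
    return True
```

If T3 raises after T1 and T2 succeeded, the compensations run as T2 then T1, and a second call to `compensate` with the same log does nothing. Note that the failed step itself is not compensated here, which is only safe if a failed forward step is known to have left no partial effect.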
Step 4: Concurrency and Semantic Locks
- Can two sagas run on the same resource (e.g., the same SKU, the same card) concurrently? If yes, model the race.
- Decide whether to hold a semantic lock (e.g., mark inventory as "reserved") so other sagas see the pending state and do not double-commit.
- Decide what happens when a saga times out: does the next read see the intermediate state, and if so, what does it look like?
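One way to model the "reserved" marker is to track pending reservations separately from available stock, so a second saga on the same SKU sees the reduced availability. The sketch below assumes a hypothetical single-process `Inventory` class; it is not thread-safe, and a real service would enforce the same checks inside a database transaction.

```python
from typing import Dict, Tuple


class Inventory:
    """Hypothetical inventory with a semantic lock per (order_id, sku)."""

    def __init__(self, stock: Dict[str, int]) -> None:
        self.available: Dict[str, int] = dict(stock)
        # Pending reservations are the observable intermediate state.
        self.reserved: Dict[Tuple[str, str], int] = {}

    def reserve(self, order_id: str, sku: str, units: int) -> bool:
        key = (order_id, sku)
        if key in self.reserved:
            return True  # duplicate delivery: already reserved, do nothing
        if self.available.get(sku, 0) < units:
            return False  # another saga holds the stock
        self.available[sku] -= units
        self.reserved[key] = units
        return True

    def release(self, order_id: str, sku: str) -> None:
        # Compensation: return the units. A repeated call finds no
        # reservation and returns zero units, so retries are safe.
        units = self.reserved.pop((order_id, sku), 0)
        self.available[sku] = self.available.get(sku, 0) + units

    def commit(self, order_id: str, sku: str) -> None:
        # Saga succeeded: the units leave stock for good.
        self.reserved.pop((order_id, sku), None)
```

A timed-out saga leaves its entry in `reserved` until something releases it, which is the state the third bullet asks you to describe.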
Step 5: Failure Runbook
Write one runbook entry per failure mode you can enumerate:
- forward step fails -> run compensations in reverse
- compensation fails (transient) -> retry with backoff; log to saga state
- compensation fails (permanent) -> escalate to human queue; describe the human's tools and information needs
- saga orchestrator crashes mid-flight -> describe how state is recovered on restart
- duplicate forward step triggered by at-least-once delivery -> demonstrate idempotency
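The two compensation-failure entries can be sketched as one retry loop. Everything here is an assumption made for illustration: the `TransientError` and `PermanentError` classes, the `run_compensation` helper, and the plain list standing in for a human escalation queue.

```python
import time
from typing import Callable, Dict, List


class TransientError(Exception):
    """Failure that may succeed on retry (e.g., a timeout)."""


class PermanentError(Exception):
    """Failure that retrying cannot fix (e.g., shipment already dispatched)."""


def run_compensation(
    name: str,
    action: Callable[[], None],
    human_queue: List[Dict[str, str]],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry transient failures with exponential backoff; escalate the rest."""
    for attempt in range(1, max_attempts + 1):
        try:
            action()
            return True
        except TransientError as exc:
            if attempt == max_attempts:
                human_queue.append(
                    {"step": name, "reason": f"retries exhausted: {exc}"})
                return False
            sleep(base_delay * 2 ** (attempt - 1))  # 0.1s, 0.2s, 0.4s, ...
        except PermanentError as exc:
            human_queue.append({"step": name, "reason": str(exc)})
            return False
    return False
```

The `sleep` parameter is injectable so the loop can be exercised without waiting. The queue entry carries only a step name and a reason here; the runbook should say what else the human needs, such as the saga ID and the state of every other step.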
Step 6: Compare Against the 2PC Alternative
For the same workflow, sketch what 2PC would look like. Identify:
- which participants are XA-capable and which are not (e.g., a third-party email provider is not)
- how long locks would be held during the coordinator's protocol
- how a coordinator failure would strand participants in doubt
- why, in practice, sagas are still the pragmatic choice
Compare and Distinguish
- orchestration vs choreography: which is better for this workflow and why?
- semantic compensation vs syntactic inverse: where do they diverge for this workflow?
- saga vs retries with idempotency vs outbox pattern: when is each sufficient?
Common Mistake Check
- "My compensation just calls DELETE on the record the forward step created." Shows you did not consider observability, audit trail, and events already emitted.
- "If compensation fails, we log and move on." Shows no reasoning about split state; document the escape hatch.
- "We'll use eventually consistent events to drive the saga." Fine, but name the ordering guarantee (causal? session? none?).
- "Each step is idempotent because we use unique IDs." Show where the idempotency is actually enforced: DB constraint, application check, or broker deduplication?
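For the last mistake, one place idempotency can be enforced is a database uniqueness constraint on the idempotency key. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the `charges` table and the `charge_once` helper are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE charges (
        order_id     TEXT PRIMARY KEY,   -- the idempotency key
        amount_cents INTEGER NOT NULL
    )
    """
)


def charge_once(conn: sqlite3.Connection, order_id: str, amount_cents: int) -> bool:
    """Record a charge; return False if this order_id was already charged."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO charges (order_id, amount_cents) VALUES (?, ?)",
                (order_id, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key rejected the duplicate, so detection does not
        # depend on application code remembering what it has seen.
        return False
```

A duplicate delivery with the same `order_id` returns `False` and leaves a single row. The same idea applies to any store that can enforce a unique key atomically with the write.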
Evidence Check
Complete only if:
- Step-by-step forward table and compensation table are written down.
- Every compensation has an explicit idempotency and retry story.
- At least one failure mode has a written runbook entry more than one line long.
- You can defend, in one paragraph, why saga was chosen over 2PC for this workflow.
- You have named at least one observable intermediate state that the application UI or downstream consumers must handle.