Saga and Idempotency Clinic
Retrieval Prompts
- State the one-sentence difference between a saga and a distributed transaction.
- Explain why a compensation is not a rollback.
- State the rule "at-least-once delivery + X = effectively exactly-once" -- what is X?
- Name three naturally idempotent operations and one that is fundamentally not.
- State the terminal states every saga must have.
Compare and Distinguish
- saga vs 2PC / XA (availability, failure handling, operational cost)
- choreography saga vs orchestration saga (who holds the state machine)
- compensation vs rollback (committed-new-fact vs undo-the-uncommitted)
- idempotency vs deduplication (property of the operation vs detection of the duplicate)
- broker "exactly once" vs effective exactly once (within-broker vs end-to-end)
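The idempotency vs deduplication pair can be made concrete with a small sketch. The names set_status, add_credit, and DedupingHandler are hypothetical and chosen only for illustration; the in-memory set stands in for whatever durable store a real consumer would need.

```python
# Idempotency is a property of the operation itself;
# deduplication is a wrapper that detects a repeated message.

def set_status(order: dict, status: str) -> None:
    """Naturally idempotent: applying it twice leaves the same state."""
    order["status"] = status


def add_credit(account: dict, amount: int) -> None:
    """Not idempotent: every application changes the state again."""
    account["balance"] += amount


class DedupingHandler:
    """Guards a non-idempotent operation by remembering message IDs."""

    def __init__(self, operation):
        self._operation = operation
        self._seen = set()  # in-memory only; an assumption for this sketch

    def handle(self, message_id: str, *args) -> bool:
        if message_id in self._seen:
            return False  # duplicate detected, effect skipped
        self._operation(*args)
        self._seen.add(message_id)
        return True


order = {"status": "new"}
set_status(order, "paid")
set_status(order, "paid")  # redelivery is harmless without any dedup

account = {"balance": 0}
credit = DedupingHandler(add_credit)
credit.handle("msg-1", account, 50)
credit.handle("msg-1", account, 50)  # redelivery is detected and dropped
```

Note that the handler above is not crash-safe: the effect and the "seen" marker are recorded separately, which is the gap Mini Application 2 asks you to close.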
Common Mistake Check
For each, name the error:
- "We retry failed commands until they succeed, so we don't need idempotency."
- "Our compensation function reverses the previous DB transaction."
- "We use Kafka EOS, so our DB writes are exactly-once."
- "Our saga's compensation for PaymentCaptured is a new PaymentCaptured with negative amount."
- "Our saga has pending, completed, failed states -- looks complete."
- "The dedup cache has a 60-second TTL because messages are fast."
- "Our orchestrator retries indefinitely; no need for timeouts."
- "We do not need idempotency keys on Stripe because we call it only once."
Mini Application 1 -- Design a Three-Step Saga
Design the saga for refund processing:
- Step 1: verify the order is eligible for refund (in OrderService)
- Step 2: refund the card (in PaymentService)
- Step 3: restock the item (in InventoryService)
In 30 minutes:
A. Happy path
Write the sequence of commands and events for the successful path.
B. Failure paths
For each of these, describe what happens:
- Step 1 fails (not eligible)
- Step 2 fails (card refund declined)
- Step 3 fails (SKU not found to restock)
- Step 2 succeeds, then Step 3 fails (the card refund has already been issued)
For each, list:
- which compensations run
- in what order
- the terminal state (completed, compensated, needs-human)
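To check your answers mechanically, a generic orchestration loop can be sketched as follows. It is a minimal illustration, not the refund saga itself: Step, run_saga, and the step names a, b, c are hypothetical, and the sketch assumes every completed step has a compensation and that a failed compensation escalates straight to needs-human (a real orchestrator would usually retry it first).

```python
from typing import Callable, NamedTuple


class Step(NamedTuple):
    name: str
    action: Callable[[], None]
    compensate: Callable[[], None]


def run_saga(steps: list) -> tuple:
    """Run steps in order; on failure, compensate completed steps in reverse.

    Returns (terminal_state, log).
    """
    log = []
    done = []
    for step in steps:
        try:
            step.action()
        except Exception:
            log.append(f"{step.name}:failed")
            break
        log.append(f"{step.name}:ok")
        done.append(step)
    else:
        return "completed", log  # no step failed

    for step in reversed(done):  # most recent committed step first
        try:
            step.compensate()
        except Exception:
            log.append(f"{step.name}:compensation-failed")
            return "needs-human", log
        log.append(f"{step.name}:compensated")
    return "compensated", log


def ok() -> None:
    pass


def boom() -> None:
    raise RuntimeError("step failed")


state, log = run_saga([Step("a", ok, ok), Step("b", ok, ok), Step("c", boom, ok)])
```

The failed step itself is not compensated, because it never committed anything; only the steps recorded in done are.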
C. Compensations
For each step, write the compensation as a named, idempotent operation. Note which step is non-compensatable (if any) and justify its placement.
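One possible shape for a named, idempotent compensation is sketched here. StockLog, restock_item, reverse_restock, and the fact names StockRestocked and RestockReversed are all hypothetical. The sketch assumes each fact is keyed by saga_id plus step name, so a redelivered compensation command becomes a no-op, and that the compensation is recorded as a new, distinctly named fact rather than by deleting the original.

```python
from dataclasses import dataclass, field


@dataclass
class StockLog:
    """Append-only log of stock facts, keyed so a repeat is ignored."""
    facts: dict = field(default_factory=dict)

    def record_once(self, key: str, kind: str, qty: int) -> bool:
        if key in self.facts:
            return False  # this fact was already recorded
        self.facts[key] = (kind, qty)
        return True

    def on_hand(self) -> int:
        sign = {"StockRestocked": 1, "RestockReversed": -1}
        return sum(sign[kind] * qty for kind, qty in self.facts.values())


def restock_item(log: StockLog, saga_id: str, qty: int) -> bool:
    return log.record_once(f"{saga_id}:restock", "StockRestocked", qty)


def reverse_restock(log: StockLog, saga_id: str, qty: int) -> bool:
    """Compensation: adds an offsetting fact; the original fact stays in the log."""
    return log.record_once(f"{saga_id}:reverse-restock", "RestockReversed", qty)


stock = StockLog()
restock_item(stock, "saga-1", 3)
reverse_restock(stock, "saga-1", 3)
reverse_restock(stock, "saga-1", 3)  # redelivered command, no second effect
```

After the run, the log still contains both facts, which is what distinguishes a compensation from a rollback.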
D. Structural decisions
- Choreography or orchestration? Why?
- Correlation ID (saga_id) source?
- Retry policy per step?
- Timeout per step?
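A per-step retry and timeout policy might be written down as data, as in this sketch. StepPolicy and run_step are hypothetical names, and the numbers are placeholders rather than recommendations. The sketch assumes the step's client raises TimeoutError when the per-attempt deadline passes, and it takes sleep as a parameter so the policy can be tested without waiting.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class StepPolicy:
    max_attempts: int    # bounded, so the saga can reach a terminal state
    base_delay_s: float
    max_delay_s: float
    timeout_s: float     # per-attempt deadline, passed to the step's client

    def delay_before(self, attempt: int) -> float:
        """Exponential backoff before a retry; attempt 2 is the first retry."""
        return min(self.max_delay_s, self.base_delay_s * 2 ** (attempt - 2))


def run_step(action: Callable[[float], None], policy: StepPolicy,
             sleep: Callable[[float], None]) -> str:
    """Try the action up to max_attempts times; report how it ended."""
    for attempt in range(1, policy.max_attempts + 1):
        if attempt > 1:
            sleep(policy.delay_before(attempt))
        try:
            action(policy.timeout_s)
            return "succeeded"
        except TimeoutError:
            continue  # safe only if the action is idempotent
    return "exhausted"  # hand over to compensation or a human


policy = StepPolicy(max_attempts=4, base_delay_s=1.0, max_delay_s=30.0, timeout_s=5.0)
sleeps = []
calls = {"n": 0}


def flaky(timeout_s: float) -> None:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("no reply within deadline")


result = run_step(flaky, policy, sleeps.append)
```

A timeout does not tell the caller whether the step took effect, which is why the retry is only safe when the step is idempotent.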
E. Observability
- How does support see "refund_saga for order ord_9f2a is stuck at step 2"?
- What metric signals saga health?
Mini Application 2 -- Make a Consumer Idempotent
Take a consumer that today is not idempotent:
def on_payment_captured(event):
    # charges fees and creates a ledger entry
    fees = compute_fees(event)
    charge_merchant_fee(event.merchant_id, fees)  # calls external API
    db.insert("ledger", ...)  # writes ledger
    send_receipt_email(event.customer_id, event.amount)
In 30 minutes, redesign it:
- Identify every non-idempotent effect (external calls, DB writes, email sends).
- Pick a dedup key for the consumer as a whole.
- Pick the dedup store and TTL; justify.
- Rewrite the handler so that dedup + effects commit in one transaction.
- For each external call, say whether it needs an idempotency key and what the deterministic key would be.
- Describe what happens if the broker redelivers the same event 5 seconds after the first attempt. After 5 hours.
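One possible shape for the redesigned handler, to compare against your own, is sketched here using the standard-library sqlite3 module. Everything beyond the original handler's intent is an assumption: the event_id field, the table layout, the 1% fee rule, and FakeFeeApi, a hypothetical stand-in for a provider that honors idempotency keys. The receipt email is left out, and the dedup rows are kept forever, so the TTL question is still yours to answer.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed (event_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE ledger (event_id TEXT, merchant_id TEXT, fee INTEGER)")
    conn.commit()
    return conn


class FakeFeeApi:
    """Hypothetical provider: a repeated idempotency key does not charge again."""

    def __init__(self) -> None:
        self.charges = {}

    def charge(self, merchant_id: str, fee: int, idempotency_key: str) -> None:
        self.charges.setdefault(idempotency_key, (merchant_id, fee))


def on_payment_captured(conn: sqlite3.Connection, api: FakeFeeApi, event: dict) -> bool:
    """Return True if the event was processed, False if it was a duplicate."""
    event_id = event["event_id"]
    seen = conn.execute(
        "SELECT 1 FROM processed WHERE event_id = ?", (event_id,)
    ).fetchone()
    if seen:
        return False

    fee = event["amount"] // 100  # placeholder fee rule, an assumption
    # Deterministic key: a retry after a crash reuses it, so the provider dedups.
    api.charge(event["merchant_id"], fee, idempotency_key=f"fee:{event_id}")

    try:
        with conn:  # dedup marker and ledger row commit or roll back together
            conn.execute("INSERT INTO processed (event_id) VALUES (?)", (event_id,))
            conn.execute(
                "INSERT INTO ledger (event_id, merchant_id, fee) VALUES (?, ?, ?)",
                (event_id, event["merchant_id"], fee),
            )
    except sqlite3.IntegrityError:
        return False  # a concurrent delivery committed first
    return True


conn = make_db()
api = FakeFeeApi()
event = {"event_id": "evt_1", "merchant_id": "m_42", "amount": 2500}
first = on_payment_captured(conn, api, event)
second = on_payment_captured(conn, api, event)  # simulated redelivery
ledger_rows = conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0]
```

The external call sits outside the database transaction, so a crash between the call and the commit leads to a second call on redelivery. The deterministic key is what makes that second call harmless.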
Mini Application 3 -- Ordering and Partitioning Drill
A team is building a saga orchestrator that consumes events from four topics:
- order.placed (partitioned by order_id, 12 partitions)
- payment.captured (partitioned by order_id, 12 partitions)
- stock.reserved (partitioned by sku, 8 partitions)
- shipment.created (partitioned by order_id, 6 partitions)
The orchestrator is occasionally seeing PaymentCaptured before OrderPlaced.
- What is the most likely cause?
- How does the partitioning choice on stock.reserved (by sku) affect order-level sagas?
- What is the fix that restores the orchestrator's ability to reason per order?
- What assumptions about retries and redelivery must the orchestrator make given at-least-once delivery?
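(Expected: ordering holds only within a single partition of a single topic, so events for the same order on different topics can arrive in any relative order. Inconsistent keying and partition counts across topics make this worse: events for the same order are routed through different partitions, and different consumers within the orchestrator group handle them independently. The orchestrator must tolerate out-of-order arrivals by buffering, and it can regain per-order reasoning by partitioning all saga-relevant topics by order_id with matching partition counts. Under at-least-once delivery it must also expect any event to be redelivered.)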
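Both halves of the expected answer can be sketched in a few lines. The partitioner below is a stand-in that uses CRC32 from the standard library (Kafka's default partitioner uses a different hash), and the names partition_for and OrderSagaBuffer, along with the sample sku value, are hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Stand-in partitioner: hash of the key modulo the partition count."""
    return zlib.crc32(key.encode()) % num_partitions


# topic -> (key field, partition count), as given in the drill
topics = {
    "order.placed": ("order_id", 12),
    "payment.captured": ("order_id", 12),
    "stock.reserved": ("sku", 8),
    "shipment.created": ("order_id", 6),
}
event = {"order_id": "ord_9f2a", "sku": "sku_123"}
placement = {
    topic: partition_for(event[key_field], count)
    for topic, (key_field, count) in topics.items()
}


class OrderSagaBuffer:
    """Parks events that arrive before OrderPlaced and replays them afterwards."""

    def __init__(self) -> None:
        self.started = set()
        self.parked = {}
        self.applied = []

    def on_event(self, order_id: str, event_type: str) -> None:
        if event_type == "OrderPlaced":
            if order_id in self.started:
                return  # redelivered OrderPlaced, ignore
            self.started.add(order_id)
            self.applied.append((order_id, event_type))
            for parked_type in self.parked.pop(order_id, []):
                self.applied.append((order_id, parked_type))
        elif order_id in self.started:
            self.applied.append((order_id, event_type))
        else:
            self.parked.setdefault(order_id, []).append(event_type)


buffer = OrderSagaBuffer()
buffer.on_event("ord_9f2a", "PaymentCaptured")  # arrives early, gets parked
buffer.on_event("ord_9f2a", "OrderPlaced")      # releases the parked event
```

Even when two topics place an order on the same partition number, as order.placed and payment.captured do here, they are still separate logs, so the buffer is needed regardless of the keying fix.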
Worked Compensation Trace (Checkout Saga)
Walk through this failure without looking at Concept 11 first. Then compare.
Setup: saga steps are ReserveStock, ChargeCard, CreateShipment. Saga is orchestrated.
Scenario: ReserveStock succeeds. ChargeCard succeeds. CreateShipment fails with CarrierUnavailable.
Write:
- the compensations in order
- the facts published by each compensating step
- the terminal state
- which steps must be idempotent, and why
Evidence Check
This page is complete only when:
- You designed the refund saga with all failure paths, compensations, and terminal states.
- You rewrote the non-idempotent consumer with dedup + idempotency keys on external calls.
- You diagnosed the partitioning/ordering bug and proposed a fix.
- You walked the checkout saga's "shipment fails" case end-to-end without looking at Concept 11.
- You can explain in plain English why "exactly once" is a systems-level property, not a broker feature.