Saga and Idempotency Clinic

Retrieval Prompts

  1. State the one-sentence difference between a saga and a distributed transaction.
  2. Explain why a compensation is not a rollback.
  3. Complete the rule "at-least-once delivery + X = effectively exactly-once". What is X?
  4. Name three naturally idempotent operations and one that is fundamentally not.
  5. State the terminal states every saga must have.

Compare and Distinguish

  • saga vs 2PC / XA (availability, failure handling, operational cost)
  • choreography saga vs orchestration saga (who holds the state machine)
  • compensation vs rollback (committed-new-fact vs undo-the-uncommitted)
  • idempotency vs deduplication (property of the operation vs detection of the duplicate)
  • broker "exactly once" vs effective exactly once (within-broker vs end-to-end)
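The idempotency vs deduplication distinction can be made concrete with a small sketch. Everything here is hypothetical and for illustration only: the `Account` class, its method names, and the in-memory set standing in for a dedup store. `set_status` is idempotent as a property of the operation itself, while `add_funds_once` only becomes safe to retry because the duplicate is detected first.

```python
class Account:
    """Toy aggregate used only to contrast the two ideas (hypothetical)."""

    def __init__(self):
        self.status = "open"
        self.balance = 0
        self.seen_message_ids = set()  # stand-in for a durable dedup store

    def set_status(self, status):
        # Idempotent by nature: applying it twice leaves the same state.
        self.status = status

    def add_funds(self, amount):
        # Not idempotent: every call changes the balance again.
        self.balance += amount

    def add_funds_once(self, message_id, amount):
        # Deduplication: detect the duplicate, then skip the effect.
        if message_id in self.seen_message_ids:
            return False
        self.seen_message_ids.add(message_id)
        self.balance += amount
        return True
```

Note that the in-memory set would not survive a restart; a real consumer would need the dedup record and the balance change to commit together.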

Common Mistake Check

For each, name the error:

  1. "We retry failed commands until they succeed, so we don't need idempotency."
  2. "Our compensation function reverses the previous DB transaction."
  3. "We use Kafka EOS, so our DB writes are exactly-once."
  4. "Our saga's compensation for PaymentCaptured is a new PaymentCaptured with negative amount."
  5. "Our saga has pending, completed, failed states -- looks complete."
  6. "The dedup cache has a 60-second TTL because messages are fast."
  7. "Our orchestrator retries indefinitely; no need for timeouts."
  8. "We do not need idempotency keys on Stripe because we call it only once."

Mini Application 1 -- Design a Three-Step Saga

Design the saga for refund processing:

  • Step 1: verify the order is eligible for refund (in OrderService)
  • Step 2: refund the card (in PaymentService)
  • Step 3: restock the item (in InventoryService)

In 30 minutes:

A. Happy path

Write the sequence of commands and events for the successful path.

B. Failure paths

For each of these, describe what happens:

  • Step 1 fails (not eligible)
  • Step 2 fails (card refund declined)
  • Step 3 fails (SKU not found to restock)
  • Step 2 succeeds, Step 3 fails

For each, list:

  • which compensations run
  • in what order
  • the terminal state (completed, compensated, needs-human)

C. Compensations

For each step, write the compensation as a named, idempotent operation. Note which step is non-compensatable (if any) and justify its placement.
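One possible shape for a named, idempotent compensation is sketched below. It assumes the compensation is keyed by (saga_id, compensation name) and that running it publishes a new fact rather than erasing an old one. `CompensationLog`, `reverse_restock`, and the `RestockReversed` fact are hypothetical names chosen for illustration, not part of the exercise's expected answer.

```python
from dataclasses import dataclass, field


@dataclass
class CompensationLog:
    # (saga_id, compensation_name) -> fact published the first time it ran
    applied: dict = field(default_factory=dict)
    published: list = field(default_factory=list)

    def run_once(self, saga_id, name, make_fact):
        key = (saga_id, name)
        if key in self.applied:
            # Replay: return the original fact and cause no new effect.
            return self.applied[key]
        fact = make_fact()
        self.applied[key] = fact
        self.published.append(fact)  # a compensation commits a new fact
        return fact


def reverse_restock(log, saga_id, sku, qty):
    # Hypothetical compensation for a restock step.
    return log.run_once(
        saga_id,
        "ReverseRestock",
        lambda: {"type": "RestockReversed", "saga_id": saga_id,
                 "sku": sku, "qty": qty},
    )
```

Calling `reverse_restock` twice with the same saga_id publishes one fact, which is what lets an orchestrator retry a compensation safely.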

D. Structural decisions

  • Choreography or orchestration? Why?
  • Correlation ID (saga_id) source?
  • Retry policy per step?
  • Timeout per step?

E. Observability

  • How does support see "refund_saga for order ord_9f2a is stuck at step 2"?
  • What metric signals saga health?
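As a starting point for the observability questions, here is a minimal sketch that assumes the orchestrator persists one record per saga with its current step and last transition time. `SagaRecord`, `stuck_sagas`, and `describe` are hypothetical names, and "count of non-terminal sagas older than a threshold" is only one candidate health metric.

```python
from dataclasses import dataclass

TERMINAL_STATES = {"completed", "compensated", "needs-human"}


@dataclass
class SagaRecord:
    saga_id: str
    order_id: str
    state: str          # "running" or one of TERMINAL_STATES
    current_step: int
    updated_at: float   # epoch seconds of the last state transition


def stuck_sagas(records, now, max_age_seconds):
    """Return non-terminal sagas that have not moved for max_age_seconds."""
    return [
        r for r in records
        if r.state not in TERMINAL_STATES
        and now - r.updated_at > max_age_seconds
    ]


def describe(record):
    # The sentence support would want to see for a stuck saga.
    return (f"refund_saga for order {record.order_id} "
            f"is stuck at step {record.current_step}")
```

`len(stuck_sagas(...))` can be exported as a gauge, and `describe` gives the per-saga view keyed by order.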

Mini Application 2 -- Make a Consumer Idempotent

Take a consumer that today is not idempotent:

def on_payment_captured(event):
    # charges fees and creates a ledger entry
    fees = compute_fees(event)
    charge_merchant_fee(event.merchant_id, fees)  # calls external API
    db.insert("ledger", ...)  # writes ledger
    send_receipt_email(event.customer_id, event.amount)

In 30 minutes, redesign it:

  1. Identify every non-idempotent effect (external calls, DB writes, email sends).
  2. Pick a dedup key for the consumer as a whole.
  3. Pick the dedup store and TTL; justify.
  4. Rewrite the handler so that dedup + effects commit in one transaction.
  5. For each external call, say whether it needs an idempotency key and what the deterministic key would be.
  6. Describe what happens if the broker redelivers the same event 5 seconds after the first attempt, and what happens if it redelivers 5 hours after.
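One way the redesigned handler could look is sketched below, using SQLite from the Python standard library so it runs as-is. Several things are assumptions, not part of the original consumer: the event is a dict with an `event_id` field used as the dedup key, the fee rule is a placeholder, and external effects are written to an outbox table (a swapped-in technique) with deterministic idempotency keys instead of being called inline. The dedup table has no TTL here; choosing one is left to step 3.

```python
import sqlite3


def make_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE ledger "
               "(event_id TEXT, merchant_id TEXT, fee_cents INTEGER)")
    db.execute("CREATE TABLE outbox "
               "(idempotency_key TEXT PRIMARY KEY, kind TEXT, payload TEXT)")
    return db


def on_payment_captured(db, event):
    """Return True if the event was applied, False if it was a duplicate."""
    fee_cents = event["amount_cents"] * 3 // 100  # placeholder fee rule
    eid = event["event_id"]
    try:
        with db:  # one transaction: commit on success, roll back on error
            # Dedup marker first; a duplicate violates the primary key.
            db.execute("INSERT INTO processed_events VALUES (?)", (eid,))
            db.execute("INSERT INTO ledger VALUES (?, ?, ?)",
                       (eid, event["merchant_id"], fee_cents))
            # External effects are recorded, not executed, inside the
            # transaction. A separate sender (not shown) would deliver them,
            # passing the deterministic key to the external API.
            db.execute("INSERT INTO outbox VALUES (?, ?, ?)",
                       (f"fee:{eid}", "charge_merchant_fee", str(fee_cents)))
            db.execute("INSERT INTO outbox VALUES (?, ?, ?)",
                       (f"receipt:{eid}", "send_receipt_email",
                        event["customer_id"]))
    except sqlite3.IntegrityError:
        return False  # already processed; nothing was written
    return True
```

Because the dedup marker, the ledger row, and the outbox rows commit together, a redelivery either finds all of them or none of them. Catching `IntegrityError` broadly is a simplification; a production handler would distinguish the duplicate-event case from other constraint failures.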

Mini Application 3 -- Ordering and Partitioning Drill

A team is building a saga orchestrator that consumes events from four topics:

  • order.placed (partitioned by order_id, 12 partitions)
  • payment.captured (partitioned by order_id, 12 partitions)
  • stock.reserved (partitioned by sku, 8 partitions)
  • shipment.created (partitioned by order_id, 6 partitions)

The orchestrator is occasionally seeing PaymentCaptured before OrderPlaced.

  1. What is the most likely cause?
  2. How does the partitioning choice on stock.reserved (by sku) affect order-level sagas?
  3. What is the fix that restores the orchestrator's ability to reason per order?
  4. What assumptions about retries and redelivery must the orchestrator make given at-least-once delivery?

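(Expected: a broker guarantees order only within one partition of one topic, so nothing orders payment.captured relative to order.placed even though both are keyed by order_id. Inconsistent keying and partition counts across topics make this worse: events for the same order land in different partitions, and different consumers within the orchestrator group handle them independently. Partitioning all saga-relevant topics by order_id with the same partition count makes it possible for one consumer to own every event for an order, but the orchestrator must still tolerate out-of-order arrivals, for example by buffering, and must treat every event as possibly redelivered.)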
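The keying problem in this drill can be explored with a small sketch. It assumes a partitioner of the form hash(key) mod partition count; `zlib.crc32` is a stand-in hash for illustration and is not the hash any particular broker client uses. `TOPICS`, `partition_for`, and `co_partitioned` are hypothetical names.

```python
import zlib

# Topic layout copied from the drill.
TOPICS = {
    "order.placed":     {"key": "order_id", "partitions": 12},
    "payment.captured": {"key": "order_id", "partitions": 12},
    "stock.reserved":   {"key": "sku",      "partitions": 8},
    "shipment.created": {"key": "order_id", "partitions": 6},
}


def partition_for(topic, event):
    """Partition an event would land in under hash(key) % partitions."""
    spec = TOPICS[topic]
    key = event[spec["key"]].encode("utf-8")
    return zlib.crc32(key) % spec["partitions"]  # stand-in hash


def co_partitioned(topic_a, topic_b):
    """True when two topics share both the key field and partition count."""
    a, b = TOPICS[topic_a], TOPICS[topic_b]
    return a["key"] == b["key"] and a["partitions"] == b["partitions"]
```

Under this model only order.placed and payment.captured are co-partitioned. Even for that pair, matching partition numbers say nothing about the relative order of events across the two topics.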

Worked Compensation Trace (Checkout Saga)

Walk through this failure without looking at Concept 11 first. Then compare.

Setup: saga steps are ReserveStock, ChargeCard, CreateShipment. Saga is orchestrated.

Scenario: ReserveStock succeeds. ChargeCard succeeds. CreateShipment fails with CarrierUnavailable.

Write:

  • the compensations in order
  • the facts published by each compensating step
  • the terminal state
  • which steps must be idempotent, and why
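After attempting the trace by hand, a small runner can be used to check the ordering. The sketch below assumes the usual convention that compensations run for completed steps only, in reverse order, and that a failed compensation ends in needs-human. `run_saga` and its log format are hypothetical; it does not model retries, timeouts, or persistence.

```python
def run_saga(steps):
    """Run steps in order; on failure, compensate completed steps in reverse.

    steps: list of (name, action, compensation) tuples, where action and
    compensation are zero-argument callables.
    Returns (terminal_state, log).
    """
    log = []
    done = []  # (name, compensation) for each step that succeeded
    for name, action, compensation in steps:
        try:
            action()
        except Exception as exc:
            log.append(f"failed:{name}:{exc}")
            for done_name, comp in reversed(done):
                try:
                    comp()
                    log.append(f"compensated:{done_name}")
                except Exception:
                    log.append(f"compensation_failed:{done_name}")
                    return "needs-human", log
            return "compensated", log
        log.append(f"done:{name}")
        done.append((name, compensation))
    return "completed", log
```

Feeding it three steps where the third raises produces a log that can be compared line by line with the hand-written trace.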

Evidence Check

This page is complete only when:

  • You designed the refund saga with all failure paths, compensations, and terminal states.
  • You rewrote the non-idempotent consumer with dedup + idempotency keys on external calls.
  • You diagnosed the partitioning/ordering bug and proposed a fix.
  • You walked the checkout saga's "shipment fails" case end-to-end without looking at Concept 11.
  • You can explain in plain English why "exactly once" is a systems-level property, not a broker feature.