
Sagas: Long-Running Transactions Across Services

What This Concept Is

A saga is a sequence of local transactions across multiple services. Each local transaction commits in its own service immediately (no distributed lock). If a later step fails, the saga runs compensating transactions in reverse order to undo the earlier work semantically.

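Sagas exist because distributed ACID (2PC/XA across services and databases) does not scale and is operationally brittle: participants hold locks until the coordinator decides, so one slow or failed participant stalls everyone else. A saga accepts temporary inconsistency in exchange for availability and throughput.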

Core vocabulary:

  • Local transaction: a real ACID transaction inside one service (ReserveStock, ChargeCard).
  • Compensating transaction: the semantic undo of a local transaction (ReleaseStock, RefundCard). Critically, compensations are not rollbacks; the earlier transaction already committed. You publish a new fact that reverses its effect.
  • Saga coordinator: whatever drives the sequence. In a choreography saga, the coordinator is implicit -- each service listens and reacts. In an orchestration saga, an orchestrator (Temporal, Step Functions) holds the state machine explicitly.
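The vocabulary can be pinned down in a few lines of Python. This is a minimal sketch, not a prescribed API: `SagaStep`, `reserve_stock`, and `release_stock` are hypothetical names, and the in-memory `stock` and `reservations` dicts stand in for the inventory service's own database.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical in-memory stand-in for the inventory service's database.
stock: Dict[str, int] = {"sku-1": 5}
reservations: Dict[str, str] = {}  # order_id -> sku


def reserve_stock(order_id: str, sku: str = "sku-1") -> None:
    """Local transaction: commits immediately inside the inventory service."""
    if stock[sku] <= 0:
        raise RuntimeError("out of stock")
    stock[sku] -= 1
    reservations[order_id] = sku


def release_stock(order_id: str) -> None:
    """Compensating transaction: a new committed action, not a rollback."""
    sku = reservations.pop(order_id, None)
    if sku is not None:  # nothing reserved (or already released): do nothing
        stock[sku] += 1


@dataclass(frozen=True)
class SagaStep:
    """Pairs a local transaction with its semantic undo."""
    name: str
    action: Callable[[str], None]
    compensation: Callable[[str], None]


reserve = SagaStep("ReserveStock", reserve_stock, release_stock)
```

Calling `reserve.compensation` twice for the same order leaves the stock count unchanged after the first call, which is the property the rest of this concept relies on.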

Why It Matters Here

Sagas are what you actually do when "a transaction across services" is what you wanted and couldn't have. Every real event-driven system that spans services has sagas, named or not. Getting them right is the central skill of this cluster.

The alternative is invisible coupling and data corruption: half-completed workflows leave orphan state (card charged but no order, stock reserved for a deleted user). Sagas make the compensation story explicit.

Worked Example: Checkout Saga

A checkout workflow has three steps: reserve stock, charge card, create shipment. Each runs in its own service.

Happy path

[saga] --ReserveStock(order)--> [inventory]
[saga] <--StockReserved-- [inventory]
[saga] --ChargeCard(order, amt)--> [payment]
[saga] <--PaymentCaptured-- [payment]
[saga] --CreateShipment(order)--> [shipping]
[saga] <--ShipmentCreated-- [shipping]
[saga] marks order as confirmed

All three steps commit locally; the saga ends.

Failure path: payment fails after stock is reserved

[saga] --ReserveStock(order)--> [inventory]
[saga] <--StockReserved-- [inventory]
[saga] --ChargeCard(order, amt)--> [payment]
[saga] <--PaymentDeclined-- [payment] (failure!)
[saga] runs compensations in reverse:
[saga] --ReleaseStock(order)--> [inventory]
[saga] <--StockReleased-- [inventory]
[saga] marks order as failed, notifies customer

Note the specific things that are NOT happening:

  • the inventory service did not "roll back" -- it performed a new, named, idempotent action (ReleaseStock) that emits StockReleased
  • the payment service did nothing to compensate, because it never charged (its own local transaction aborted cleanly)
  • the saga writes an explicit terminal state; the system of record knows the order is failed, not "stuck"

Failure path: shipment creation fails after charge

[saga] --ReserveStock(order)--> [inventory]
[saga] <--StockReserved-- [inventory]
[saga] --ChargeCard(order, amt)--> [payment]
[saga] <--PaymentCaptured-- [payment]
[saga] --CreateShipment(order)--> [shipping]
[saga] <--ShippingUnavailable-- [shipping] (failure!)
[saga] runs compensations in reverse:
[saga] --RefundCard(order)--> [payment]
[saga] <--PaymentRefunded-- [payment]
[saga] --ReleaseStock(order)--> [inventory]
[saga] <--StockReleased-- [inventory]
[saga] marks order as cancelled, notifies customer

Observations:

  • compensations run in reverse order of the original steps
  • every compensation is idempotent (it may be retried; it must converge to the same result)
  • there is no "rollback" of the payment; there is a refund, which is an independent committed transaction with its own fact
  • if a compensation itself fails, the saga surfaces a manual intervention state -- it does not silently hang
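The observations above can be sketched as a small in-process runner. This is an illustration under stated assumptions, not how Temporal or Step Functions work internally: `run_saga`, `ok`, and `fail` are hypothetical names, each step is a plain function call rather than a network command, and a failed compensation escalates immediately where a real system would probably retry first.

```python
from typing import Callable, List, Tuple

# (step name, action, compensation); all three are illustrative.
Step = Tuple[str, Callable[[str], None], Callable[[str], None]]


def run_saga(order_id: str, steps: List[Step], log: List[str]) -> str:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action(order_id)
        except Exception:
            log.append(f"{name} failed")
            break
        completed.append(step)
        log.append(f"{name} ok")
    else:
        return "committed"  # loop finished without a failure

    # The failed step itself is not compensated: it never committed.
    for name, _, compensate in reversed(completed):
        try:
            compensate(order_id)
            log.append(f"undo {name} ok")
        except Exception:
            log.append(f"undo {name} failed")
            return "needs-human"  # surface it; do not hang silently
    return "compensated"


def ok(_order_id: str) -> None:
    pass


def fail(_order_id: str) -> None:
    raise RuntimeError("ShippingUnavailable")


checkout = [
    ("ReserveStock", ok, ok),
    ("ChargeCard", ok, ok),
    ("CreateShipment", fail, ok),
]
log: List[str] = []
state = run_saga("order-42", checkout, log)
```

Here `state` ends as `"compensated"`, and `log` records the undo of ChargeCard before the undo of ReserveStock, matching the shipment-failure diagram.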

The Two Shapes of Saga

Choreography saga (no central coordinator)

Each service publishes events; other services subscribe and react, including to compensation events.

  • OrderPlaced -> inventory reserves, publishes StockReserved
  • StockReserved -> payment charges, publishes PaymentCaptured or PaymentDeclined
  • PaymentDeclined -> inventory releases the reservation, publishes StockReleased

Pro: no orchestrator to run. Con: the saga logic is splattered across services and the full picture exists nowhere.
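A toy version of this choreography, assuming a synchronous in-process event bus in place of a real broker. The `subscribe` and `publish` helpers and the `card_ok` payload field are hypothetical, invented only for this sketch.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]
subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)
published: List[str] = []  # trace of event names, for inspection


def subscribe(event: str, handler: Handler) -> None:
    subscribers[event].append(handler)


def publish(event: str, payload: Dict) -> None:
    published.append(event)
    for handler in list(subscribers[event]):
        handler(payload)


# Inventory service: reacts to OrderPlaced and to the compensation trigger.
subscribe("OrderPlaced", lambda p: publish("StockReserved", p))
subscribe("PaymentDeclined", lambda p: publish("StockReleased", p))


# Payment service: reacts to StockReserved.
def charge(p: Dict) -> None:
    outcome = "PaymentCaptured" if p["card_ok"] else "PaymentDeclined"
    publish(outcome, p)


subscribe("StockReserved", charge)

# A declined card drives the compensation path with no coordinator involved.
publish("OrderPlaced", {"saga_id": "order-42", "card_ok": False})
```

After the last line, `published` holds OrderPlaced, StockReserved, PaymentDeclined, StockReleased in that order. Even in this sketch the sequence is only visible by reading every subscription, which is the "full picture exists nowhere" cost.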

Orchestration saga (explicit coordinator)

A single workflow (Temporal, Step Functions, Camunda) holds the state machine. It sends commands, waits for replies, and runs compensations declaratively.

  • Pro: the saga is a single piece of code; compensations and retries are first-class.
  • Con: introduces an orchestrator dependency; service APIs must be designed for the orchestrator's command-reply pattern.

The choice follows the tradeoffs of Concept 10. Both are valid sagas.

Non-Negotiable Design Rules

  1. Compensations must be idempotent. They may be retried; they must not "double-refund."
  2. Every step must have a compensation, or be clearly marked "non-compensatable." A sent email cannot be un-sent; the saga must not take such a step until every step that could still fail has committed.
  3. The saga must have a terminal state. Every run ends as committed, compensated, or needs-human. "Pending" forever is a bug.
  4. Correlation IDs on every message. The saga ID travels with every command/event so you can trace a single run end-to-end.
  5. Timeouts at every step. A downstream service that never replies cannot be allowed to freeze the saga. Declare what you do on timeout (retry, compensate, escalate).
  6. Order compensations in reverse. If the original order was A, B, C, compensation is undo C, undo B, undo A.
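Rules 1 and 4 combine naturally: the saga ID that travels with every message can double as the deduplication key for a compensation. A minimal sketch, assuming hypothetical in-memory tables (`refunds_by_saga`, `ledger`) where a real payment service would use its database and perform the check and the write in one local transaction:

```python
from typing import Dict, List

refunds_by_saga: Dict[str, dict] = {}  # hypothetical dedupe table, keyed by saga ID
ledger: List[dict] = []                # append-only record of money movements


def refund_card(saga_id: str, amount_cents: int) -> dict:
    """Compensation for ChargeCard; safe to call any number of times."""
    existing = refunds_by_saga.get(saga_id)
    if existing is not None:
        return existing  # retry: return the earlier result, move no money
    entry = {"saga_id": saga_id, "type": "refund", "amount_cents": amount_cents}
    ledger.append(entry)
    refunds_by_saga[saga_id] = entry
    return entry


first = refund_card("order-42", 1999)
second = refund_card("order-42", 1999)  # simulated retry of the same command
```

The retry returns the same entry and `ledger` still contains exactly one refund, so a redelivered RefundCard command cannot double-refund.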

Common Confusion / Misconception

"A saga is a database transaction." No. A saga is a business-level transaction. Each step is a local transaction; between steps the system is observably inconsistent. That is the tradeoff.

"Compensations guarantee the system returns to the original state." They guarantee a semantically valid final state. If you charged and refunded, the customer saw the charge for a moment. The world (email receipts, card statements, your accountant's ledger) has evidence of both events. Sagas compensate in the business sense, not the physics sense.

"Use 2PC for real consistency." 2PC does not scale across heterogeneous services and does not tolerate participant failure gracefully. Modern distributed systems essentially never use 2PC across service boundaries.

"Choreography sagas are simpler." They are simpler to start; they become much harder to reason about as they grow past 4-5 steps.

"The orchestrator is a single point of failure (SPOF)." The orchestrator's state is durable (Temporal persists to a DB; Step Functions is managed), so a run is not lost when a process crashes. The dependency is real, but the availability story is usually fine.

How To Use It

Saga design checklist:

  1. List the steps in the intended order.
  2. For each step, identify its local transaction (what commits in which service).
  3. For each step, design the compensation as its own named, idempotent operation.
  4. Decide choreography or orchestration using Concept 10's guide.
  5. Pick your correlation ID (usually saga_id or the aggregate ID, e.g., order_id).
  6. Declare timeouts and retries per step.
  7. Define terminal states: committed, compensated, needs-human.
  8. Plan observability: how does an operator see the state of a running saga?
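One way to make the checklist concrete is to write the saga down as data and lint it before any workflow code exists. The sketch below is an assumption-heavy illustration: `StepSpec`, `validate`, the timeout and retry numbers, and the `CancelShipment` and `SendConfirmationEmail` names are all hypothetical. The ordering check is one simple, conservative reading of the non-compensatable rule: nothing compensatable may run after a step that cannot be undone.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StepSpec:
    name: str
    service: str                 # where the local transaction commits
    compensation: Optional[str]  # None marks the step non-compensatable
    timeout_s: float
    max_retries: int


def validate(steps: List[StepSpec]) -> List[str]:
    """Return a list of design problems; an empty list means none were found."""
    problems: List[str] = []
    seen_non_compensatable: Optional[str] = None
    for step in steps:
        if step.timeout_s <= 0:
            problems.append(f"{step.name}: timeout must be positive")
        if seen_non_compensatable and step.compensation is not None:
            problems.append(
                f"{step.name}: runs after non-compensatable {seen_non_compensatable}"
            )
        if step.compensation is None:
            seen_non_compensatable = step.name
    return problems


checkout = [
    StepSpec("ReserveStock", "inventory", "ReleaseStock", 5.0, 3),
    StepSpec("ChargeCard", "payment", "RefundCard", 10.0, 2),
    StepSpec("CreateShipment", "shipping", "CancelShipment", 10.0, 3),
    StepSpec("SendConfirmationEmail", "notifications", None, 5.0, 5),
]
problems = validate(checkout)
```

With the email last, `problems` is empty. Moving `SendConfirmationEmail` to the front makes `validate` flag each of the three steps that would then run after it.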

Check Yourself

  1. Why is a compensation not the same as a rollback?
  2. Walk the checkout saga through the "payment succeeds but shipping fails" case. Which compensations run in what order?
  3. Why does every saga step need to be idempotent, even in orchestration?
  4. Name one operation that is not compensatable and describe how you design around it (e.g., defer it until safe).

Mini Drill or Application

Design a saga for "process a refund": retrieve order -> verify payment -> refund card -> restock item -> notify customer. In 25 minutes:

  1. List each step's local transaction and compensation.
  2. Decide choreography or orchestration. Justify.
  3. Walk through a failure between saga step 3 (refund card) and saga step 4 (restock item).
  4. Identify the non-compensatable step(s) and where they must be ordered.
  5. Specify the correlation ID and terminal states.

Transfer to Adjacent Domains

  • Idempotency (Concept 12). Every saga step retries; every compensation may run twice. Idempotency is a hard prerequisite, not a design goal. The saga's correctness proof assumes idempotent participants.
  • Outbox (Concept 06). Sagas are only as reliable as the events that glue them together. A participant without an outbox silently corrupts choreographed sagas (state changes, event never published) and poisons orchestrated sagas (orchestrator times out waiting for completion).
  • Domain-Driven Design (S7M3). Sagas are the canonical way to coordinate across aggregate boundaries without distributed locks. "One aggregate per local transaction; saga across aggregates" is DDD's standard consistency architecture.
  • Compliance / audit. Every compensation is a new fact (refund, release, reverse) with its own audit trail. Accountants prefer this -- "we charged and refunded" is easier to explain than "we rolled back." Design compensations with the auditor's lens in mind.
  • Customer support UX. Pending-state sagas are the source of "order stuck" customer tickets. The terminal-state rule (committed / compensated / needs-human) is customer support's ally; "pending forever" is their enemy.

Read This Only If Stuck