Sagas: Long-Running Transactions Across Services
What This Concept Is
A saga is a sequence of local transactions across multiple services. Each local transaction commits in its own service immediately (no distributed lock). If a later step fails, the saga runs compensating transactions in reverse order to undo the earlier work semantically.
Sagas exist because distributed ACID (2PC/XA across services and databases) does not scale and is operationally brittle. A saga accepts temporary inconsistency in exchange for availability and throughput.
Core vocabulary:
- Local transaction: a real ACID transaction inside one service (ReserveStock, ChargeCard).
- Compensating transaction: the semantic undo of a local transaction (ReleaseStock, RefundCard). Critically, compensations are not rollbacks; the earlier transaction already committed. You publish a new fact that reverses its effect.
- Saga coordinator: whatever drives the sequence. In a choreography saga, the coordinator is implicit -- each service listens and reacts. In an orchestration saga, an orchestrator (Temporal, Step Functions) holds the state machine explicitly.
Why It Matters Here
Sagas are what you actually do when "a transaction across services" is what you wanted and couldn't have. Every real event-driven system that spans services has sagas, named or not. Getting them right is the central skill of this cluster.
The alternative is invisible coupling and data corruption: half-completed workflows leave orphan state (card charged but no order, stock reserved for a deleted user). Sagas make the compensation story explicit.
Worked Example: Checkout Saga
A checkout workflow: reserve stock, charge card, create shipment. Each in its own service.
Happy path
[saga] --ReserveStock(order)--> [inventory]
<--StockReserved---
[saga] --ChargeCard(order, amt)--> [payment]
<--PaymentCaptured---
[saga] --CreateShipment(order)--> [shipping]
<--ShipmentCreated---
[saga] marks order as confirmed
All three steps commit locally; the saga ends.
Failure path: payment fails after stock is reserved
[saga] --ReserveStock(order)--> [inventory]
<--StockReserved---
[saga] --ChargeCard(order, amt)--> [payment]
<--PaymentDeclined--- (failure!)
[saga] runs compensations in reverse:
[saga] --ReleaseStock(order)--> [inventory]
<--StockReleased---
[saga] marks order as failed, notifies customer
Note what is and is not happening:
- the inventory service did not "roll back" -- it performed a new, named, idempotent action (ReleaseStock) that emits StockReleased
- the payment service did nothing to compensate, because it never charged (its own local transaction aborted cleanly)
- the saga writes an explicit terminal state; the system of record knows the order is failed, not "stuck"
Failure path: shipment creation fails after charge
[saga] --ReserveStock(order)--> [inventory]
<--StockReserved---
[saga] --ChargeCard(order, amt)--> [payment]
<--PaymentCaptured---
[saga] --CreateShipment(order)--> [shipping]
<--ShippingUnavailable--- (failure!)
[saga] runs compensations in reverse:
[saga] --RefundCard(order)--> [payment]
<--PaymentRefunded---
[saga] --ReleaseStock(order)--> [inventory]
<--StockReleased---
[saga] marks order as cancelled, notifies customer
Observations:
- compensations run in reverse order of the original steps
- every compensation is idempotent (it may be retried; it must converge to the same result)
- there is no "rollback" of the payment; there is a refund, which is an independent committed transaction with its own fact
- if a compensation itself fails, the saga surfaces a manual intervention state -- it does not silently hang
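The shipping-failure path above can be sketched as a tiny in-process orchestrator. This is a minimal illustration, not a production design: the Step, StepFailed, run_saga, and record names are hypothetical, the "services" are plain functions that append to a log, and ShipmentCancelled is an assumed event name that the walkthrough never needs.

```python
from dataclasses import dataclass
from typing import Callable, List


class StepFailed(Exception):
    """Raised by a step whose local transaction did not commit."""


@dataclass
class Step:
    name: str
    action: Callable[[str], None]        # the local transaction
    compensation: Callable[[str], None]  # its semantic undo


def run_saga(order_id: str, steps: List[Step]) -> str:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action(order_id)
        except StepFailed:
            # The failed step committed nothing, so only earlier steps are undone.
            for done in reversed(completed):
                try:
                    done.compensation(order_id)
                except Exception:
                    return "needs-human"  # surface the failure; never hang silently
            return "compensated"
        completed.append(step)
    return "committed"


log: List[str] = []


def record(event: str) -> Callable[[str], None]:
    """Stand-in for a service call: just records the fact it would publish."""
    return lambda order_id: log.append(event)


def shipping_unavailable(order_id: str) -> None:
    raise StepFailed("ShippingUnavailable")


checkout = [
    Step("ReserveStock", record("StockReserved"), record("StockReleased")),
    Step("ChargeCard", record("PaymentCaptured"), record("PaymentRefunded")),
    Step("CreateShipment", shipping_unavailable, record("ShipmentCancelled")),
]

outcome = run_saga("order-42", checkout)
```

With this setup the log ends as StockReserved, PaymentCaptured, PaymentRefunded, StockReleased, and the run reports the terminal state compensated. A real orchestrator would also persist its progress so that a crash mid-compensation can resume.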
The Two Shapes of Saga
Choreography saga (no central coordinator)
Each service publishes events; other services subscribe and react, including to compensation events.
- OrderPlaced -> inventory reserves, publishes StockReserved
- StockReserved -> payment charges, publishes PaymentCaptured or PaymentDeclined
- PaymentDeclined -> inventory subscribes, publishes StockReleased
- Pro: no orchestrator to run.
- Con: the saga logic is scattered across services and the full picture exists nowhere.
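A choreographed version of the payment-declined path might look like the sketch below. The EventBus class is a hypothetical in-memory stand-in for a broker with synchronous delivery, and the "decline anything over 100" rule is invented purely to trigger the failure branch.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List, Set

Handler = Callable[[Dict], None]


class EventBus:
    """In-memory stand-in for a broker; delivery is synchronous here."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.trace: List[str] = []  # every event type, in publish order

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: Dict) -> None:
        self.trace.append(event_type)
        for handler in self.handlers[event_type]:
            handler(payload)


bus = EventBus()
reserved: Set[str] = set()  # inventory's local state, keyed by order_id


def inventory_on_order_placed(evt: Dict) -> None:
    reserved.add(evt["order_id"])
    bus.publish("StockReserved", evt)


def payment_on_stock_reserved(evt: Dict) -> None:
    # Hypothetical decline rule, only here to exercise the failure branch.
    if evt["amount"] > 100:
        bus.publish("PaymentDeclined", evt)
    else:
        bus.publish("PaymentCaptured", evt)


def inventory_on_payment_declined(evt: Dict) -> None:
    reserved.discard(evt["order_id"])  # discard is safe to repeat
    bus.publish("StockReleased", evt)


bus.subscribe("OrderPlaced", inventory_on_order_placed)
bus.subscribe("StockReserved", payment_on_stock_reserved)
bus.subscribe("PaymentDeclined", inventory_on_payment_declined)

bus.publish("OrderPlaced", {"order_id": "order-42", "amount": 250})
```

No single function here describes the whole saga; the sequence only becomes visible in bus.trace after the fact, which is the "full picture exists nowhere" cost in miniature.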
Orchestration saga (explicit coordinator)
A single workflow (Temporal, Step Functions, Camunda) holds the state machine. It sends commands, waits for replies, and runs compensations declaratively.
- Pro: the saga is a single piece of code; compensations and retries are first-class.
- Con: introduces an orchestrator dependency; service APIs must be designed for the orchestrator's command-reply pattern.
The choice between them follows the tradeoffs in Concept 10. Both are valid sagas.
Non-Negotiable Design Rules
- Compensations must be idempotent. They may be retried; they must not "double-refund."
- Every step must have a compensation, or be clearly marked "non-compensatable." A sent email cannot be un-sent, so the saga must defer that step until every step that can still fail has committed.
- The saga must have a terminal state. Every run ends as committed, compensated, or needs-human. "Pending" forever is a bug.
- Correlation IDs on every message. The saga ID travels with every command/event so you can trace a single run end-to-end.
- Timeouts at every step. A downstream service that never replies cannot be allowed to freeze the saga. Declare what you do on timeout (retry, compensate, escalate).
- Order compensations in reverse. If the original order was A, B, C, compensation is undo C, undo B, undo A.
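The idempotency and correlation-ID rules can be made concrete with a toy participant. This is a sketch under assumptions: PaymentService and its in-memory fields are hypothetical, and a real service would keep this deduplication state in its own database, written in the same local transaction as the money movement.

```python
from typing import Dict, Set


class PaymentService:
    """Toy payment participant; amounts and IDs are illustrative only."""

    def __init__(self) -> None:
        self.captured: Dict[str, int] = {}  # saga_id -> amount captured
        self.refunded: Set[str] = set()     # saga_ids already refunded
        self.total_refunded = 0

    def charge_card(self, saga_id: str, amount: int) -> str:
        # A retried ChargeCard for the same saga_id charges only once.
        if saga_id not in self.captured:
            self.captured[saga_id] = amount
        return "PaymentCaptured"

    def refund_card(self, saga_id: str) -> str:
        # Money moves only on the first refund of a captured charge;
        # repeats (and refunds with nothing captured) change nothing.
        if saga_id in self.captured and saga_id not in self.refunded:
            self.total_refunded += self.captured[saga_id]
            self.refunded.add(saga_id)
        return "PaymentRefunded"


payments = PaymentService()
payments.charge_card("saga-1", 50)
payments.charge_card("saga-1", 50)       # duplicate command delivery
first = payments.refund_card("saga-1")
second = payments.refund_card("saga-1")  # retried compensation
```

Both refund calls return the same reply, and total_refunded stays at 50. The saga_id doubles as the deduplication key and the correlation ID, which is one reason to put it on every message.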
Common Confusion / Misconception
"A saga is a database transaction." No. A saga is a business-level transaction. Each step is a local transaction; between steps the system is observably inconsistent. That is the tradeoff.
"Compensations guarantee the system returns to the original state." They guarantee a semantically valid final state. If you charged and refunded, the customer saw the charge for a moment. The world (email receipts, card statements, your accountant's ledger) has evidence of both events. Sagas compensate in the business sense, not the physics sense.
"Use 2PC for real consistency." 2PC does not scale across heterogeneous services and does not tolerate participant failure gracefully. Modern distributed systems essentially never use 2PC across service boundaries.
"Choreography sagas are simpler." They are simpler to start; they become much harder to reason about as they grow past 4-5 steps.
"The orchestrator is a SPOF." The orchestrator's state is durable (Temporal persists to a DB; Step Functions is managed). The dependency is real but the availability story is usually fine.
How To Use It
Saga design checklist:
- List the steps in the intended order.
- For each step, identify its local transaction (what commits in which service).
- For each step, design the compensation as its own named, idempotent operation.
- Decide choreography or orchestration using Concept 10's guide.
- Pick your correlation ID (usually saga_id or the aggregate ID, e.g., order_id).
- Declare timeouts and retries per step.
- Define terminal states: committed, compensated, needs-human.
- Plan observability: how does an operator see the state of a running saga?
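One way to keep the checklist honest is to write the saga down as data and lint it. The sketch below is hypothetical throughout: StepSpec, validate, the timeout values, and the CancelShipment and SendConfirmationEmail names are all assumptions made for illustration. It also simplifies the ordering rule to "non-compensatable steps must come last", treating every compensatable step as one that can still fail.

```python
from dataclasses import dataclass
from typing import List, Optional

ON_TIMEOUT = {"retry", "compensate", "escalate"}


@dataclass(frozen=True)
class StepSpec:
    name: str                    # command the step sends
    service: str                 # where the local transaction commits
    compensation: Optional[str]  # None marks the step non-compensatable
    timeout_s: float
    max_retries: int
    on_timeout: str              # one of ON_TIMEOUT


def validate(steps: List[StepSpec]) -> List[str]:
    """Return checklist violations; an empty list means none were found."""
    problems: List[str] = []
    after_point_of_no_return = False
    for step in steps:
        if step.timeout_s <= 0:
            problems.append(f"{step.name}: no timeout declared")
        if step.on_timeout not in ON_TIMEOUT:
            problems.append(f"{step.name}: unknown timeout policy {step.on_timeout!r}")
        if after_point_of_no_return and step.compensation is not None:
            # Simplification: any compensatable step is assumed able to fail.
            problems.append(f"{step.name}: runs after a non-compensatable step")
        if step.compensation is None:
            after_point_of_no_return = True
    return problems


reserve = StepSpec("ReserveStock", "inventory", "ReleaseStock", 5.0, 3, "retry")
charge = StepSpec("ChargeCard", "payment", "RefundCard", 10.0, 2, "compensate")
ship = StepSpec("CreateShipment", "shipping", "CancelShipment", 30.0, 3, "escalate")
email = StepSpec("SendConfirmationEmail", "notifications", None, 5.0, 3, "escalate")

ok = validate([reserve, charge, ship, email])   # email last: no violations
bad = validate([reserve, email, charge, ship])  # email too early: two violations
```

The same spec could feed an operator dashboard, since it names every step, its owner, and what happens on timeout.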
Check Yourself
- Why is a compensation not the same as a rollback?
- Walk the checkout saga through the "payment succeeds but shipping fails" case. Which compensations run in what order?
- Why does every saga step need to be idempotent, even in orchestration?
- Name one operation that is not compensatable and describe how you design around it (e.g., defer it until safe).
Mini Drill or Application
Design a saga for "process a refund": retrieve order -> verify payment -> refund card -> restock item -> notify customer. In 25 minutes:
- List each step's local transaction and compensation.
- Decide choreography or orchestration. Justify.
- Walk through a failure between step 3 and step 4.
- Identify the non-compensatable step(s) and where they must be ordered.
- Specify the correlation ID and terminal states.
Transfer to Adjacent Domains
- Idempotency (Concept 12). Every saga step retries; every compensation may run twice. Idempotency is a hard prerequisite, not a design goal. The saga's correctness proof assumes idempotent participants.
- Outbox (Concept 06). Sagas are only as reliable as the events that glue them together. A participant without an outbox silently corrupts choreographed sagas (state changes, event never published) and poisons orchestrated sagas (orchestrator times out waiting for completion).
- Domain-Driven Design (S7M3). Sagas are the canonical way to coordinate across aggregate boundaries without distributed locks. "One aggregate per local transaction; saga across aggregates" is DDD's standard consistency architecture.
- Compliance / audit. Every compensation is a new fact (refund, release, reverse) with its own audit trail. Accountants prefer this -- "we charged and refunded" is easier to explain than "we rolled back." Design compensations with the auditor's lens in mind.
- Customer support UX. Pending-state sagas are the source of "order stuck" customer tickets. The terminal-state rule (committed/compensated/needs-human) is customer support's ally; "pending forever" is their enemy.
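The outbox dependency noted above can be sketched with the standard-library sqlite3 module. The table layout and the reserve_stock and relay names are assumptions for illustration, not the schema from Concept 06; the point is only that the state change and the event row commit in one local transaction, and a separate relay publishes afterwards.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (sku TEXT PRIMARY KEY, reserved INTEGER NOT NULL)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "saga_id TEXT NOT NULL, event_type TEXT NOT NULL, payload TEXT NOT NULL, "
    "published INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO stock VALUES ('sku-1', 0)")
conn.commit()


def reserve_stock(saga_id: str, sku: str, qty: int) -> None:
    # `with conn` commits both writes together or rolls both back,
    # so the state never changes without its event being recorded.
    with conn:
        conn.execute(
            "UPDATE stock SET reserved = reserved + ? WHERE sku = ?", (qty, sku)
        )
        conn.execute(
            "INSERT INTO outbox (saga_id, event_type, payload) VALUES (?, ?, ?)",
            (saga_id, "StockReserved", json.dumps({"sku": sku, "qty": qty})),
        )


def relay(publish) -> int:
    """Publish unpublished outbox rows in order; returns how many were sent."""
    rows = conn.execute(
        "SELECT id, saga_id, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, saga_id, event_type, payload in rows:
        publish(saga_id, event_type, json.loads(payload))
        # A crash between publish and this update causes a re-publish,
        # which is why consumers must be idempotent.
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


sent = []
reserve_stock("saga-1", "sku-1", 2)
first_pass = relay(lambda saga_id, event_type, body: sent.append((saga_id, event_type, body)))
second_pass = relay(lambda saga_id, event_type, body: sent.append((saga_id, event_type, body)))
```

The relay gives at-least-once delivery, not exactly-once, which ties the outbox back to the idempotency prerequisite in the first bullet.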
Read This Only If Stuck
- Richards & Ford: Mediator Topology -- orchestrator-driven sagas in mediator style
- Richards & Ford: Event-Driven Architecture Style -- broker-topology choreography that underlies choreographed sagas
- Richards & Ford: Preventing Data Loss -- durability considerations for the saga's intermediate facts
- System Design Primer: Consistency patterns -- eventual-consistency framing that sagas live inside
- System Design Primer: Availability patterns -- how sagas trade strict consistency for availability by avoiding 2PC
- Microservices.io: Saga pattern -- canonical pattern catalog, both variants with pros/cons
- AWS: Saga pattern (prescriptive guidance) -- structured treatment with compensating-transaction guidance
- AWS: Implement serverless saga with Step Functions -- concrete implementation recipe
- Temporal: Saga pattern made easy -- code-first orchestration variant
- Temporal: Compensating actions -- how compensations are registered and run in a real orchestrator
- Caitie McCaffrey: Applying the Saga Pattern -- classic talk reviving the 1987 saga paper for modern microservices
- Garcia-Molina & Salem: SAGAS (1987) -- original paper; optional deep dive