Sagas: Long-Running Transactions with Compensations
What This Concept Is
A saga is a sequence of local transactions, each committed at a different service, tied together by compensating actions that undo the effect of earlier steps if a later step fails. The term comes from Garcia-Molina and Salem's 1987 paper, which addressed long-lived transactions that would hold database locks for hours if run as a single transaction.
Structure:
- A saga has forward steps T1, T2, ..., Tn, each a local ACID transaction at some service.
- Each step Ti has a compensating action Ci, also a local transaction, whose semantic effect undoes Ti.
- Execution runs forward steps in order. If any step fails, compensations C_{i-1}, C_{i-2}, ..., C_1 run in reverse order.
- Compensations must be semantically correct, not necessarily syntactic inverses. Refunding a credit card payment is a compensation for charging it, even though the two records are both positive entries in a ledger.
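The structure above can be sketched as a small in-process runner. This is a minimal illustration, not a production coordinator: the names `Step` and `run_saga` are hypothetical, steps are plain callables rather than remote calls, and nothing is persisted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]       # forward transaction Ti
    compensate: Callable[[], None]   # compensating action Ci


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # Ti failed, so only T1..T(i-1) committed: run C_{i-1} ... C_1.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True
```

Note that the failing step's own compensation is not run: the sketch assumes a failed local transaction rolled itself back.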
Two coordination styles:
- Orchestration: a central saga coordinator drives the sequence, calling each step and handling compensation on failure. Clearer flow, single point of reasoning, but a new component.
- Choreography: each service reacts to events, publishing new events on completion. No central coordinator. More decoupled but harder to reason about and debug.
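To make the choreography style concrete, here is a toy sketch in which services only subscribe to and publish events. Everything in it is an assumption for illustration: the `EventBus` class, the event names, and the hypothetical flight/hotel/payment flow. The bus is synchronous and in-memory, where a real system would use a durable broker.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class EventBus:
    """Toy synchronous bus that records every published event."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, event: str, handler: Handler) -> None:
        self.handlers[event].append(handler)

    def publish(self, event: str, payload: Dict) -> None:
        self.history.append(event)
        for handler in list(self.handlers[event]):
            handler(payload)


def wire_services(bus: EventBus, card_ok: bool) -> None:
    # Flight service reacts to the trip request.
    bus.subscribe("TripRequested", lambda p: bus.publish("FlightReserved", p))
    # Hotel service reacts to the flight reservation.
    bus.subscribe("FlightReserved", lambda p: bus.publish("HotelReserved", p))
    # Payment service charges the card or announces the failure.
    bus.subscribe(
        "HotelReserved",
        lambda p: bus.publish("CardCharged" if card_ok else "PaymentFailed", p),
    )
    # Compensations are event reactions too, chained in reverse order.
    bus.subscribe("PaymentFailed", lambda p: bus.publish("HotelReleased", p))
    bus.subscribe("HotelReleased", lambda p: bus.publish("FlightReleased", p))
```

No single function in this sketch describes the whole workflow; the sequence only becomes visible in `bus.history`, which hints at why choreographed sagas are harder to follow and debug.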
Sagas do not provide isolation. Intermediate states are visible. They are essentially "ACID per step, BASE across steps."
Why It Matters Here
When you cannot use 2PC (cross-organization, cross-cloud, long-lived) but still need cross-service business correctness, saga is the pattern. It shows up in every order-processing, travel-booking, payment-and-shipment, and onboarding workflow in a microservice architecture. Knowing how to design compensations correctly is the difference between shipping a reliable workflow and shipping an eventual-bug-farm.
Concrete Example
Trip booking: reserve flight, reserve hotel, charge credit card. Each is a separate service.
Forward steps:
- T1: reserve flight seat.
- T2: reserve hotel room.
- T3: charge credit card.
Compensations:
- C1: release flight reservation.
- C2: release hotel reservation.
- C3: refund card (in practice: issue credit; the original charge record stays but is canceled by the refund).
Happy path
T1 -> T2 -> T3 -> done.
Failure mid-saga
T1 -> T2 -> T3 fails (card declined). Run C2 -> C1. Trip is not booked; hotel and flight released.
Tricky case: failure during compensation
T1 -> T2 -> T3 fails. Compensate: C2 fails (hotel service down). Now the saga must retry C2 until it succeeds, or escalate to a manual review queue. Compensations must be idempotent and eventually succeed. If a compensation can genuinely fail forever, the saga needs a human escape hatch.
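One possible shape for "retry, then escalate" is sketched below. The function name `compensate_with_retry`, its parameters, and the list standing in for a manual review queue are all hypothetical; the sketch assumes the compensation is idempotent, so calling it again after a partial failure is safe.

```python
import time
from typing import Callable, List


def compensate_with_retry(
    compensate: Callable[[], None],
    manual_queue: List[str],
    saga_id: str,
    max_attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry an idempotent compensation; escalate to humans if it never succeeds."""
    for attempt in range(max_attempts):
        try:
            compensate()
            return True
        except Exception:
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))  # exponential backoff
    manual_queue.append(saga_id)  # human escape hatch
    return False
```

The `sleep` parameter is injectable only so the backoff can be exercised in tests without waiting; a durable orchestrator would more likely schedule the retry than block a thread.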
Isolation caveat
Another user might see the intermediate state, where the flight is reserved but the card is not yet charged, and act on it. If a second saga reads the reserved seat, it might conclude the plane is full, even though the first saga may still fail and release the seat. This is the saga's lack of isolation. Mitigations: "semantic locks" (flag the reservation as pending so other sagas see it is not final), or compensation-aware reads.
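A semantic lock can be as simple as an explicit pending state on the record. The sketch below is one hypothetical way to model it; `SeatInventory`, `SeatState`, and the method names are assumptions, and the in-memory dictionaries stand in for rows in the flight service's database.

```python
from enum import Enum
from typing import Dict, List, Optional


class SeatState(Enum):
    FREE = "free"
    PENDING = "pending"      # semantic lock: held by an in-flight saga
    CONFIRMED = "confirmed"


class SeatInventory:
    def __init__(self, seats: List[str]) -> None:
        self.state: Dict[str, SeatState] = {s: SeatState.FREE for s in seats}
        self.holder: Dict[str, Optional[str]] = {s: None for s in seats}

    def reserve(self, seat: str, saga_id: str) -> bool:
        """Forward step: take the seat, but mark it as not yet final."""
        if self.state[seat] is not SeatState.FREE:
            return False
        self.state[seat] = SeatState.PENDING
        self.holder[seat] = saga_id
        return True

    def confirm(self, seat: str, saga_id: str) -> None:
        """Called once the owning saga has completed."""
        if self.holder[seat] == saga_id and self.state[seat] is SeatState.PENDING:
            self.state[seat] = SeatState.CONFIRMED

    def release(self, seat: str, saga_id: str) -> None:
        """Compensation: idempotent, and only the holder can release."""
        if self.holder[seat] == saga_id and self.state[seat] is SeatState.PENDING:
            self.state[seat] = SeatState.FREE
            self.holder[seat] = None

    def availability(self) -> Dict[str, int]:
        """Compensation-aware read: pending seats are reported separately."""
        counts = {s.value: 0 for s in SeatState}
        for st in self.state.values():
            counts[st.value] += 1
        return counts
```

A reader that sees pending seats can choose to wait or retry instead of concluding the plane is full.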
Common Confusion / Misconception
"Sagas are atomic." They are not atomic. They are eventually consistent workflows that either run to completion (forward) or are semantically undone (via compensations). Intermediate states are visible.
"Every step's compensation is just its inverse." Semantically yes, syntactically often no. Refunding a charge is not an UPDATE that reverts a balance to a prior value; it is a new ledger entry. You cannot un-send a confirmation email; the compensation is a follow-up email.
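The ledger point can be illustrated with an append-only structure in which a refund is a second entry that refers to the charge. This is a sketch under assumed names (`Ledger`, `Entry`, `refers_to`), not a model of any real payment system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    entry_id: str
    kind: str                        # "charge" or "refund"
    amount_cents: int                # always positive; kind carries the direction
    refers_to: Optional[str] = None  # for refunds: the charge being compensated


class Ledger:
    def __init__(self) -> None:
        self.entries: List[Entry] = []

    def charge(self, entry_id: str, amount_cents: int) -> None:
        self.entries.append(Entry(entry_id, "charge", amount_cents))

    def refund(self, charge_id: str) -> None:
        """Compensation: append a refund entry; never edit or delete the charge."""
        for e in self.entries:
            if e.kind == "refund" and e.refers_to == charge_id:
                return  # already refunded, so a retry is a no-op
        charge = next(
            e for e in self.entries if e.kind == "charge" and e.entry_id == charge_id
        )
        self.entries.append(
            Entry(f"refund-{charge_id}", "refund", charge.amount_cents, charge_id)
        )

    def balance_cents(self) -> int:
        return sum(
            e.amount_cents if e.kind == "charge" else -e.amount_cents
            for e in self.entries
        )
```

After a charge and its refund the balance nets to zero, yet both records remain, which is what "semantic, not syntactic, inverse" means here.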
"Orchestration is always better than choreography." Orchestration is easier to reason about but creates a coordinator. Choreography is more loosely coupled but hard to follow when things go wrong. Teams routinely underestimate the observability cost of choreographed sagas.
"Sagas replace transactions." Within one service, still use a local ACID transaction. Sagas coordinate between services; they are not a replacement for local correctness.
How To Use It
Designing a saga:
- List forward steps T1 ... Tn, each a single-service local transaction.
- For each, write its compensation Ci. If you cannot define a semantic compensation, that step cannot be in a saga.
- Make every step and every compensation idempotent: retried messages cannot cause duplicate work or double compensation.
- Make every saga state transition durable at the orchestrator (or in a per-service event log).
- Define the failure policy for compensation itself: retry with backoff, then human queue.
- Define the semantic locks so other sagas see pending state and do not double-book.
- Write runbooks for the stuck-saga cases.
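Two items from the checklist, idempotency and durable state, can be sketched together with a small step log. The table name `saga_log` and the helper functions are hypothetical. The sketch uses Python's built-in `sqlite3` module and, for brevity, runs the action and records it in separate steps; a real service would write the idempotency record in the same local transaction as the step's effect.

```python
import sqlite3
from typing import Callable, List


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS saga_log ("
        "saga_id TEXT, step TEXT, status TEXT, PRIMARY KEY (saga_id, step))"
    )
    return conn


def run_once(
    conn: sqlite3.Connection, saga_id: str, step: str, action: Callable[[], None]
) -> bool:
    """Run `action` unless this (saga_id, step) is already recorded as done."""
    row = conn.execute(
        "SELECT status FROM saga_log WHERE saga_id = ? AND step = ?",
        (saga_id, step),
    ).fetchone()
    if row is not None and row[0] == "done":
        return False  # duplicate delivery: skip the work
    action()
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT OR REPLACE INTO saga_log VALUES (?, ?, ?)",
            (saga_id, step, "done"),
        )
    return True


def steps_to_compensate(
    conn: sqlite3.Connection, saga_id: str, order: List[str]
) -> List[str]:
    """After a crash or failure: which completed steps to undo, newest first."""
    rows = conn.execute(
        "SELECT step FROM saga_log WHERE saga_id = ? AND status = 'done'",
        (saga_id,),
    )
    done = {r[0] for r in rows}
    return [s for s in reversed(order) if s in done]
```

Because the log survives a restart when `path` points at a file, a recovering orchestrator can call `steps_to_compensate` to learn where the saga stopped.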
Check Yourself
- Why is a saga not atomic?
- What property must a compensation have to handle retries safely?
- Name a step whose effect cannot be compensated, and explain why a saga cannot include it without a workaround.
- When would you pick orchestration over choreography?
- What is a "semantic lock" and why might a saga need one?
Mini Drill or Application
For each workflow, draft the saga: list forward steps and their compensations, and note any step that resists compensation.
- E-commerce checkout: reserve inventory, charge card, create shipment, send receipt email.
- Insurance claim: open claim, request documents, disburse payment, close claim.
- User signup: create user record, create auth credentials, provision default workspace, send welcome email.
- Payroll: compute amounts, initiate bank transfers across many recipients, update ledger.
- Ride-share: match driver, start trip, process payment, rate driver.
Read This Only If Stuck
- DDIA: Distributed transactions in practice (part 1)
- DDIA: Distributed transactions in practice (part 2)
- Database Internals: Coordination avoidance
- Database System Concepts: Commit protocols (part 5)
- External: Garcia-Molina & Salem, "Sagas" (Princeton, 1987)
- External: Chris Richardson's Microservices.io: Saga pattern