The Outbox Pattern: Atomically Publishing Events
What This Concept Is
You have a service that writes to its database and publishes an event. The naive implementation:
def place_order(order):
    db.insert("orders", order)              # (1) DB write
    broker.publish("order.placed", order)   # (2) broker publish
This is the dual-write bug. There is no atomic boundary around (1) and (2):
- if the process crashes between (1) and (2), the order is in the DB but nobody knows
- if the publish in (2) fails or times out after (1) has succeeded, the outcome is the same
- if you swap the order (publish first, then DB), a crash leaves an event with no underlying state; downstream systems act on a phantom
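The first two failure modes can be reproduced with in-memory stand-ins. This is only a sketch: `FakeBroker` and `place_order_dual_write` are hypothetical names, a Python list plays the role of the database, and a raised `ConnectionError` simulates either a crash or a flaky broker at step (2).

```python
class FakeBroker:
    """In-memory stand-in for a message broker (hypothetical, for illustration)."""

    def __init__(self, fail=False):
        self.fail = fail
        self.published = []

    def publish(self, topic, message):
        if self.fail:
            raise ConnectionError("broker unreachable")
        self.published.append((topic, message))


def place_order_dual_write(orders, broker, order):
    orders.append(order)                    # (1) the "DB write" is already durable
    broker.publish("order.placed", order)   # (2) may fail or never run


orders = []                      # stand-in for the orders table
broker = FakeBroker(fail=True)   # simulate the failure at step (2)
try:
    place_order_dual_write(orders, broker, {"id": "o-1"})
except ConnectionError:
    pass  # the caller sees an error, but step (1) is not undone

print(len(orders), len(broker.published))  # 1 0
```

The order exists and no event was ever emitted, which is exactly the state that nothing downstream can detect.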
The outbox pattern fixes this by making event publication part of the same database transaction as the state change.
BEGIN;
INSERT INTO orders (...);
INSERT INTO outbox (event_id, event_type, payload, created_at) VALUES (...);  -- published_at is left NULL
COMMIT;
A separate relay process then:
- reads unpublished rows from the outbox table
- publishes each to the broker
- marks it published_at = now() (or deletes it)
Because the state change and the outbox insert live in one transaction, either both happen or neither does. The publish itself is at-least-once (the relay may retry), but the intent to publish is atomic with the state change.
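The both-or-neither property can be checked in a few lines. The sketch below uses Python's built-in sqlite3 module as a stand-in for the service database. The simplified table layout and the `fail_before_commit` flag are assumptions made for the demonstration, not part of the pattern.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL)")
conn.execute(
    "CREATE TABLE outbox ("
    " event_id TEXT PRIMARY KEY,"
    " event_type TEXT NOT NULL,"
    " payload TEXT NOT NULL,"
    " published_at TEXT NULL)"
)
conn.commit()


def place_order(conn, order_id, total_cents, fail_before_commit=False):
    event = {
        "event_id": str(uuid.uuid4()),
        "event_type": "OrderPlaced",
        "order_id": order_id,
        "total_cents": total_cents,
    }
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox (event_id, event_type, payload) VALUES (?, ?, ?)",
            (event["event_id"], event["event_type"], json.dumps(event)),
        )
        if fail_before_commit:
            raise RuntimeError("simulated crash before COMMIT")


def counts(conn):
    n_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    n_outbox = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
    return n_orders, n_outbox


place_order(conn, "o-1", 1999)
try:
    place_order(conn, "o-2", 500, fail_before_commit=True)
except RuntimeError:
    pass

print(counts(conn))  # (1, 1)
```

The failed call leaves neither an order row nor an outbox row, so the two tables can never disagree about whether an event is owed.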
Why It Matters Here
Every pattern in the rest of this module assumes events are reliably emitted. Without the outbox (or an equivalent -- CDC from the DB's WAL, transactional messaging from the broker, event sourcing which sidesteps the problem entirely):
- pub-sub is unreliable at the producer boundary
- sagas miss compensating events and stall
- projections miss state changes and drift
- idempotency on consumers cannot rescue data that was never published
The outbox pattern is the workhorse that makes event-driven architecture trustworthy without giving up a regular database.
Concrete Example
Outbox schema
CREATE TABLE outbox (
    event_id      UUID PRIMARY KEY,
    aggregate_id  TEXT NOT NULL,          -- e.g. order_id for ordering
    event_type    TEXT NOT NULL,
    payload       JSONB NOT NULL,
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    published_at  TIMESTAMPTZ NULL
);

CREATE INDEX outbox_unpublished ON outbox (created_at)
    WHERE published_at IS NULL;
Producer (the service doing real work)
import json
from uuid import uuid4

def place_order(order):
    event = {
        "event_id": str(uuid4()),   # str() so json.dumps can serialize it
        "event_type": "OrderPlaced",
        "order_id": order.id,
        "total_cents": order.total_cents,
        # ... (further fields elided)
    }
    # One transaction: the order row and its outbox row commit or roll back together.
    with db.transaction():
        db.execute("INSERT INTO orders ...", order)
        db.execute(
            "INSERT INTO outbox (event_id, aggregate_id, event_type, payload) VALUES (%s, %s, %s, %s)",
            event["event_id"], order.id, event["event_type"], json.dumps(event),
        )
Relay (a separate worker)
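from time import sleep

while True:
    # Select and mark inside one transaction: the FOR UPDATE row locks are only
    # held until the transaction ends, and other relay workers skip locked rows.
    with db.transaction():
        rows = db.query("""
            SELECT event_id, event_type, payload
            FROM outbox
            WHERE published_at IS NULL
            ORDER BY created_at
            LIMIT 100
            FOR UPDATE SKIP LOCKED
        """)
        for r in rows:
            broker.publish(topic(r.event_type), r.payload)  # at-least-once
            db.execute(
                "UPDATE outbox SET published_at = now() WHERE event_id = %s",
                r.event_id,
            )
    if not rows:
        sleep(0.2)  # back off briefly when the outbox is empty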
Consumers see every event at least once (publish may be retried if the relay crashes before the UPDATE). The event_id travels with the event so consumers dedupe (Concept 12).
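The redelivery case can be made concrete with a small runnable sketch. It uses sqlite3 in place of Postgres, so the FOR UPDATE SKIP LOCKED clause is omitted (SQLite has no equivalent), and a plain list stands in for the broker. `relay_once` and the `crash_before_mark` flag are hypothetical names introduced for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox ("
    " event_id TEXT PRIMARY KEY,"
    " event_type TEXT NOT NULL,"
    " payload TEXT NOT NULL,"
    " created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,"
    " published_at TEXT NULL)"
)
conn.execute(
    "INSERT INTO outbox (event_id, event_type, payload) VALUES ('e-1', 'OrderPlaced', '{}')"
)
conn.commit()

delivered = []  # stand-in for the broker: records every delivery attempt


def publish(event_type, event_id, payload):
    delivered.append(event_id)


def relay_once(conn, publish, crash_before_mark=False):
    """Publish one batch of unpublished rows; return how many rows were found."""
    rows = conn.execute(
        "SELECT event_id, event_type, payload FROM outbox"
        " WHERE published_at IS NULL ORDER BY created_at LIMIT 100"
    ).fetchall()
    for event_id, event_type, payload in rows:
        publish(event_type, event_id, payload)
        if crash_before_mark:
            raise RuntimeError("simulated relay crash after publish")
        with conn:  # commit the published_at marker
            conn.execute(
                "UPDATE outbox SET published_at = CURRENT_TIMESTAMP WHERE event_id = ?",
                (event_id,),
            )
    return len(rows)


try:
    relay_once(conn, publish, crash_before_mark=True)  # published, never marked
except RuntimeError:
    pass
relay_once(conn, publish)          # the restarted relay publishes e-1 again

print(delivered)                   # ['e-1', 'e-1']
print(relay_once(conn, publish))   # 0
```

The duplicate delivery is the expected cost of at-least-once publication, not a defect in the relay, which is why the event_id has to reach the consumer.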
Variants
- Polling relay (shown above): simple, portable, adds a bit of DB load.
- Transaction log tailing / CDC: point Debezium or a similar tool at the database's WAL and capture changes to the outbox table; no polling, near-real-time, but tied to a specific database and tool.
- Listen/notify: Postgres LISTEN/NOTIFY wakes the relay when rows arrive; reduces latency but still needs polling as a safety net.
All three have the same atomicity property, which is the thing that matters.
Common Confusion / Misconception
"Just use a distributed transaction between DB and broker (XA / 2PC)." Theoretically yes; practically, most modern brokers (Kafka, SQS, NATS) do not support XA with your database, and even when they do (some AMQP brokers plus some DBs) the operational cost is enormous. The industry moved to outbox specifically to avoid XA.
"If the relay is slow, consumers see stale events." Yes. The relay is a latency cost -- typically tens of milliseconds with CDC, hundreds with polling. For the use cases that want the outbox, that latency is acceptable. If you need synchronous publication, you do not want at-most-once consistency; you want a command reply.
"I can just publish after the COMMIT -- the DB is now the source of truth." A crash between COMMIT and publish still loses the event. The publish-after-commit window is smaller than the pre-commit one, but it is not zero. Many bugs have hidden in exactly that window.
"The outbox table will grow forever." Truncate or archive rows older than a bounded horizon (e.g., 7 days after published_at). The outbox is an integration buffer, not an audit log. Keep the events themselves on the broker (Kafka with long retention) or in a dedicated event store if you need history.
"I do not need event_id if the DB primary key is the order ID." You do. The event ID is used by consumers for dedup, including redelivery after relay retries. It is distinct from the aggregate ID.
How To Use It
Adoption checklist:
- Add an outbox table with event_id PK, event_type, payload, created_at, published_at.
- Inside every write transaction that should produce events, insert the corresponding outbox row(s) in the same transaction.
- Deploy a relay that polls (or tails the WAL) and publishes at-least-once.
- Give consumers a dedup strategy keyed on event_id (Concept 12).
- Monitor outbox lag (max(now() - created_at) WHERE published_at IS NULL) and alert when it exceeds a threshold (e.g., 30 s).
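The lag check from the last item could be sketched as follows, again with sqlite3 and ISO-8601 UTC strings standing in for Postgres timestamps. `outbox_lag_seconds`, `at`, and `LAG_THRESHOLD_SECONDS` are hypothetical names. Because max(now - created_at) over unpublished rows equals now minus the oldest created_at, the query only needs MIN.

```python
import sqlite3
from datetime import datetime, timezone

LAG_THRESHOLD_SECONDS = 30  # example threshold from the checklist


def outbox_lag_seconds(conn, now):
    """Age in seconds of the oldest unpublished outbox row (0.0 if none)."""
    row = conn.execute(
        "SELECT MIN(created_at) FROM outbox WHERE published_at IS NULL"
    ).fetchone()
    if row[0] is None:
        return 0.0
    return (now - datetime.fromisoformat(row[0])).total_seconds()


def at(second):
    """A fixed UTC instant at 12:00:<second>, to keep the demo deterministic."""
    return datetime(2024, 1, 20, 12, 0, second, tzinfo=timezone.utc)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox ("
    " event_id TEXT PRIMARY KEY,"
    " created_at TEXT NOT NULL,"
    " published_at TEXT NULL)"
)
conn.executemany(
    "INSERT INTO outbox VALUES (?, ?, ?)",
    [
        ("e-1", at(0).isoformat(), at(5).isoformat()),  # already published
        ("e-2", at(10).isoformat(), None),              # oldest unpublished row
        ("e-3", at(40).isoformat(), None),
    ],
)
conn.commit()

lag = outbox_lag_seconds(conn, now=at(45))
print(lag, lag > LAG_THRESHOLD_SECONDS)  # 35.0 True
```

In a real deployment the result would be exported as a metric rather than printed, and the threshold tuned to the relay's normal polling interval.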
Check Yourself
- Why does publishing after COMMIT still have a race window, and how small is the window in practice?
- What happens if the relay crashes after publishing to the broker but before updating published_at? Is that a bug?
- What would go wrong if you used DELETE FROM outbox instead of UPDATE ... SET published_at?
Mini Drill or Application
In 20 minutes:
- Take a service you know that does "write DB then call external API / publish." Sketch the dual-write failure modes in 3-5 sentences.
- Write the outbox DDL for that service and the single SQL transaction that replaces the dual write.
- Sketch the relay in 10-15 lines of pseudocode, including the FOR UPDATE SKIP LOCKED detail.
- Decide where dedup happens on the consumer side.
Transfer to Adjacent Domains
- Idempotency (Concept 12). The outbox guarantees at-least-once publication. Consumers must therefore be idempotent; the two patterns are complementary halves of reliable event delivery and should be designed together, not sequentially.
- Sagas (Concept 11). Every saga participant that writes DB state and emits an event needs an outbox -- otherwise a participant crash leaves the saga in an unrecoverable "state changed, event missed" limbo.
- Event sourcing (Concept 13). Event-sourced systems sidestep the dual-write problem entirely (the event is the write). If you're in an event-sourced bounded context, you don't need an outbox; if you're CRUD+events, you do.
- Data platform (S6). Teams feeding Snowflake/Databricks via CDC routinely reinvent outbox mechanics in ELT pipelines. Centralizing on one outbox with a schema contract is cheaper than per-pipeline de-duplication.
- Operational SRE. Outbox lag is a first-class SLI for event-driven services: max(now() - created_at) WHERE published_at IS NULL. Alert on it and you catch most publication regressions before customers do.
Read This Only If Stuck
- Richards & Ford: Preventing Data Loss -- the broader durability discussion that motivates outbox
- Richards & Ford: Event-Driven Architecture Style -- where the broker sits in the topology
- System Design Primer: Asynchronism -- queue basics that help reason about relay behavior
- System Design Primer: Consistency patterns -- frames why atomicity across two systems is hard without outbox
- Microservices.io: Transactional outbox -- the pattern as cataloged by Chris Richardson
- Microservices.io: Transaction log tailing -- the CDC variant of the same idea
- Microservices.io: Polling publisher -- the simplest relay implementation when CDC is unavailable
- Confluent: The Transactional Outbox Pattern (microservices course) -- video + code walkthrough with Kafka
- Debezium documentation: Outbox event router -- the most common production CDC implementation
- Debezium: PostgreSQL connector -- the details of tailing a WAL-based outbox