The Outbox Pattern: Atomically Publishing Events

What This Concept Is

You have a service that writes to its database and publishes an event. The naive implementation:

def place_order(order):
    db.insert("orders", order)            # (1) DB write
    broker.publish("order.placed", order) # (2) broker publish

This is the dual-write bug. There is no atomic boundary around (1) and (2):

if the process crashes between (1) and (2), the order is in the DB but nobody knows
if (2) is flaky, it fails while (1) succeeded -- same outcome
if you swap the order (publish first, then DB), a crash leaves an event with no underlying state; downstream systems act on a phantom
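The first two failure modes can be reproduced without any infrastructure. The sketch below is a simulation, not real client code: `FakeBroker`, `Crash`, and the list standing in for the database are hypothetical stand-ins introduced only for illustration.

```python
class Crash(Exception):
    """Stands in for the process dying or the broker being unreachable."""


class FakeBroker:
    """In-memory broker stub; fail=True simulates a flaky publish."""

    def __init__(self, fail=False):
        self.fail = fail
        self.events = []

    def publish(self, topic, payload):
        if self.fail:
            raise Crash("broker unreachable")
        self.events.append((topic, payload))


def place_order_dual_write(db_rows, broker, order):
    db_rows.append(order)                  # (1) DB write succeeds
    broker.publish("order.placed", order)  # (2) never happens on failure


db_rows = []
broker = FakeBroker(fail=True)
try:
    place_order_dual_write(db_rows, broker, {"id": "o-1"})
except Crash:
    pass  # the request fails, but step (1) is already durable

print(len(db_rows), len(broker.events))  # 1 0
```

The order exists and no event does, which is exactly the state that no retry on the caller's side can reliably repair.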

The outbox pattern fixes this by making event publication part of the same database transaction as the state change.

BEGIN;
  INSERT INTO orders (...);
  INSERT INTO outbox (event_id, event_type, payload, created_at, published_at) VALUES (..., now(), NULL);
COMMIT;

A separate relay process then:

reads unpublished rows from the outbox table
publishes each to the broker
marks it published_at = now() (or deletes it)

Because (1) the state change and (2) the outbox insert live in one transaction, either both happen or neither does. The publish is at-least-once (the relay may retry), but the intent to publish is atomic with the state change.
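To see the both-or-neither property in action, here is a minimal runnable sketch. It assumes SQLite (via Python's standard `sqlite3` module) as a stand-in for your real database, a simplified two-table schema, and a hypothetical `fail_before_commit` flag that simulates a crash inside the transaction.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL);
    CREATE TABLE outbox (
        event_id     TEXT PRIMARY KEY,
        event_type   TEXT NOT NULL,
        payload      TEXT NOT NULL,
        created_at   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        published_at TEXT
    );
""")


def place_order(conn, order_id, total_cents, fail_before_commit=False):
    event_id = str(uuid.uuid4())
    payload = json.dumps(
        {"event_id": event_id, "order_id": order_id, "total_cents": total_cents}
    )
    with conn:  # commits on success, rolls back if the block raises
        conn.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
        if fail_before_commit:
            raise RuntimeError("simulated crash before COMMIT")
        conn.execute(
            "INSERT INTO outbox (event_id, event_type, payload) VALUES (?, ?, ?)",
            (event_id, "OrderPlaced", payload),
        )


place_order(conn, "o-1", 1999)
try:
    place_order(conn, "o-2", 500, fail_before_commit=True)
except RuntimeError:
    pass  # the whole transaction, including the orders row, was rolled back

orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
events = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
print(orders, events)  # 1 1
```

The failed call leaves neither an order nor an outbox row behind, so the two tables can never disagree about whether an event is owed.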

Why It Matters Here

Every pattern in the rest of this module assumes events are reliably emitted. Without the outbox (or an equivalent -- CDC from the DB's WAL, transactional messaging from the broker, event sourcing which sidesteps the problem entirely):

pub-sub is unreliable at the producer boundary
sagas miss compensating events and stall
projections miss state changes and drift
idempotency on consumers cannot rescue data that was never published

The outbox pattern is the workhorse that makes event-driven architecture trustworthy without giving up a regular database.

Concrete Example

Outbox schema

CREATE TABLE outbox (
  event_id       UUID PRIMARY KEY,
  aggregate_id   TEXT NOT NULL,          -- e.g. order_id; groups events for per-aggregate ordering
  event_type     TEXT NOT NULL,
  payload        JSONB NOT NULL,
  created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
  published_at   TIMESTAMPTZ NULL
);

CREATE INDEX outbox_unpublished ON outbox (created_at)
  WHERE published_at IS NULL;

Producer (the service doing real work)

import json
from uuid import uuid4

def place_order(order):
    event = {
      "event_id": str(uuid4()),  # str: a UUID object is not JSON-serializable
      "event_type": "OrderPlaced",
      "order_id": order.id,
      "total_cents": order.total_cents,
      # ...
    }
    # One transaction: the order row and the outbox row commit or roll back together.
    with db.transaction():
        db.execute("INSERT INTO orders ...", order)
        db.execute(
          "INSERT INTO outbox (event_id, aggregate_id, event_type, payload) VALUES (%s, %s, %s, %s)",
          event["event_id"], order.id, event["event_type"], json.dumps(event),
        )

Relay (a separate worker)

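from time import sleep

while True:
    # The row locks taken by FOR UPDATE SKIP LOCKED last only as long as the
    # transaction, so the SELECT, the publish, and the UPDATE share one.
    with db.transaction():
        rows = db.query("""
          SELECT event_id, event_type, payload
          FROM outbox
          WHERE published_at IS NULL
          ORDER BY created_at
          LIMIT 100
          FOR UPDATE SKIP LOCKED
        """)
        for r in rows:
            broker.publish(topic(r.event_type), r.payload)  # at-least-once
            db.execute(
              "UPDATE outbox SET published_at = now() WHERE event_id = %s",
              r.event_id,
            )
    if not rows:
        sleep(0.2)  # idle backoff when the outbox is empty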

Consumers see every event at least once (publish may be retried if the relay crashes before the UPDATE). The event_id travels with the event so consumers dedupe (Concept 12).
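A consumer-side sketch of what that dedup could look like, under loud assumptions: `DedupingConsumer` is a hypothetical class, the in-memory `set` stands in for a durable store of processed event IDs, and the running total stands in for whatever side effect your handler has.

```python
class DedupingConsumer:
    """Applies each event at most once, keyed on event_id."""

    def __init__(self):
        # Assumption: a real consumer would persist processed IDs durably,
        # ideally in the same transaction as the handler's side effects.
        self.seen = set()
        self.total_cents = 0

    def handle(self, event):
        if event["event_id"] in self.seen:
            return False  # redelivery: already applied, skip
        self.seen.add(event["event_id"])
        self.total_cents += event["total_cents"]
        return True


consumer = DedupingConsumer()
event = {"event_id": "e-1", "event_type": "OrderPlaced", "total_cents": 1999}

first = consumer.handle(event)   # normal delivery
second = consumer.handle(event)  # redelivery after a relay crash and retry

print(first, second, consumer.total_cents)  # True False 1999
```

The redelivered event is recognized and dropped, so the at-least-once relay does not turn into double-counted orders downstream.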

Variants

Polling relay (shown above): simple, portable, adds a bit of DB load.
Transactional log tail / CDC: point Debezium or similar at the database's WAL and capture inserts into the outbox table; no polling, near-real-time, but tied to the specific database and connector.
Listen/notify: Postgres LISTEN/NOTIFY wakes the relay when rows arrive; reduces latency but still needs polling as a safety net.

All three have the same atomicity property, which is the thing that matters.
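The "wake on notification, but keep polling as a safety net" idea from the listen/notify variant boils down to a wait with a timeout. In this sketch a `threading.Event` stands in for a notification arriving on a Postgres LISTEN connection; `wait_for_work` and the return labels are hypothetical names, and no real database driver is involved.

```python
import threading


def wait_for_work(wakeup, poll_interval_s):
    """Block until notified or until the polling interval elapses."""
    notified = wakeup.wait(timeout=poll_interval_s)
    wakeup.clear()  # reset so the next wait blocks again
    return "notified" if notified else "poll-timeout"


wakeup = threading.Event()

wakeup.set()  # simulates a notification arriving before the relay waits
reason_1 = wait_for_work(wakeup, poll_interval_s=0.05)

# No notification this time (e.g. it was lost): the timeout still wakes the relay.
reason_2 = wait_for_work(wakeup, poll_interval_s=0.05)

print(reason_1, reason_2)  # notified poll-timeout
```

Either way the relay proceeds to the same outbox query, so a missed notification costs latency, not correctness.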

Common Confusion / Misconception

"Just use a distributed transaction between DB and broker (XA / 2PC)." Theoretically yes; practically, most modern brokers (Kafka, SQS, NATS) do not support XA with your database, and even when they do (some AMQP brokers plus some DBs) the operational cost is enormous. The industry moved to outbox specifically to avoid XA.

"If the relay is slow, consumers see stale events." Yes. The relay is a latency cost -- typically tens of milliseconds with CDC, hundreds with polling. For the use cases that want the outbox, that latency is acceptable. If you need synchronous publication, you do not want an eventually consistent event at all; you want a command and its reply.

"I can just publish after the COMMIT -- the DB is now the source of truth." A crash between COMMIT and publish still loses the event. The publish-after-commit window is smaller than the pre-commit one, but it is not zero. Many bugs have hidden in exactly that window.

"The outbox table will grow forever." Delete or archive published rows once they pass a bounded horizon (e.g., 7 days after published_at), and never touch rows that are still unpublished. The outbox is an integration buffer, not an audit log. Keep the events themselves on the broker (Kafka with long retention) or in a dedicated event store if you need history.
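A pruning job for that horizon might look like the following sketch. It assumes SQLite as a stand-in database, ISO-8601 UTC timestamps stored as text, and a hypothetical `prune_outbox` helper; the 7-day default mirrors the example horizon above.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outbox (event_id TEXT PRIMARY KEY, published_at TEXT)")

now = datetime(2024, 1, 31, tzinfo=timezone.utc)  # fixed clock for the example
rows = [
    ("old-published", (now - timedelta(days=10)).isoformat()),
    ("recent-published", (now - timedelta(days=1)).isoformat()),
    ("unpublished", None),
]
conn.executemany("INSERT INTO outbox VALUES (?, ?)", rows)


def prune_outbox(conn, now, retention=timedelta(days=7)):
    """Delete rows published more than `retention` ago; keep unpublished rows."""
    cutoff = (now - retention).isoformat()
    with conn:
        cur = conn.execute(
            "DELETE FROM outbox"
            " WHERE published_at IS NOT NULL AND published_at < ?",
            (cutoff,),
        )
    return cur.rowcount


deleted = prune_outbox(conn, now)
remaining = [
    r[0] for r in conn.execute("SELECT event_id FROM outbox ORDER BY event_id")
]
print(deleted, remaining)  # 1 ['recent-published', 'unpublished']
```

The explicit `published_at IS NOT NULL` guard is the important part: a pruning job must never be able to delete an event the relay has not delivered yet.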

"I do not need event_id if the DB primary key is the order ID." You do. The event ID is used by consumers for dedup, including redelivery after relay retries. It is distinct from the aggregate ID.

How To Use It

Adoption checklist:

Add an outbox table with event_id PK, aggregate_id, event_type, payload, created_at, published_at.
Inside every write transaction that should produce events, insert the corresponding outbox row(s) in the same transaction.
Deploy a relay that polls (or tails the WAL) and publishes at-least-once.
Give consumers a dedup strategy keyed on event_id (Concept 12).
Monitor outbox lag (max(now() - created_at) WHERE published_at IS NULL) and alert when it exceeds a threshold (e.g., 30 s).
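The lag metric in the last step can be computed with one query, since max(now() - created_at) over unpublished rows equals now() minus the oldest unpublished created_at. This is a sketch under assumptions: SQLite stands in for the real database, timestamps are ISO-8601 UTC text, and `outbox_lag_seconds` is a hypothetical helper name.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox ("
    " event_id TEXT PRIMARY KEY, created_at TEXT NOT NULL, published_at TEXT)"
)

now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)  # fixed clock


def ts(seconds_ago):
    return (now - timedelta(seconds=seconds_ago)).isoformat()


conn.executemany(
    "INSERT INTO outbox VALUES (?, ?, ?)",
    [
        ("e-1", ts(300), ts(299)),  # already published: ignored by the metric
        ("e-2", ts(45), None),      # oldest unpublished row
        ("e-3", ts(10), None),
    ],
)


def outbox_lag_seconds(conn, now):
    """Age in seconds of the oldest unpublished outbox row, 0.0 if none."""
    oldest = conn.execute(
        "SELECT MIN(created_at) FROM outbox WHERE published_at IS NULL"
    ).fetchone()[0]
    if oldest is None:
        return 0.0
    return (now - datetime.fromisoformat(oldest)).total_seconds()


lag = outbox_lag_seconds(conn, now)
should_alert = lag > 30  # example threshold from the checklist
print(lag, should_alert)  # 45.0 True
```

Exporting this number as a gauge and alerting on it covers a stuck relay, a broker outage, and a slow publish path with a single signal.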

Check Yourself

Why does publishing after COMMIT still have a race window, and how small is the window in practice?
What happens if the relay crashes after publishing to the broker but before updating published_at? Is that a bug?
What would go wrong if you used DELETE FROM outbox instead of UPDATE ... SET published_at?

Mini Drill or Application

In 20 minutes:

Take a service you know that does "write DB then call external API / publish." Sketch the dual-write failure modes in 3-5 sentences.
Write the outbox DDL for that service and the single SQL transaction that replaces the dual write.
Sketch the relay in 10-15 lines of pseudocode, including the FOR UPDATE SKIP LOCKED detail.
Decide where dedup happens on the consumer side.

Transfer to Adjacent Domains

Idempotency (Concept 12). The outbox guarantees at-least-once publication. Consumers must therefore be idempotent; the two patterns are complementary halves of reliable event delivery and should be designed together, not sequentially.
Sagas (Concept 11). Every saga participant that writes DB state and emits an event needs an outbox -- otherwise a participant crash leaves the saga in an unrecoverable "state changed, event missed" limbo.
Event sourcing (Concept 13). Event-sourced systems sidestep the dual-write problem entirely (the event is the write). If you're in an event-sourced bounded context, you don't need an outbox; if you're CRUD+events, you do.
Data platform (S6). Teams feeding Snowflake/Databricks via CDC routinely reinvent outbox mechanics in ELT pipelines. Centralizing on one outbox with a schema contract is cheaper than per-pipeline de-duplication.
Operational SRE. Outbox lag is a first-class SLI for event-driven services: max(now() - created_at) WHERE published_at IS NULL. Alert on it and you catch most publication regressions before customers do.

Read This Only If Stuck

Richards & Ford: Preventing Data Loss -- the broader durability discussion that motivates outbox
Richards & Ford: Event-Driven Architecture Style -- where the broker sits in the topology
System Design Primer: Asynchronism -- queue basics that help reason about relay behavior
System Design Primer: Consistency patterns -- frames why atomicity across two systems is hard without outbox
Microservices.io: Transactional outbox -- the pattern as cataloged by Chris Richardson
Microservices.io: Transaction log tailing -- the CDC variant of the same idea
Microservices.io: Polling publisher -- the simplest relay implementation when CDC is unavailable
Confluent: The Transactional Outbox Pattern (microservices course) -- video + code walkthrough with Kafka
Debezium documentation: Outbox event router -- the most common production CDC implementation
Debezium: PostgreSQL connector -- the details of tailing a WAL-based outbox
