Skip to main content

Projections and Read Models

What This Concept Is

A projection (sometimes: read model, materialized view) is a derived data structure built by consuming an event stream and folding its events into a query-optimized shape. It exists to serve reads efficiently; it is not the source of truth.

event log ---> projection handler ---> read store (DB, cache, search index)

The three defining properties:

  1. Derived. The projection's state is a pure function of the events it has consumed.
  2. Rebuildable. Drop the projection, reset its offset to zero, re-consume. You get the same state. No data loss.
  3. Independent. You can run many projections over the same events, each tailored to a specific query shape.

Projections exist anywhere events flow -- in event-sourced systems (Concept 13) they are the dominant way to read data, and in CRUD-plus-events systems they are the natural substrate for cross-service read models without 2PC.

Why It Matters Here

Projections are how you pay for the read side after you have committed to the event side:

  • Query optimization without schema contortion. The write side can stay normalized; each read side picks its own shape: relational, document, search index, materialized aggregate.
  • Decoupling. Each service builds its own projection from the broker; no cross-database joins.
  • Evolvability. Change the read shape? Build a new projection, switch readers over, drop the old one.
  • Rebuildability. Found a bug in a projection? Reset offset, replay, fixed.

In short: events give you the what happened; projections give you the how we answer queries about it.

Concrete Example: Three Projections Over One Stream

Events on order.* topic: OrderPlaced, PaymentCaptured, OrderShipped, OrderCancelled.

Projection A -- "Customer dashboard" read model

Shape: one row per order with summary fields.

CREATE TABLE customer_order_summary (
order_id TEXT PRIMARY KEY,
customer_id TEXT NOT NULL,
status TEXT NOT NULL,
total_cents INT,
placed_at TIMESTAMPTZ,
shipped_at TIMESTAMPTZ
);

Handler (pseudo):

def on(event):
match event.type:
case "OrderPlaced":
UPSERT into customer_order_summary (order_id, customer_id, status='placed',
total_cents=event.total_cents, placed_at=event.occurred_at)
case "PaymentCaptured":
UPDATE set status='paid' where order_id=event.order_id
case "OrderShipped":
UPDATE set status='shipped', shipped_at=event.occurred_at ...
case "OrderCancelled":
UPDATE set status='cancelled' ...

Reads are trivial (single-row or index-scan). The projection is always lightly behind the live stream.

Projection B -- "Revenue by day" analytics cube

Shape: aggregated by day.

CREATE TABLE revenue_by_day (
day DATE PRIMARY KEY,
orders INT,
revenue BIGINT
);

Handler consumes only PaymentCaptured, increments revenue and orders for the appropriate day. No joins; no scanning the orders DB.

Projection C -- "Search the orders" index

A consumer indexes each order into Elasticsearch, flattened: customer name, SKU list, status, total, tags. A search query never touches the source of truth.

The key observation

All three projections read the same event stream, each folding it into a completely different shape. Dropping and rebuilding any of them is a routine operation: stop the consumer, recreate the table/index, reset the offset to 0, start consuming.

Rebuildability in Practice

The rebuild workflow:

1. create parallel read store (new_customer_order_summary)
2. create a new consumer group "customer_order_summary_v2"
3. reset offsets to zero (or to a historical starting point)
4. consume the stream, building the new table
5. once caught up, switch reader traffic to the new table
6. drop the old table

Two gotchas:

  • Your event stream must retain history for as long as you need to rebuild. A Kafka topic with 7-day retention cannot rebuild a 2-year projection; that projection needs a log with longer retention or a dedicated event store.
  • During rebuild, the new projection lags. Switch reader traffic only after lag == 0.

Handling Out-of-Order or Late Events

In a partitioned stream (Concept 09), events for a given aggregate arrive in order within a partition. But across aggregates, a projection can see events "out of wall-clock order": an older event may arrive later due to rebalance, reprocessing, or a slower producer.

Design choices:

  • Monotonic transitions: write the projection such that older events cannot overwrite newer ones (UPDATE ... WHERE occurred_at < current). Use event timestamps or version numbers.
  • Idempotent upsert: the projection re-applies the same event and converges.
  • Event-time vs processing-time: for analytics, fold by event time (day = event.occurred_at::date), not processing time.

Common Confusion / Misconception

"The projection is the source of truth for reads." It is the source of query answers, not of truth. Truth lives in the events. If a projection and the events disagree, the events win and you rebuild.

"Projections give up consistency." They give up strong consistency; they are eventually consistent. For most read paths -- dashboards, search, aggregated analytics -- this is fine. For strict read-your-own-write paths, you serve from the write side or use a synchronous fallback.

"We can have one projection that answers all queries." You can, and it usually becomes a bloated, slow read side. The power of projections is specialization -- one per query shape.

"Projections need to be relational." A projection is any fold of the event stream into a readable shape. Document store, search index, in-memory structure, or a flat file are all fine.

"Every event updates every projection." Each projection subscribes only to the event types it cares about. Ignore the rest.

How To Use It

Designing a projection:

  1. Pick the query. What single query does this projection answer, and at what latency?
  2. Pick the store. Relational for joinable views, document for nested data, search for text, columnar for analytics.
  3. Pick the event subset. Which event types actually change this projection?
  4. Write the handler. For each event type, the update rule must be idempotent and monotonic.
  5. Decide lag tolerance. Alert if projection lag exceeds the SLO (e.g., 5 seconds).
  6. Plan the rebuild. Store + consumer group + ability to reset offsets.

Check Yourself

  1. Why must every projection be rebuildable, and what does that require of the event stream?
  2. Name three projections of an OrderPlaced + PaymentCaptured stream, each with a different store type.
  3. Why is "eventually consistent" the right default for projections, and when is it wrong?
  4. What property must a projection handler have so that reprocessing the same event does not corrupt state?

Mini Drill or Application

Take a dashboard or read API you know (a user timeline, an inventory summary, a transaction history). In 20 minutes:

  1. Identify the event stream(s) it could be built from.
  2. Write the DDL for the read model.
  3. Write the handler (pseudocode) for each event type.
  4. Describe the rebuild procedure end-to-end.
  5. Name one edge case (late event, out-of-order, partial event loss) and how the handler tolerates it.

Transfer to Adjacent Domains

  • CQRS (Concept 15). Projections are the structural thing CQRS commits to. Once you can build three projections over one log, CQRS is already implicit; formalizing it is mostly a governance and team-ownership exercise.
  • Event sourcing (Concept 13). Rebuildable projections are what make ES practical -- you do not have to get the read shape right up front. "Change the projection, reset offset, re-fold" is the superpower ES gives you.
  • Caching (classic distributed systems). A projection is a materialized, eventually-consistent cache fed by events. The cache-update-pattern literature (cache-aside, write-through, write-behind) maps onto projections if you squint -- but projections give up invalidation problems in exchange for rebuild cost.
  • Search (Elasticsearch, OpenSearch). The canonical "search index over a service's data" design is just a projection fed by events. Teams that maintain hand-rolled ETL into Elastic usually benefit from rewriting it as a projection -- lower drift, lower latency, trivially rebuildable.
  • Analytics warehouse (S6). A projection built into a columnar store (Snowflake, BigQuery) is often the cleanest way to feed a BI dashboard. The same events feed operational dashboards and analytics cubes -- no ETL parallel to the event layer.

Read This Only If Stuck