Consumer Groups, Partitions, Ordering Guarantees

What This Concept Is

Three Kafka (and Kafka-family) primitives decide who reads what, in what order, and at what speed:

Partition: an ordered, immutable shard of a topic. All ordering guarantees live at the partition level.
Consumer group: a named set of consumers that share a topic's partitions. Within a group, each partition is assigned to exactly one consumer. Different groups are independent.
Key-based partitioning: producers choose a key; the producer's partitioner hashes the key to a partition. Messages with the same key land on the same partition, so within a group a single consumer processes them in order.

These three together are how Kafka combines pub-sub (across groups) and point-to-point work distribution (within a group), on a log substrate.
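A minimal sketch of the "same key, same partition" rule. It assumes CRC32 as a stand-in hash (the Java client's default partitioner uses murmur2 for keyed records); `partition_for` and the sample events are hypothetical names for illustration only.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash.

    Assumption: CRC32 stands in for the real partitioner hash; only the
    "same key, same partition" property is being illustrated.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Hypothetical events: (key, event type), in the order they are produced.
events = [
    ("ord_001", "OrderPlaced"),
    ("ord_002", "OrderPlaced"),
    ("ord_001", "PaymentCaptured"),
    ("ord_001", "OrderShipped"),
]

# Each partition is an append-only list, like a tiny log.
partitions = {p: [] for p in range(3)}
for key, event in events:
    partitions[partition_for(key, 3)].append((key, event))

# Every ord_001 event sits on one partition, in produce order.
ord_001_partition = partition_for("ord_001", 3)
ord_001_events = [e for k, e in partitions[ord_001_partition] if k == "ord_001"]
```

Because the hash is deterministic, whichever consumer owns that partition sees the three `ord_001` events in the order they were produced.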

The ASCII Diagram You Must Internalize

 Topic: order.placed           (partitions: P0, P1, P2)
   P0: | key=ord_001 | key=ord_004 | key=ord_007 | ...
   P1: | key=ord_002 | key=ord_005 | key=ord_008 | ...
   P2: | key=ord_003 | key=ord_006 | key=ord_009 | ...

                                  |
                                  v

 +--------------------------+       +--------------------------+
 | Consumer group: billing  |       | Consumer group: analytics|
 |                          |       |                          |
 |   c1  <-- P0             |       |   a1  <-- P0, P1, P2     |
 |   c2  <-- P1             |       |                          |
 |   c3  <-- P2             |       |   (one consumer reading  |
 |                          |       |    all 3 partitions)     |
 | (3 consumers, 3 parts,   |       |                          |
 |  1-to-1 assignment)      |       |                          |
 +--------------------------+       +--------------------------+

 Rules:
   - within a group, one consumer per partition
   - group "billing" and group "analytics" each see every message (pub-sub)
   - within group "billing", messages for ord_001 always go to c1
     (because all ord_001 events are on P0, and c1 owns P0)

Read this diagram until the three rules are reflex:

Partition order is the only order Kafka guarantees.
Same key -> same partition -> same consumer (for a given group state).
Consumer groups do not see each other; they read independently.

The Arithmetic That Trips People Up

Situation	What happens
`partitions = 3, consumers in group = 1`	The one consumer reads all 3 partitions (serialized). Fine, but no parallelism.
`partitions = 3, consumers in group = 3`	Each consumer owns 1 partition. Maximum parallelism.
`partitions = 3, consumers in group = 5`	3 consumers get one partition each. 2 are idle.
`partitions = 3, consumers in group = 4`	Uneven: one consumer gets 0, others get 1.
`groups = 2, each with 3 consumers`	Each group independently reads all partitions. Each message handled by exactly one consumer in each group.

Consequence: you cannot scale past your partition count within a group. If you want more consumer parallelism, you must increase partitions (which is disruptive for keyed topics -- old keys redistribute).
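The table above can be reproduced with a small simulation. This is a sketch that assumes a simple round-robin assignment; `assign` and `idle` are hypothetical helpers, not Kafka's assignor implementations.

```python
from typing import Dict, List

def assign(num_partitions: int, consumers: List[str]) -> Dict[str, List[int]]:
    """Round-robin partitions over consumers (illustrative stand-in only)."""
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

def idle(assignment: Dict[str, List[int]]) -> List[str]:
    """Consumers that received no partition."""
    return [c for c, parts in assignment.items() if not parts]

one = assign(3, ["c1"])                            # one consumer owns everything
three = assign(3, ["c1", "c2", "c3"])              # 1-to-1
five = assign(3, ["c1", "c2", "c3", "c4", "c5"])   # two consumers sit idle
```

Whatever the assignment strategy, the number of busy consumers in a group can never exceed the partition count.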

Ordering Guarantees: The Honest Story

Kafka guarantees:

Per-partition, per-key, FIFO order. If you key by order_id, every event for ord_9f2a is ordered.
Nothing across partitions. If two keys hash to different partitions, their global order is not defined.
Nothing at the topic level unless you use exactly 1 partition (which collapses throughput).

Practical pattern: partition by the identity whose order matters. Events about the same order: key by order_id. Events about the same user: key by user_id. Payments by account? key by account_id. Anything that must be processed in order with respect to itself lives on the same partition.

Failure modes you will see

Hot partitions -- 90% of traffic keys to one entity (a celebrity user, a big merchant). The partition saturates and one consumer lags.
Re-partitioning breaks order -- if you add partitions, existing keys may remap, and in-flight events for a key could arrive out of order relative to new ones.
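Rebalances stall consumption -- when a consumer joins or leaves a group, the group coordinator triggers a reassignment of partitions. With the classic (eager) protocol, every consumer in the group stops fetching until the rebalance completes. Cooperative / incremental rebalancing in newer Kafka clients limits the pause to the partitions that actually move, but the pause still exists.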
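The re-partitioning failure mode can be quantified with a short sketch. It assumes a hash-mod-N partitioner with CRC32 as a stand-in hash, and synthetic keys; `partition_for` is a hypothetical helper. Under a uniform hash, a key keeps its partition when going from 12 to 16 partitions only if `hash % 48 < 12`, so roughly three quarters of keys should move.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash mod partition count (CRC32 is an assumed stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Synthetic order IDs, purely for illustration.
keys = [f"ord_{i:05d}" for i in range(10_000)]

# Count keys whose partition changes when the topic grows from 12 to 16.
moved = sum(1 for k in keys if partition_for(k, 12) != partition_for(k, 16))
fraction_moved = moved / len(keys)
```

Any key that moves has its older events on one partition and its newer events on another, which is exactly where per-key ordering is lost.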

Delivery Guarantees, Briefly

Tied intimately to offsets:

At-most-once: commit offset before processing. Crash loses the message.
At-least-once (default): commit offset after processing. Crash re-delivers. Consumers must be idempotent (Concept 12).
Exactly-once within Kafka (EOS): the idempotent producer + transactions API + isolation.level=read_committed on consumers gives exactly-once processing as long as you stay inside Kafka (Kafka-to-Kafka pipelines, Kafka Streams). The moment you touch an external system (DB, HTTP API), you are back to at-least-once and idempotency rules.
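The difference between the first two modes comes down to where the crash falls relative to the commit. The following is a toy simulation, not client code: the log is a Python list, and `consume` and `run` are hypothetical helpers that model one crash followed by a restart from the last committed offset.

```python
def consume(log, committed, commit_first, crash_at, processed):
    """Run one consumer lifetime from the committed offset.

    Returns the committed offset at the moment the consumer stops or crashes.
    """
    offset = committed
    while offset < len(log):
        if commit_first:
            committed = offset + 1           # at-most-once: commit, then process
            if offset == crash_at:
                return committed             # crash before processing
            processed.append(log[offset])
        else:
            processed.append(log[offset])    # at-least-once: process, then commit
            if offset == crash_at:
                return committed             # crash before committing
            committed = offset + 1
        offset += 1
    return committed

def run(commit_first):
    log = ["m0", "m1", "m2"]
    processed = []
    # First lifetime crashes while handling offset 1.
    committed = consume(log, 0, commit_first, crash_at=1, processed=processed)
    # Restart resumes from the last committed offset and runs to the end.
    consume(log, committed, commit_first, crash_at=None, processed=processed)
    return processed

at_most_once = run(commit_first=True)    # m1 is lost
at_least_once = run(commit_first=False)  # m1 is processed twice
```

Commit-first drops `m1`; process-first handles it twice, which is why at-least-once consumers need idempotent handlers.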

Why It Matters Here

Every saga (Concept 11), projection (Concept 14), and event-sourced system (Concept 13) is built on top of ordering and group semantics. Misunderstanding the three primitives above produces specific pathologies:

saga steps arrive out of order -> events for the same order were keyed inconsistently, hitting different partitions
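a read model "flickers" -> two consumers are processing the same key, usually because instances of one service were started with different group.id values (group ID confused with client ID), so each instance forms its own group and receives every message
downstream system sees duplicates -> consumer processed a message, crashed before committing the offset, then reprocessed it after restart (at-least-once without an idempotent handler)
consumer lag grows unboundedly -> one hot partition, or the group is already at one consumer per partition, so adding consumers only adds idle processes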

Concrete Example

Same topic, two groups, with partitioning by order_id:

Topic: order.placed (12 partitions, key = order_id)

Group "billing" (12 consumers)
  -> 1 consumer per partition; ord_9f2a always handled by the same billing consumer

Group "analytics" (3 consumers)
  -> each consumer handles 4 partitions; slower but simpler deployment

Both groups see every message.
If billing doubles its consumer count to 24, 12 are idle -- partition count is the ceiling.

Common Confusion / Misconception

"Kafka keeps messages ordered." Per partition. Stop. Not globally.

"More consumers = more throughput." Up to the partition count. Past that, adding consumers adds idle processes.

"Consumer group is the same as a subscriber." Not quite. A subscriber is a logical listener for the stream. A group is how the broker shares work among a listener's processes. Usually one service == one group, but the same service may run multiple groups (e.g., one primary and one "replay" group).

"Keying guarantees exactly-once delivery." Keying guarantees order for that key. Idempotency is a separate problem (Concept 12).

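"Increasing partition count is safe." With a hash-mod-N partitioner, changing N changes the key-to-partition mapping for many existing keys, not only those that land on the new partitions, and breaks key-order invariants for every key that moves. Kafka also does not let you reduce a topic's partition count afterwards. Treat it as a one-way door; plan partition count up front.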

How To Use It

Topic-design checklist:

Key choice: the entity whose ordering matters. Default: the aggregate ID.
Partition count: 2-3x peak intended consumer parallelism, with headroom for growth.
Consumer group per logical subscriber: one group per service or per logical listener.
Commit strategy: after successful processing, with manual commit and batch size tuned to failure blast radius.
Rebalance handling: ensure your consumers clean up and resume cleanly on partition revoke/assign callbacks.
Hot-key plan: if you expect skew (big accounts, celebrity users), consider composite keys (account:region) or separate topics for VIP flows.
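The hot-key item in the checklist can be sketched as follows. This assumes CRC32 as a stand-in hash and a hypothetical hot account `acct_big` with four made-up regions. The trade-off: ordering now holds per account-and-region, no longer per account.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash mod partition count (CRC32 is an assumed stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def composite_key(account_id: str, region: str) -> str:
    """Composite key in the account:region form suggested in the checklist."""
    return f"{account_id}:{region}"

regions = ["eu", "us", "apac", "latam"]  # hypothetical regions

# Plain key: every event for the hot account hits one partition.
plain = {partition_for("acct_big", 12) for _ in regions}

# Composite key: the same account's traffic is spread by region.
spread = {partition_for(composite_key("acct_big", r), 12) for r in regions}
```

Use this only when the consumer does not need a single ordered stream per account; otherwise a separate topic for the heavy flow keeps the ordering contract intact.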

Check Yourself

A topic has 4 partitions and a consumer group has 6 consumers. What happens to the extra 2?
Why does "key by order_id" give per-order FIFO but not per-customer FIFO?
In a group, two consumers are assigned to the same partition. What did you do wrong?
Your saga sometimes sees PaymentCaptured before OrderPlaced. What are the two most likely causes?

Mini Drill or Application

Design the partitioning for a checkout system producing OrderPlaced, PaymentCaptured, StockReserved, OrderShipped. In 20 minutes:

Pick the partition key for each topic. Justify.
Pick partition count for each topic. Justify.
Decide whether a saga orchestrator consuming all four should be one group or four groups. Why?
Identify one hot-key risk and mitigate it.

Transfer to Adjacent Domains

Sagas & workflow (Cluster 4). Saga steps for one entity must land in one partition. Choosing saga_id or aggregate ID as the partition key is what keeps sagas ordered when they ride on a Kafka substrate.
Projections (Concept 14). A projection reads one or more partitions; its lag is a per-partition measurement, not a topic measurement. Hot partitions produce hot projection lag. The diagnostic path goes straight from "dashboard stale" to "which partition" to "which key is oversubscribed."
Database sharding (S6). Partition-key selection in Kafka is structurally the same problem as shard-key selection in a distributed DB: pick for even distribution and for per-entity locality. The failure modes (hot shards / hot partitions) are the same.
Scalability math. "Consumers ≤ partitions" is the single arithmetic rule that dominates Kafka-backed scale-out. Many teams over-provision consumers for months before realizing the ceiling is partition count.
Deployment / rollout. Rebalances pause consumption. Plan deploys with cooperative rebalancing on, or accept a lag spike. This is an operational concern that teams new to Kafka tend to rediscover under pressure.
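The per-partition lag point from the projections item can be made concrete with a small calculation. The offsets below are made-up numbers and `partition_lag` is a hypothetical helper; in a real system the end offsets and committed offsets would come from the broker or a monitoring tool.

```python
from typing import Dict

def partition_lag(end_offsets: Dict[int, int],
                  committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition = log end offset minus the group's committed offset."""
    return {p: end_offsets[p] - committed_offsets.get(p, 0) for p in end_offsets}

# Illustrative offsets for a 3-partition topic.
end_offsets = {0: 1000, 1: 1010, 2: 9000}
committed_offsets = {0: 990, 1: 1005, 2: 2000}

lag = partition_lag(end_offsets, committed_offsets)
worst_partition = max(lag, key=lag.get)   # where to start looking for a hot key
total_lag = sum(lag.values())             # a topic-level total hides the skew
```

Here almost all of the total lag sits on one partition, which points at a hot key or a slow consumer there, not at the group as a whole.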

Read This Only If Stuck

Richards & Ford: Event-Driven Architecture Style -- the broker topology and why consumer independence matters
Richards & Ford: Event-Driven Ratings -- scalability and elasticity ratings that depend on partition/consumer design
Richards & Ford: Data Collisions -- ordering/conflict concerns that show up in multi-consumer designs
System Design Primer: Database -- federation/sharding -- the sharding analogy; Kafka partitions are the same concept applied to logs
Apache Kafka: Consumer Configs and Design: Consumers -- canonical description of consumer groups and offsets
Confluent: Consumer Groups Explained -- approachable walkthrough of group membership and rebalancing
Confluent: Incremental Cooperative Rebalancing -- why modern consumers avoid stop-the-world rebalances
Confluent: Cooperative Rebalancing (Streams, ksqlDB) -- the rebalance story extended to stateful stream processing
Confluent: KIP-848 -- next-generation rebalance protocol -- the direction Kafka is moving for very large consumer groups
Confluent: Exactly-once Semantics -- where partition ordering meets EOS guarantees
Jay Kreps: Benchmarking Apache Kafka -- throughput context for partition-count decisions
