Consumer Groups, Partitions, Ordering Guarantees
What This Concept Is
Three Kafka (and Kafka-family) primitives decide who reads what, in what order, and at what speed:
- Partition: an ordered, immutable shard of a topic. All ordering guarantees live at the partition level.
- Consumer group: a named set of consumers that share a topic's partitions. Within a group, each partition is assigned to exactly one consumer. Different groups are independent.
- Key-based partitioning: producers choose a key; the broker hashes the key to a partition. Messages with the same key land on the same partition -- and so are processed in order.
These three together are how Kafka combines pub-sub (across groups) and point-to-point work distribution (within a group), on a log substrate.
The ASCII Diagram You Must Internalize
Topic: order.placed (partitions: P0, P1, P2)
P0: | key=ord_001 | key=ord_004 | key=ord_007 | ...
P1: | key=ord_002 | key=ord_005 | key=ord_008 | ...
P2: | key=ord_003 | key=ord_006 | key=ord_009 | ...
|
v
+--------------------------+ +--------------------------+
| Consumer group: billing | | Consumer group: analytics|
| | | |
| c1 <-- P0 | | a1 <-- P0, P1, P2 |
| c2 <-- P1 | | |
| c3 <-- P2 | | (one consumer reading |
| | | all 3 partitions) |
| (3 consumers, 3 parts, | | |
| 1-to-1 assignment) | | |
+--------------------------+ +--------------------------+
Rules:
- within a group, one consumer per partition
- group "billing" and group "analytics" each see every message (pub-sub)
- within group "billing", messages for ord_001 always go to c1
(because all ord_001 events are on P0, and c1 owns P0)
Read this diagram until the three rules are reflex:
- Partition order is the only order Kafka guarantees.
- Same key -> same partition -> same consumer (for a given group state).
- Consumer groups do not see each other; they read independently.
The Arithmetic That Trips People Up
| Situation | What happens |
|---|---|
partitions = 3, consumers in group = 1 | The one consumer reads all 3 partitions (serialized). Fine, but no parallelism. |
partitions = 3, consumers in group = 3 | Each consumer owns 1 partition. Maximum parallelism. |
partitions = 3, consumers in group = 5 | 3 consumers get one partition each. 2 are idle. |
partitions = 3, consumers in group = 4 | Uneven: one consumer gets 0, others get 1. |
groups = 2, each with 3 consumers | Each group independently reads all partitions. Each message handled by exactly one consumer in each group. |
Consequence: you cannot scale past your partition count within a group. If you want more consumer parallelism, you must increase partitions (which is disruptive for keyed topics -- old keys redistribute).
Ordering Guarantees: The Honest Story
Kafka guarantees:
- Per-partition, per-key, FIFO order. If you key by
order_id, every event forord_9f2ais ordered. - Nothing across partitions. If two keys hash to different partitions, their global order is not defined.
- Nothing at the topic level unless you use exactly 1 partition (which collapses throughput).
Practical pattern: partition by the identity whose order matters. Events about the same order: key by order_id. Events about the same user: key by user_id. Payments by account? key by account_id. Anything that must be processed in order with respect to itself lives on the same partition.
Failure modes you will see
- Hot partitions -- 90% of traffic keys to one entity (a celebrity user, a big merchant). The partition saturates and one consumer lags.
- Re-partitioning breaks order -- if you add partitions, existing keys may remap, and in-flight events for a key could arrive out of order relative to new ones.
- Rebalances stall consumption -- when a consumer joins/leaves a group, the broker reassigns partitions. During the rebalance, no messages are delivered. Newer Kafka (cooperative / incremental rebalancing) reduces this, but it still exists.
Delivery Guarantees, Briefly
Tied intimately to offsets:
- At-most-once: commit offset before processing. Crash loses the message.
- At-least-once (default): commit offset after processing. Crash re-delivers. Consumers must be idempotent (Concept 12).
- Exactly-once within Kafka (EOS): the producer + transactions API +
isolation.level=read_committedgives end-to-end EOS if you stay inside Kafka (Kafka-to-Kafka pipelines, Kafka Streams). The moment you touch an external system (DB, HTTP API), you are back to at-least-once and idempotency rules.
Why It Matters Here
Every saga (Concept 11), projection (Concept 14), and event-sourced system (Concept 13) is built on top of ordering and group semantics. Misunderstanding the three primitives above produces specific pathologies:
- saga steps arrive out of order -> events for the same order were keyed inconsistently, hitting different partitions
- a read model "flickers" -> two consumers in the same group are consuming the same key (because of a misunderstanding about group vs client)
- downstream system sees duplicates -> consumer was committing before processing, then restarting
- consumer lag grows unboundedly -> one hot partition, or more consumers than partitions
Concrete Example
Same topic, two groups, with partitioning by order_id:
Topic: order.placed (12 partitions, key = order_id)
Group "billing" (12 consumers)
-> 1 consumer per partition; ord_9f2a always handled by the same billing consumer
Group "analytics" (3 consumers)
-> each consumer handles 4 partitions; slower but simpler deployment
Both groups see every message.
If billing doubles its consumer count to 24, 12 are idle -- partition count is the ceiling.
Common Confusion / Misconception
"Kafka keeps messages ordered." Per partition. Stop. Not globally.
"More consumers = more throughput." Up to the partition count. Past that, adding consumers adds idle processes.
"Consumer group is the same as a subscriber." Not quite. A subscriber is a logical listener for the stream. A group is how the broker shares work among a listener's processes. Usually one service == one group, but the same service may run multiple groups (e.g., one primary and one "replay" group).
"Keying guarantees exactly-once delivery." Keying guarantees order for that key. Idempotency is a separate problem (Concept 12).
"Increasing partition count is safe." Increasing partitions reshuffles key-to-partition mapping for the new partitions and breaks key-order invariants for keys that move. Treat it as a one-way door; plan partition count up front.
How To Use It
Topic-design checklist:
- Key choice: the entity whose ordering matters. Default: the aggregate ID.
- Partition count: 2-3x peak intended consumer parallelism, with headroom for growth.
- Consumer group per logical subscriber: one group per service or per logical listener.
- Commit strategy: after successful processing, with manual commit and batch size tuned to failure blast radius.
- Rebalance handling: ensure your consumers clean up and resume cleanly on partition revoke/assign callbacks.
- Hot-key plan: if you expect skew (big accounts, celebrity users), consider composite keys (
account:region) or separate topics for VIP flows.
Check Yourself
- A topic has 4 partitions and a consumer group has 6 consumers. What happens to the extra 2?
- Why does "key by
order_id" give per-order FIFO but not per-customer FIFO? - In a group, two consumers are assigned to the same partition. What did you do wrong?
- Your saga sometimes sees
PaymentCapturedbeforeOrderPlaced. What are the two most likely causes?
Mini Drill or Application
Design the partitioning for a checkout system producing OrderPlaced, PaymentCaptured, StockReserved, OrderShipped. In 20 minutes:
- Pick the partition key for each topic. Justify.
- Pick partition count for each topic. Justify.
- Decide whether a saga orchestrator consuming all four should be one group or four groups. Why?
- Identify one hot-key risk and mitigate it.
Transfer to Adjacent Domains
- Sagas & workflow (Cluster 4). Saga steps for one entity must land in one partition. Choosing
saga_idor aggregate ID as the partition key is what keeps sagas ordered when they ride on a Kafka substrate. - Projections (Concept 14). A projection reads one or more partitions; its lag is a per-partition measurement, not a topic measurement. Hot partitions produce hot projection lag. The diagnostic path goes straight from "dashboard stale" to "which partition" to "which key is oversubscribed."
- Database sharding (S6). Partition-key selection in Kafka is structurally the same problem as shard-key selection in a distributed DB: pick for even distribution and for per-entity locality. The failure modes (hot shards / hot partitions) are the same.
- Scalability math. "Consumers ≤ partitions" is the single arithmetic rule that dominates Kafka-backed scale-out. Many teams over-provision consumers for months before realizing the ceiling is partition count.
- Deployment / rollout. Rebalances pause consumption. Plan deploys with cooperative rebalancing on, or accept a lag spike. This is an operational concern that purely Kafka-naive teams rediscover under pressure.
Read This Only If Stuck
- Richards & Ford: Event-Driven Architecture Style -- the broker topology and why consumer independence matters
- Richards & Ford: Event-Driven Ratings -- scalability and elasticity ratings that depend on partition/consumer design
- Richards & Ford: Data Collisions -- ordering/conflict concerns that show up in multi-consumer designs
- System Design Primer: Database -- federation/sharding -- the sharding analogy; Kafka partitions are the same concept applied to logs
- Apache Kafka: Consumer Configs and Design: Consumers -- canonical description of consumer groups and offsets
- Confluent: Consumer Groups Explained -- approachable walkthrough of group membership and rebalancing
- Confluent: Incremental Cooperative Rebalancing -- why modern consumers avoid stop-the-world rebalances
- Confluent: Cooperative Rebalancing (Streams, ksqlDB) -- the rebalance story extended to stateful stream processing
- Confluent: KIP-848 -- next-generation rebalance protocol -- the direction Kafka is moving for very large consumer groups
- Confluent: Exactly-once Semantics -- where partition ordering meets EOS guarantees
- Jay Kreps: Benchmarking Apache Kafka -- throughput context for partition-count decisions