Data Model Design
What This Concept Is
Data model design for a system under scale is four decisions, made in this order:
- Entities: what are the nouns of the domain? Each gets a primary key and a schema.
- Relationships: which entities reference which? 1:1, 1:N, N:M? Enforced or denormalized?
- Indexes: which fields are queried? Which combinations? What is the cardinality and selectivity of each?
- Partitioning key: in a distributed store, which field determines which shard a row lives on? This is the single most load-sensitive decision in the design.
At this stage you write table sketches, not DDL. Columns with types, primary key, partitioning key, and the 2-3 indexes that matter. If the problem requires a relational model, draw the ER sketch; if it is wide-column, draw per-partition row layout; if it is document, draw one sample document with its access patterns annotated.
An access-pattern-first discipline, drawn from the AWS DynamoDB modeling guide: enumerate the queries before the schema. In SQL you can get away with normalizing first and adding indexes later, because the query planner rewrites your queries. In wide-column and KV stores, the schema is the index, and retrofitting a new access pattern often means introducing a second denormalized copy of the data kept in sync by events. Saying "we will figure out queries later" is fine for Postgres and lethal for DynamoDB or Cassandra.
Why It Matters Here
Data modeling mistakes compound. They cost rewrites, migrations, hot-partition incidents, and correctness bugs that data auditors find months later.
- Wrong entity boundary -> application-level joins and consistency bugs.
- Wrong index -> 100× slower reads or 10× slower writes.
- Wrong partitioning key -> hot shards, melted nodes, rebalancing nightmares.
- Wrong relationship denormalization -> divergent truth across tables.
Cluster 4 (scale) often comes down to "your partitioning key was wrong"; catching it here saves the session.
Concrete Example: Social Feed Timeline
Social feed, Timeline store (wide-column, e.g., Cassandra):
Table: user_timeline
partition_key: user_id -- all of one user's feed lives on one shard
clustering_key: created_at DESC -- within the partition, sorted newest-first
columns:
post_id uuid
author_id bigint
post_kind enum('text','image','retweet')
summary text -- denormalized for list rendering
author_handle text -- denormalized; avoids a join
created_at timestamp
Typical read:
SELECT * FROM user_timeline
WHERE user_id = :uid
AND created_at < :cursor
LIMIT 50;
Annotations:
- Entities:
User,Post,TimelineEntry. Users and Posts live in other tables;TimelineEntryis a derived entity maintained by fan-out. - Relationships:
TimelineEntryreferencesPost(bypost_id) andUser(byuser_idas owner,author_idas denorm). There is intentional duplication across user partitions -- one post becomes N timeline entries. - Indexes: the clustering key
created_at DESCis the index for the pagination query. No secondary index is needed for this table. - Partitioning key:
user_iddistributes read load evenly across users. An alternative,author_id, would have concentrated load on celebrity authors -- exactly the wrong shape.
Contrast: Posts store (same design):
Table: posts
partition_key: author_id, bucket -- (author_id, floor(created_at/day))
clustering_key: created_at DESC
columns:
post_id uuid
body text
media_ref text
created_at timestamp
Here, author_id plus a time bucket avoids unbounded partition growth for prolific authors -- a common hot-partition fix. Without the bucket, an active author's partition grows forever, which wide-column stores handle poorly.
For a URL shortener:
Table: shorts (wide-column or KV)
partition_key: short_code -- primary access is by code
columns:
long_url text
owner_user_id bigint nullable
created_at timestamp
expires_at timestamp nullable
Index for "my short links" query:
Table: shorts_by_owner
partition_key: owner_user_id
clustering_key: created_at DESC
columns: short_code
Notice how the secondary access pattern ("list my links") becomes a second table with its own partition key. This is standard in wide-column and NoSQL design: you design one table per access pattern.
Concrete Example 2: Chat Messages
Table: messages
partition_key: (conversation_id, day_bucket) -- one day of one chat
clustering_key: (timestamp ASC, message_id)
columns:
sender_id bigint
body text
attachments list<uri>
client_msg_id uuid -- for idempotency
status enum
Why this shape:
- Partitioning by
conversation_idalone seems obvious and fails on long-running chats (group chat over years -> unbounded partition). - Adding
day_bucketputs each day of a chat in its own partition. Typical reads ("last 50 messages in this chat") hit the current bucket; "load older" moves to the previous bucket. Partition size is bounded bymessages_per_day × day_length. - Clustering
ASCmatches forward-scroll reads; chat UIs typically read oldest-of-the-page forward, then cache client-side. client_msg_idsupports idempotency. When a client retries a send after a network blip, the server deduplicates by this key -- a common exactly-once approximation.
Access patterns this table serves:
| Pattern | Read path |
|---|---|
| Last N messages | WHERE conversation_id=? AND day_bucket=today ORDER BY timestamp DESC LIMIT 50 |
| Scroll older | Next bucket back; paginate |
| Message delivery receipt | Update status column (partition-local, cheap) |
Access patterns this table does NOT serve cheaply:
| Pattern | Why not |
|---|---|
| "Search messages by text in a conversation" | Needs a separate search index (OpenSearch, driven by an async consumer) |
| "All messages sent by user X across all conversations" | Second table messages_by_sender partitioned by sender_id |
The last row is the important one -- NoSQL data modeling is a table-per-access-pattern discipline, and being explicit about which patterns each table does not serve prevents later surprise.
Common Confusion / Misconceptions
"Partitioning key is whatever the primary key is." No. The primary key uniquely identifies a row; the partition key controls where the row lives. In some stores they are the same; in others they differ (composite keys). Confusing the two creates either hot shards or impossible queries.
"Denormalization is bad." It depends on the store. In SQL, normalize first and denormalize only when you must. In wide-column and document stores, denormalization is the native model; joins at read time are the anti-pattern.
"One table covers all access patterns." In SQL with appropriate indexes, often yes. In NoSQL, almost never. Design one (materialized) view per access pattern and keep them updated via events or triggers.
"Indexes are free." Every secondary index is a write-path tax and a storage cost. Three well-chosen indexes beat twelve speculative ones.
"Buckets are a micro-optimization." No. Unbounded partitions are a foot-gun unique to wide-column and KV stores; a time-bucket is a standard cure and you should reach for it reflexively on append-only tables.
How To Use It
Per primary entity:
- State the entity in one sentence.
- Write the column sketch: field names, types, nullability.
- Pick the primary key. Circle it.
- Pick the partitioning key. Circle it separately.
- List the 2-3 indexes (secondary tables in NoSQL; real indexes in SQL).
- Write the 2-3 queries this entity serves, using those indexes.
- Ask: is any partition unbounded? If so, add a bucket.
Repeat per entity. Three to five entities is typical.
Transfer / Where This Shows Up Later
- Cluster 3 concept 9 (consistency) attaches a concurrency/consistency contract to each of these mutable tables.
- Cluster 4 concept 10 (scale) audits the partitioning key at 10× and 100×.
- Cluster 4 concept 12 (SPOFs) often finds that an unbounded partition is a soft SPOF.
- S8M3 (data patterns) is the deep module on this -- it covers multi-model persistence, CDC, event sourcing.
- S9 (cloud) maps partitioning keys to DynamoDB partition keys + sort keys; Aurora/Postgres logical sharding; Spanner interleaved tables.
- S10 capstone/interviews: partitioning-key questions are the single most common follow-up after the box diagram; a wrong answer here unwinds the rest of the design.
Check Yourself
- In the timeline table above, what goes wrong if we partition by
created_atinstead ofuser_id? - Why is
(author_id, day_bucket)a safer partitioning key thanauthor_idalone for an append-only post table? - What is the cost of adding a fifth secondary index on a write-heavy table?
- Name one access pattern your chat
messagestable above does not serve and the second table you'd add for it.
Mini Drill or Application
For each workload, produce column sketches with primary key, partitioning key, and 2-3 indexes:
- Chat messages per conversation (1 M conversations, 100 msgs/day each).
- Orders per merchant (10 K merchants, up to 10 K orders/day each).
- IoT sensor readings (1 M devices, 1 reading/sec each).
- Per-user audit log (1 B users, a few events/day each).
For each, name one partitioning key that looks tempting and explain why it is wrong.
Read This Only If Stuck
- System Design Primer: Database federation and sharding -- primary reference for partition-key choice.
- System Design Primer: NoSQL databases -- KV, document, wide-column, graph native access patterns.
- System Design Primer: SQL or NoSQL -- decision table.
- System Design Primer: Database RDBMS replication -- replication implications of the primary-key choice.
- Fundamentals: Database partitioning -- logical vs physical partitioning.
- Fundamentals: Connascence -- denormalization is a kind of data-level connascence; track it explicitly.
- AWS -- NoSQL design for DynamoDB -- definitive access-pattern-first modeling guide.
- Martin Fowler -- Catalog of Patterns of Distributed Systems -- Fixed Partitions, Key-Range Partitions, Segmented Log all apply here.
- High Scalability -- case studies where the wrong partition key caused an outage.