Skip to main content

Choose a Storage Approach

What This Concept Is

Choosing a storage approach is not "SQL vs NoSQL". It is a four-question conversation you hold with yourself on the board:

  1. What is the access pattern? Point lookups by primary key? Range scans? Secondary-index queries? Joins? Full-text search? Time-series append-and-scan?
  2. What is the consistency contract? Must a read immediately see a prior write? Tolerate staleness? Tolerate conflict resolution?
  3. What is the size and growth? Does the hot set fit in RAM? Does the total fit on one machine? Is growth predictable?
  4. What is the durability and failure target? Synchronous cross-AZ replication? Cross-region? Point-in-time recovery?

The storage choice (SQL, document, key-value, wide-column, search, time-series, object store, graph, or a hybrid of these) falls out of the four answers. The choice is never "because I like it".

A senior mental model: polyglot persistence (Martin Fowler's term). One system uses multiple storage technologies because one technology cannot natively serve all the access patterns at the required cost and latency. The pushback against "one DB for everything" is not dogma -- it is what happens when you honestly enumerate access patterns and discover they have conflicting native shapes (OLTP vs OLAP, transactional vs full-text, hot-KV vs analytics). The cost of polyglot is operational (more tech to run, more failure modes); the cost of monolithic storage is architectural (forcing workloads onto stores that fight them). Your job is to say which cost you are taking and why.

Why It Matters Here

Storage is usually the hardest-to-change decision in a design. Code is ephemeral, deployments roll, services get rewritten. Data is forever.

  • Pick SQL when you should have picked a wide-column store, and you will pay with manual sharding, hot-partition incidents, and complex migrations.
  • Pick a document store when you should have picked SQL, and you will reinvent joins, constraints, and schema evolution badly.
  • Pick one store when you should have picked two, and you will force one component's needs on another's workload.

Cluster 3 (data model) and Cluster 4 (scale) both depend on this being right.

Concrete Example: Social Feed (Polyglot Persistence)

For our social feed system:

WorkloadAccess patternBest fitWhy
User accounts, profilesPoint lookup by user ID; joins with friends/followees; small per-row; strong consistency on account changesSQL (Postgres/MySQL) with sharding by user IDRelational structure + transactions for signup/auth is worth more than raw scale
PostsAppend-only writes; read by post ID, or by author ID over a time rangeWide-column (Cassandra) or NoSQL KV sharded by author ID, clustering by created_at descHigh write throughput; range scans per author are the native access pattern
Timeline per userBillions of users × recent items; reads dominated by "give me last 100 items for user X"; write amplification on fan-outWide-column sharded by user ID, clustering by created_at desc; hot tail cached in RedisKeyed read of the tail is the only op; SQL joins would be catastrophic
Media blobsInfrequent write, heavy read, static bytesObject store (S3) behind a CDNPurpose-built; SQL or KV would be a waste of money and latency
Search (by text)Inverted-index queries, scoring, filtersSearch engine (Elasticsearch/OpenSearch) indexed from the primary storeNo OLTP DB does this well at scale
Metrics / analytics eventsVery high-write, append-only, aggregate-only readsKafka -> columnar warehouse (ClickHouse, BigQuery, Snowflake)OLAP shape; OLTP stores collapse under this

This is a polyglot persistence design. Each store is chosen for its native workload; the primary is SQL or the wide-column store, and other stores are updated via event streams.

For a URL shortener, by contrast, a single wide-column or KV store is usually enough, because the access pattern is a single point lookup and writes are modest.

Concrete Example 2: E-Commerce Checkout

Four workloads, four stores:

WorkloadNative shapeChosen storeRejected alternative
Product catalogRead-heavy, point lookup by SKU, some faceted searchDocument store (MongoDB) with SKU as primary key + search engine for faceted searchRejected SQL-only because faceted search on LIKE clauses performs terribly
InventoryStrict consistency required -- must not oversell; high contention on popular SKUs at sale timeSQL (Postgres) with row-level locking, sharded by SKURejected Redis counter because audit/recovery is harder and audit/regulatory requirement demands a transactional trail
OrdersWrite-heavy during peak; read by user_id and by order_id; one year of history onlineSQL (Postgres) primary, sharded by customer_id; one-year retention hot, archived to object store afterRejected wide-column because reporting and admin queries need ad-hoc SQL
Session / cartHigh read-write, ephemeral, TTL-expiringRedis (single source for cart during session) + async write-through to durable store on checkoutRejected SQL because TTL-based eviction and sub-ms latency don't match SQL workload

The dominant architectural point: different workloads, different stores, orchestrated by events. The primary store for money-touching operations stays transactional SQL; everything else is chosen for its native access pattern.

Common Confusion / Misconceptions

"NoSQL scales; SQL does not." False. Modern Postgres and MySQL handle millions of writes/sec with sharding and proper schema. "NoSQL scales" shortcut is usually "I do not want to think about schema". Both scale; they scale differently.

"We can use one database for everything." You can; you usually should not. Forcing analytics on an OLTP store will either cripple the OLTP store or cripple the analytics. Forcing full-text search on SQL's LIKE queries is a known incident factory.

"Strongly consistent is always better." No. Strong consistency usually means synchronous cross-AZ writes, which raises P99 and reduces availability. Choose strong consistency where the business requires it (money, identity, auth) and eventual consistency where the business can tolerate it (counters, feeds, search).

"Partitioning key is a detail." Partitioning key is the storage decision. See Cluster 3 concept 8. Picking the wrong partitioning key will rebuild itself as "our hot shard is melting under celebrity load" in Cluster 4.

"Polyglot persistence is over-engineering." Sometimes. For a small system, a single Postgres really is the right answer. Polyglot justifies itself when one workload's requirements are structurally incompatible with another's (full-text + OLTP, high-write time-series + ad-hoc SQL). Name the incompatibility before adding a store.

How To Use It

A seven-line storage sketch per workload:

  1. Workload name (e.g., "Timeline").
  2. Access pattern in one sentence.
  3. Consistency requirement in one phrase ("read-your-write per user", "bounded staleness", "strict serializable").
  4. Rough size today and in 1-2 years.
  5. Chosen store family (SQL / wide-column / KV / document / search / object / time-series / graph).
  6. Partitioning key (if distributed).
  7. One sentence naming the rejected alternative and the reason.

Do this once per workload. Three to five workloads is typical. A good design has at most three primary stores; extras are specialized read models fed by events.

Transfer / Where This Shows Up Later

  • Cluster 3 concept 8 (data model) refines the partitioning-key decision per store.
  • Cluster 3 concept 9 (consistency) makes the consistency contract explicit per mutable table.
  • Cluster 4 concept 10 (scale) audits whether the chosen store survives 10× and 100×.
  • S8M3 (data patterns) is an entire module on the pros/cons of each store family, with deeper coverage of polyglot architectures.
  • S9 (cloud) maps store families to managed services (RDS/Aurora for SQL, DynamoDB/Cassandra for KV/wide-column, S3 for blobs, OpenSearch/Elasticsearch for search).
  • S10 capstone/interviews: the single most-asked follow-up is "why this database?" -- a good seven-line sketch per workload answers it before the question.

Check Yourself

  1. Name one workload where Postgres beats DynamoDB and one where DynamoDB beats Postgres. Why, in one sentence each?
  2. Why does a wide-column store usually require you to know the query shape before the schema, while SQL lets you query flexibly post-hoc?
  3. What would go wrong if a social feed's timeline was stored in a single SQL table with WHERE user_id = ? as the read query, at 1M reads/sec?
  4. Defend a polyglot choice using two workload incompatibilities.

Mini Drill or Application

For each prompt, produce a seven-line storage sketch per workload, in under ten minutes total:

  1. Chat system: user accounts, message store, unread-count per conversation, message search.
  2. Ride-share: users, live driver location, trip history, pricing audit log.
  3. E-commerce: catalog, inventory, orders, reviews, recommendations.

Then explicitly name one workload that would tempt you into the wrong choice if you hurried, and why.

Read This Only If Stuck