Module 1: Relational Databases & SQL: Case Studies
These case studies connect relational modeling, SQL, constraints, migrations, and query planning to real engineering work. Use them after the concept clusters and before the project. The point is not to memorize the systems; it is to see the line from concept -> constraint -> decision -> consequence.
How To Use These Case Studies
For each case study:
- Read the scenario and identify the core relational idea.
- Write the schema, query, plan, or migration path yourself.
- Name the failure mode before reading the repair.
- Produce the artifact listed at the end.
- Add one mistake-log entry if your first answer was wrong.
Required evidence for this page:
- one schema diagram
- one query rewrite
- one
EXPLAINor query-plan note - one migration plan
- one paragraph connecting the case to a later Semester 6 module
Case Study 1: Referential Integrity Is Not Optional Bookkeeping
Scenario: An e-commerce system stores customers, orders, and order_items. Early versions let the application enforce relationships. Over time, import scripts, admin tools, retries, and bug fixes create rows that the application did not expect: orders pointing at deleted customers, order items pointing at missing products, and nullable columns that later code assumes are always present.
Module concept: primary keys, foreign keys, OT NULL, CHECK, UNIQUE`, and referential integrity.
Source anchor: PostgreSQL's constraints documentation explains that constraints add control beyond data types; primary keys require uniqueness and non-nullness, and foreign keys maintain references between related tables. It also cautions that CHECK constraints are for the new or updated row, not cross-row/table guarantees; use UNIQUE, EXCLUDE, or FOREIGN KEY for those restrictions where possible. See PostgreSQL constraints.
Bad Design
CREATE TABLE orders (
id bigint PRIMARY KEY,
customer_id bigint,
status text,
total_cents integer
);
CREATE TABLE order_items (
id bigint PRIMARY KEY,
order_id bigint,
product_id bigint,
quantity integer,
unit_price_cents integer
);
The application "knows" that customer_id, order_id, product_id, and quantity should be valid. The database does not.
Better Design
CREATE TABLE customers (
id bigint PRIMARY KEY,
email text NOT NULL UNIQUE
);
CREATE TABLE products (
id bigint PRIMARY KEY,
sku text NOT NULL UNIQUE,
active boolean NOT NULL DEFAULT true
);
CREATE TABLE orders (
id bigint PRIMARY KEY,
customer_id bigint NOT NULL REFERENCES customers(id),
status text NOT NULL CHECK (status IN ('draft', 'paid', 'shipped', 'cancelled')),
total_cents integer NOT NULL CHECK (total_cents >= 0)
);
CREATE TABLE order_items (
id bigint PRIMARY KEY,
order_id bigint NOT NULL REFERENCES orders(id) ON DELETE CASCADE,
product_id bigint NOT NULL REFERENCES products(id),
quantity integer NOT NULL CHECK (quantity > 0),
unit_price_cents integer NOT NULL CHECK (unit_price_cents >= 0),
UNIQUE (order_id, product_id)
);
Engineering Decision
Let the database protect invariants that must hold regardless of caller. Let the application handle workflow rules that are contextual, user-facing, or require external state.
Tradeoff
| Choice | Gain | Cost |
|---|---|---|
| Database constraints | prevent impossible states from every write path | migrations and imports must satisfy stricter rules |
| Application-only validation | flexible and easy to change | every caller must remember the rule forever |
ON DELETE CASCADE | simplifies cleanup of dependent rows | dangerous if parent deletion is too easy |
RESTRICT / default behavior | safer for critical data | requires explicit cleanup flow |
Failure Mode
The failure is not only "bad data." It is invisible contract drift: one team writes code assuming every order has a customer; another tool creates orphaned orders; a later report silently miscounts revenue.
Artifact
Draw the ER model and mark which invariants belong in:
- data type
- OT NULL`
CHECKUNIQUEFOREIGN KEY- application service
Then write one paragraph explaining why cross-table business rules should not be forced into a CHECK constraint.
Project / Capstone Connection
Use the ER model and invariant split as the starting point for the Semester 6 project schema, then carry the same ownership boundaries into later indexing, transaction, and replication decisions.
Case Study 2: The SQL Logical Pipeline Explains "Why Can't I Filter On This Window Result?"
Scenario: A product team wants the top two purchases per customer. A learner tries to write WHERE row_number() <= 2 in the same query that computes row_number().
Module concept: SQL logical pipeline, subqueries, CTEs, and window functions.
Source anchor: PostgreSQL's window-function tutorial explains that window functions calculate across related rows while preserving row identity, and that they are allowed only in the SELECT list and ORDER BY; to filter after a window calculation, use a sub-select. See PostgreSQL window functions tutorial.
Wrong Query
SELECT
customer_id,
order_id,
total_cents,
row_number() OVER (
PARTITION BY customer_id
ORDER BY total_cents DESC
) AS purchase_rank
FROM orders
WHERE purchase_rank <= 2;
This is wrong because WHERE is logically evaluated before the SELECT output alias exists and before the window result is available.
Correct Query
SELECT *
FROM (
SELECT
customer_id,
id AS order_id,
total_cents,
row_number() OVER (
PARTITION BY customer_id
ORDER BY total_cents DESC, id
) AS purchase_rank
FROM orders
WHERE status = 'paid'
) ranked_orders
WHERE purchase_rank <= 2
ORDER BY customer_id, purchase_rank;
Engineering Decision
Use a subquery or CTE when a later logical phase needs to consume a value produced by an earlier nested query. Do not fight the pipeline; make the pipeline explicit.
Tradeoff
| Technique | Use when | Risk |
|---|---|---|
GROUP BY | one output row per group | loses row identity |
| window function | each row must keep identity while seeing its partition | easy to misuse in WHERE/HAVING |
| subquery / CTE | you need to filter or reuse derived results | can hide cost if you stop reading plans |
Failure Mode
The learner memorizes syntax but not phase order. That produces fragile SQL: aliases used too early, aggregates mixed with row-level filters, and window outputs treated as if they existed in WHERE.
Artifact
Create a pipeline diagram for the corrected query:
FROM orders
-> WHERE status = 'paid'
-> SELECT row_number() OVER (...)
-> outer WHERE purchase_rank <= 2
-> ORDER BY
Then write a second version using a CTE and compare readability.
Project / Capstone Connection
Reuse the pipeline diagram format for project queries that mix filtering, ranking, aggregation, or window logic so query reviews stay tied to SQL phase order.
Case Study 3: N+1 Queries Are A Relational Failure Hiding In Application Code
Scenario: A web page lists 20 issues. Each issue shows author, assignee, labels, and the latest comment. The first implementation loads issues in one query and then lazily loads related data inside a loop. It works in development, but production latency grows with every row shown.
Module concept: joins, query count, relationship loading, N+1 query pattern.
Source anchor: GitLab's QueryRecorder documentation describes tests that count SQL queries and catch N+1 patterns by comparing query counts before and after adding records. It warns that merge requests should not silently increase query counts and suggests adding tests when preloading relationships. See GitLab QueryRecorder.
Bad Access Pattern
issues = Issue.where(project_id: project.id).limit(20)
issues.each do |issue|
puts issue.author.name
puts issue.assignee.name
puts issue.labels.map(&:title)
end
This looks like one page query, but it may become:
1 query for issues
20 queries for authors
20 queries for assignees
20 queries for labels
Better Access Pattern
The SQL-oriented repair is to decide what shape the page needs and fetch relationships deliberately:
SELECT
issues.id,
issues.title,
authors.name AS author_name,
assignees.name AS assignee_name
FROM issues
JOIN users authors ON authors.id = issues.author_id
LEFT JOIN users assignees ON assignees.id = issues.assignee_id
WHERE issues.project_id = $1
ORDER BY issues.created_at DESC
LIMIT 20;
For many-to-many labels, use a separate preloaded query keyed by the visible issue IDs, or aggregate deliberately if the output contract allows it.
Engineering Decision
Do not judge query cost by the number of lines in application code. Judge it by:
- number of SQL statements
- rows scanned
- rows returned
- join shape
- plan stability
- indexes supporting the access pattern
Tradeoff
| Approach | Gain | Cost |
|---|---|---|
| Lazy loading | simple local code | query count grows with rows |
| Large join | fewer round trips | duplicate rows and harder pagination if joins multiply cardinality |
| Explicit preload | predictable query count | requires knowing page shape |
| Aggregated SQL | compact result | can mix presentation concerns into query layer |
Failure Mode
The application hides relational access. The database is doing many tiny lookups, but the developer sees one loop.
Artifact
Write a query-count test in pseudocode:
baseline_count = count_queries { render_page(issue_count: 5) }
larger_count = count_queries { render_page(issue_count: 10) }
assert larger_count <= baseline_count + allowed_constant
Then explain what relationship must be preloaded when the assertion fails.
Project / Capstone Connection
Apply the same query-count discipline to any project list page or report view before it becomes a hidden latency bottleneck.
Case Study 4: Reading EXPLAIN Before Adding Random Indexes
Scenario: A reporting query is slow:
SELECT *
FROM orders
WHERE customer_id = $1
AND status = 'paid'
ORDER BY created_at DESC
LIMIT 50;
Someone proposes "add indexes on every column in the WHERE clause." That may help, but it may also waste write performance and still fail to support the sort.
Module concept: EXPLAIN, sequential scan vs index scan, multicolumn indexes, sort avoidance.
Source anchors: PostgreSQL's EXPLAIN docs state that the plan shows how tables are scanned and joined, including sequential scans, index scans, and join algorithms. The performance tips explain that plan nodes form a tree and that EXPLAIN ANALYZE executes the query to compare estimates with real row counts and timing. PostgreSQL's multicolumn-index docs explain that B-tree indexes are most effective when constraints use the leading columns. See EXPLAIN, Using EXPLAIN, and Multicolumn indexes.
Bad Repair
CREATE INDEX orders_customer_id_idx ON orders(customer_id);
CREATE INDEX orders_status_idx ON orders(status);
CREATE INDEX orders_created_at_idx ON orders(created_at);
This may not match the query's combined access pattern.
Better Hypothesis
For "one customer's recent paid orders", a composite index may fit:
CREATE INDEX orders_customer_status_created_idx
ON orders(customer_id, status, created_at DESC);
The intended plan shape is:
Index Scan using orders_customer_status_created_idx
Index Cond: customer_id = $1 AND status = 'paid'
returns rows already near created_at DESC order
stops after LIMIT 50
Engineering Decision
Read the plan before and after. Do not add an index unless you can state which scan, filter, join, or sort it is meant to change.
Tradeoff
| Index choice | Gain | Cost |
|---|---|---|
| single-column indexes | reusable for varied queries | may not support combined filter+sort well |
| composite index | strong for one access pattern | order matters; less general |
| covering index | fewer heap reads | larger index and more write overhead |
| no new index | preserves write speed | slow read path remains |
Failure Mode
Index cargo culting: every slow query gets another index, but nobody can explain why the planner should use it.
Artifact
Write an EXPLAIN review note with these fields:
Query:
Current plan:
Worst node:
Estimated rows vs actual rows:
Proposed index:
Expected plan change:
Write overhead accepted:
Rollback plan:
Project / Capstone Connection
Use the EXPLAIN review note as the standard artifact for every new project index so read-path gains are justified against write amplification.
Case Study 5: Materialized Views Are Read Models With A Freshness Contract
Scenario: A dashboard shows daily revenue by product category. Computing it live requires scanning and grouping millions of rows. The product team wants the dashboard to load quickly, but finance wants the numbers to be explainable.
Module concept: views, materialized views, refresh strategy, denormalized read models.
Source anchors: PostgreSQL's materialized-view docs explain that materialized views persist query results in table-like form and can be refreshed from the stored query. The REFRESH MATERIALIZED VIEW docs explain that refresh replaces contents, and CONCURRENTLY can avoid blocking concurrent selects when requirements are met. See Materialized Views and REFRESH MATERIALIZED VIEW.
Live Query
SELECT
date_trunc('day', orders.paid_at) AS day,
products.category_id,
sum(order_items.quantity * order_items.unit_price_cents) AS revenue_cents
FROM orders
JOIN order_items ON order_items.order_id = orders.id
JOIN products ON products.id = order_items.product_id
WHERE orders.status = 'paid'
GROUP BY 1, 2;
Correct, but expensive if run often.
Materialized Read Model
CREATE MATERIALIZED VIEW daily_category_revenue AS
SELECT
date_trunc('day', orders.paid_at) AS day,
products.category_id,
sum(order_items.quantity * order_items.unit_price_cents) AS revenue_cents
FROM orders
JOIN order_items ON order_items.order_id = orders.id
JOIN products ON products.id = order_items.product_id
WHERE orders.status = 'paid'
GROUP BY 1, 2;
CREATE UNIQUE INDEX daily_category_revenue_key
ON daily_category_revenue(day, category_id);
Refresh path:
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_category_revenue;
Engineering Decision
Use a materialized view when read speed matters more than instant freshness and the refresh contract is explicit.
Tradeoff
| Choice | Gain | Cost |
|---|---|---|
| live aggregate query | always current | expensive and latency-sensitive |
| materialized view | fast reads and reusable query definition | stale between refreshes |
| summary table maintained by app/jobs | flexible incremental updates | more custom correctness burden |
Failure Mode
Calling a materialized view "just a cache" without defining freshness. If the dashboard says "today", does that mean real-time, last refresh, or closed accounting day?
Artifact
Write a freshness contract:
Dashboard name:
Max tolerated staleness:
Refresh schedule:
Refresh owner:
What happens if refresh fails:
How user sees last refreshed time:
Then explain why CONCURRENTLY needs a suitable unique index and why only one refresh can run at a time for a materialized view.
Project / Capstone Connection
Bring this freshness contract forward when the project adds dashboards, summary tables, or asynchronous read models that trade immediacy for speed.
Case Study 6: Online Migration From Embedded Data To A Relational Table
Scenario: A billing system originally stores subscription data inside a customer record. New requirements allow multiple subscriptions per customer and queries like "find active subscriptions expiring this week." Embedded subscription data now causes large customer updates and awkward scans.
Module concept: schema migrations, normalization, backfill, dual writes, read-path migration, write-path migration.
Source anchor: Stripe's engineering writeup on online migrations describes a four-phase pattern for large online migrations: dual-write to old and new stores, move reads, move writes, and remove old data. It frames the migration around uptime and accuracy because services must continue operating during the change. See Stripe: Online migrations at scale.
Original Shape
customers
id
email
subscription_status
subscription_plan
subscription_renews_at
This design works while there is exactly one subscription per customer. It breaks when multiplicity changes.
Target Shape
CREATE TABLE subscriptions (
id bigint PRIMARY KEY,
customer_id bigint NOT NULL REFERENCES customers(id),
plan_id bigint NOT NULL REFERENCES plans(id),
status text NOT NULL CHECK (status IN ('trialing', 'active', 'past_due', 'cancelled')),
renews_at timestamptz,
created_at timestamptz NOT NULL,
updated_at timestamptz NOT NULL
);
CREATE INDEX subscriptions_customer_status_idx
ON subscriptions(customer_id, status);
Safer Migration Plan
| Phase | Action | Evidence |
|---|---|---|
| 1. Expand | create new table while old columns remain | migration can deploy without code break |
| 2. Dual write | write subscription changes to both locations | metrics comparing old/new values |
| 3. Backfill | copy historical customer subscription data into subscriptions | idempotent job with progress logs |
| 4. Read switch | move reads to subscriptions | shadow-read comparison or sampled diff |
| 5. Write switch | stop writing old columns | alerts for accidental old writes |
| 6. Contract | remove old columns later | post-deploy migration and rollback note |
Engineering Decision
Normalize when the data has independent identity, multiplicity, lifecycle, and query needs. Migrate in phases when the system cannot stop.
Tradeoff
| Choice | Gain | Cost |
|---|---|---|
| keep embedded fields | simple old code | blocks multiple subscriptions and efficient queries |
| immediate cutover | simple mental model | high downtime and correctness risk |
| phased online migration | preserves uptime and allows verification | longer period of dual-write complexity |
Failure Mode
Changing the schema and all code paths in one deploy. If anything is missed, either old code cannot read new data or new code cannot trust old data.
Artifact
Write a migration ADR with:
- old model
- target model
- expand phase
- dual-write contract
- backfill idempotency rule
- read-switch verification
- contract/removal phase
- rollback point for each phase
Project / Capstone Connection
Use the migration ADR template in the Semester 6 project whenever a schema change cannot be treated as a one-shot DDL operation.
Case Study 7: GitLab-Style Post-Deployment Migrations Separate Code Rollout From Data Mutation
Scenario: A Rails application wants to remove a column used by the old version of the application. Removing the column in the same migration as the deploy risks downtime because old application processes may still reference it.
Module concept: schema migration ordering, backward compatibility, post-deployment migration, blast-radius control.
Source anchor: GitLab's post-deployment migration docs describe migrations that can run after deployment, including the case where a column removal would otherwise require downtime. GitLab's release docs describe a post-deploy migration pipeline that executes migrations separately and checks production conditions. See GitLab post deployment migrations and GitLab post-deploy migration pipeline.
Unsafe Plan
deploy code + remove column in same release
Risk: one old process still reads the column and fails.
Safer Plan
Release A:
code stops depending on old column
database keeps old column
Post-deploy migration:
remove or clean up old column after new code is active
Release B:
remove dead compatibility code
Engineering Decision
For high-traffic systems, schema changes are deployment choreography, not just DDL. Separate changes that must be visible to old code from changes that can happen only after new code is everywhere.
Tradeoff
| Choice | Gain | Cost |
|---|---|---|
| one-step migration | easy locally | downtime or rollback risk |
| post-deploy migration | safer rollout | extra operational process |
| batched background migration | controls load | may take days/weeks and needs monitoring |
Failure Mode
Thinking "migration passed in staging" means "safe in production." Production has mixed versions, long-running jobs, replicas, bigger tables, and change windows.
Artifact
Create a deployment timeline showing:
- code version before deploy
- code version after deploy
- when the old column is still required
- when it becomes safe to drop
- rollback before and after the post-deploy step
Project / Capstone Connection
Reuse the deployment timeline during project release planning so schema compatibility is reviewed separately from application rollout.
Case Study 8: Denormalization Is A Decision, Not A Shortcut
Scenario: A support dashboard needs to display ticket count, last response time, customer plan, and account health for every customer. A normalized query joins customers, plans, tickets, messages, and health_scores. The page is slow and complex.
Module concept: normalization, denormalization, views, materialized views, query ownership.
Source anchor: PostgreSQL's materialized-view documentation frames persisted read models as table-like query results that must be refreshed from a stored query, which is the right anchor when denormalization is a deliberate read-model decision instead of a second source of truth. See Materialized Views.
Fully Normalized Query Shape
SELECT
customers.id,
customers.name,
plans.name AS plan_name,
count(tickets.id) AS open_ticket_count,
max(messages.created_at) AS last_response_at,
health_scores.score
FROM customers
JOIN plans ON plans.id = customers.plan_id
LEFT JOIN tickets ON tickets.customer_id = customers.id
LEFT JOIN messages ON messages.ticket_id = tickets.id
LEFT JOIN health_scores ON health_scores.customer_id = customers.id
WHERE tickets.status = 'open'
GROUP BY customers.id, customers.name, plans.name, health_scores.score;
This can be correct and still not be the right read path for a frequently refreshed dashboard.
Denormalized Read Model
CREATE TABLE customer_support_summaries (
customer_id bigint PRIMARY KEY REFERENCES customers(id),
plan_name text NOT NULL,
open_ticket_count integer NOT NULL,
last_response_at timestamptz,
health_score numeric,
refreshed_at timestamptz NOT NULL
);
Engineering Decision
Keep the normalized source of truth. Add a denormalized read model only when you can define:
- owner
- refresh/update mechanism
- acceptable staleness
- reconciliation query
- repair path
Tradeoff
| Choice | Gain | Cost |
|---|---|---|
| normalized live query | one source of truth | expensive joins and grouping |
| denormalized table | fast dashboard | duplicate data and drift risk |
| materialized view | database-owned query definition | refresh constraints and staleness |
| application cache | flexible | harder to audit with SQL alone |
Failure Mode
Denormalization becomes a second source of truth. Nobody knows whether the dashboard or normalized tables are authoritative.
Artifact
Write a reconciliation query that compares the summary table against the normalized source for 20 sampled customers. Add a "drift budget" statement: how much mismatch is acceptable, for how long, and who fixes it?
Project / Capstone Connection
Use the reconciliation query and drift budget whenever the project introduces a denormalized dashboard, cache table, or any read path that duplicates normalized source data.
Source Map
Use these sources for deeper study:
| Source | Use it for |
|---|---|
| PostgreSQL constraints | keys, foreign keys, CHECK, UNIQUE, referential integrity |
| PostgreSQL window functions tutorial | row identity, partitions, ranking, filtering after window calculations |
| GitLab QueryRecorder | N+1 query detection and query-count tests |
| PostgreSQL EXPLAIN and Using EXPLAIN | plan reading, scan nodes, join nodes, estimates vs actuals |
| PostgreSQL multicolumn indexes | composite-index order and when multicolumn indexes help |
| PostgreSQL materialized views and REFRESH MATERIALIZED VIEW | persisted read models and refresh behavior |
| Stripe online migrations | dual-write migration pattern for high-availability systems |
| GitLab post deployment migrations | separating code rollout from dangerous data mutation |
Completion Standard
A case study is complete when you have:
- explained the concept, constraint, decision, and consequence
- produced the requested artifact
- named one wrong approach and why it fails
- connected the case to Module 2, 3, 4, or 5 of Semester 6