Module 1: Relational Databases & SQL
Primary text: Database System Concepts (Silberschatz, Korth, Sudarshan)
Selective support: Designing Data-Intensive Applications (Kleppmann) for modern architectural framing; Database Internals (Petrov) for occasional implementation peeks; PostgreSQL and SQLite official docs for exact syntax and semantics
This guide is the primary teacher. You do not need to read the source books front-to-back to complete this module. You do need to become operationally strong at the relational model, the SQL pipeline, schema design, constraints, and reading basic query plans before moving into storage engines, replication, and transactions.
Scope of This Module
This module is not a SQL cheat sheet. It is where data modeling becomes something you can defend, and where "SQL works" becomes "I can predict what SQL will do and how the engine will run it."
What it covers in depth:
- the relational model as tuples over domains, relations as sets, and why the model has lasted for fifty years
- keys (primary, candidate, foreign, surrogate vs natural) and why referential integrity is a feature, not a ceremony
- relational algebra as the semantic core behind SQL
- the SQL logical pipeline (FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY) and why running order differs from reading order
- JOINs (inner, outer, semi, anti) and set operations (UNION, INTERSECT, EXCEPT)
- subqueries, CTEs, and window functions as the three ways SQL composes
- functional dependencies and normal forms (1NF, 2NF, 3NF, BCNF), plus when denormalization is the right answer
- ER modeling and translation to tables
- constraints (NOT NULL, UNIQUE, CHECK, FOREIGN KEY, cascading) and how they protect data from your future self
- schema migrations, including online migrations, reversibility, and blast-radius control
- views, materialized views, and stored procedures
- reading an
EXPLAIN/EXPLAIN ANALYZEplan enough to diagnose a slow query - scan vs seek, the N+1 query pattern, and why a correct query can still be a bad query
- transactions at the SQL surface:
BEGIN,COMMIT,ROLLBACK, isolation levels preview
What it deliberately does not try to finish here:
- B-tree and LSM-tree internals (Module 2)
- replication, partitioning, and leader election (Modules 3 and 5)
- isolation-level implementations and concurrency-control protocols (Module 4)
- NoSQL, graph, and document databases beyond a bridging comparison
This is the relational foundation the rest of Semester 6 depends on.
Before You Start
Answer these closed-book before starting the main path:
- Why is a relation a set of tuples and not a list, and what consequence does that have for SQL result semantics?
- What is the difference between a primary key, a candidate key, and a surrogate key?
- If an application issues
SELECT ... FROM orders JOIN customers ON ... WHERE ..., in what logical order does the engine evaluate the clauses? - Given
Orders(customer_id, total)and a query that returns the top spender per country, where would you reach forGROUP BY, a window function, or a subquery? - If a query is slow, what is the first thing you ask the database to show you before changing any code?
Diagnostic Interpretation
4-5 solid answers
- You are ready for the full path.
2-3 solid answers
- Continue, but expect extra time in the SQL pipeline and normalization clusters.
0-1 solid answers
- Revisit Semester 1 Module 1 (sets, predicate logic) and Semester 2 Module 1 (algorithmic reasoning) briefly. The relational model inherits its structure from both.
What This Module Is For
Relational databases are still the default store for correctness-critical data, and SQL is still the language in which most data questions are asked in industry. Throughout your career you will repeatedly be asked:
- can this schema answer every question we have without contortions?
- will this query return the right answer, and will it still return it in six months?
- why is this query slow, and is it the query, the schema, or the index?
- can two transactions step on each other, and what is the contract with the caller?
- how do we add a column to a billion-row table without downtime?
This module builds the relational reasoning needed for:
- storage engines and indexes (Module 2)
- replication and partitioning (Module 3)
- transactions and consistency (Module 4)
- distributed systems tradeoffs, where "the database" often becomes several cooperating databases (Module 5)
- backend engineering, where data correctness usually matters more than framework taste
You are learning to think about data without handwaving.
Concept Map
How To Use This Module
Work in order. The later clusters only make sense if the earlier modeling and query habits are stable.
Cluster 1: The Relational Model
| Order | Concept | Type | Focus |
|---|---|---|---|
| 1 | Relations, Tuples, Attributes, Domains | PRIMARY | Codd's model: tables as sets of tuples over domains |
| 2 | Keys: Primary, Candidate, Foreign, Surrogate vs Natural | PRIMARY | Identity, uniqueness, and referential integrity |
| 3 | Relational Algebra: Selection, Projection, Join, Union | PRIMARY | The closed algebra that gives SQL its semantics |
Cluster mastery check: Can you take an English question, translate it into relational algebra, and then defend why your SQL matches that algebra?
Cluster 2: SQL as a Query Language
| Order | Concept | Type | Focus |
|---|---|---|---|
| 4 | The SQL Logical Pipeline: SELECT/FROM/WHERE/GROUP BY/HAVING/ORDER BY | PRIMARY | Clause evaluation order and why it differs from reading order |
| 5 | JOINs and Set Operations: Inner, Outer, Semi, Anti | PRIMARY | Every join as a filter on a Cartesian product, plus UNION/INTERSECT/EXCEPT |
| 6 | Subqueries, CTEs, and Window Functions | PRIMARY | The three ways SQL composes queries |
Cluster mastery check: Given an English business question, can you write a correct SQL query, say in what order its clauses run, and name at least one equivalent rewrite?
Cluster 3: Schema Design and Normalization
| Order | Concept | Type | Focus |
|---|---|---|---|
| 7 | Functional Dependencies and Normal Forms: 1NF, 2NF, 3NF, BCNF | PRIMARY | FDs, closure, decomposition, and lossless-join design |
| 8 | Denormalization: When and Why | SUPPORTING | Read-path tradeoffs, redundancy vs performance |
| 9 | ER Modeling and Translation to Tables | PRIMARY | Entities, relationships, cardinalities, and disciplined translation |
Cluster mastery check: Given a domain description, can you produce an ER diagram, translate it to 3NF tables, and explain every denormalization you chose to keep?
Cluster 4: Constraints, Integrity, and DDL
| Order | Concept | Type | Focus |
|---|---|---|---|
| 10 | Constraints: NOT NULL, UNIQUE, CHECK, FOREIGN KEY, Cascading | PRIMARY | Integrity enforced by the database rather than hoped for by the app |
| 11 | Schema Migrations: Adding/Removing Columns, Online Migrations | PRIMARY | Reversible, forward-compatible, low-blast-radius DDL |
| 12 | Views, Materialized Views, and Stored Procedures | SUPPORTING | Abstraction, precomputation, and server-side logic |
Cluster mastery check: Can you add, remove, or backfill a column on a large table without downtime, and can you say what every FOREIGN KEY ... ON DELETE clause will do under stress?
Cluster 5: SQL Performance Basics
| Order | Concept | Type | Focus |
|---|---|---|---|
| 13 | Reading an EXPLAIN / EXPLAIN ANALYZE Plan | PRIMARY | Nodes, rows, costs, and the estimated-vs-actual discipline |
| 14 | The Cost of a Bad Query: Scan vs Seek, N+1 | PRIMARY | Where queries go wrong and how you recognize it early |
| 15 | Transactions at the SQL Level: BEGIN, COMMIT, ROLLBACK, Isolation Preview | SUPPORTING | The SQL-surface view of ACID, leaving deep mechanics for Module 4 |
Cluster mastery check: Given a slow query and its EXPLAIN ANALYZE output, can you point to the expensive node, say why it is expensive, and propose either an index, a rewrite, or a schema change?
Then work these practice pages:
| Order | Practice path | Focus |
|---|---|---|
| 1 | Relational Algebra to SQL Workshop | Pipeline order, joins, subqueries, CTEs, and window functions |
| 2 | Schema Design Lab | Sample domain design, keys, and ER-to-SQL translation |
| 3 | Query Performance Clinic | EXPLAIN, scan-vs-seek, and migration-side query drills |
| 4 | SQL Code Katas | Timed SQL and schema drills |
Use Module Quiz after the concept and practice path. Use Reference and Selective Reading and Learning Resources only for targeted reinforcement.
Learning Objectives
By the end of this module you should be able to:
- Define a relation formally and explain why relational results are sets, not lists.
- Distinguish primary, candidate, foreign, surrogate, and natural keys, and justify a choice in a real schema.
- Translate an English question into relational algebra and then into SQL, and explain why both match.
- State the logical evaluation order of a SQL query and use it to predict results and errors.
- Write and read inner, outer, semi, and anti joins, plus
UNION,INTERSECT, andEXCEPT, with correct semantics for NULLs and duplicates. - Use subqueries, CTEs, and window functions, and choose the simplest composition that solves a given problem.
- Identify functional dependencies, decompose a schema to 3NF or BCNF, and justify any denormalization.
- Produce an ER diagram for a non-trivial domain and translate it to tables with complete constraints.
- Write DDL with
NOT NULL,UNIQUE,CHECK, andFOREIGN KEY(includingON DELETE/ON UPDATE) that prevents invalid states. - Read an
EXPLAIN ANALYZEplan, diagnose scan-vs-seek and N+1 issues, and propose targeted fixes.
Outputs
- a relational-modeling notebook with at least 3 end-to-end domain designs (ER diagram, schema DDL, written tradeoff notes)
- a SQL catalog with at least 25 queries covering joins, subqueries, CTEs, window functions, and set operations, each with its own test data and expected output
- a normalization sheet showing at least 3 schemas walked from 1NF through BCNF with decomposition steps
- a constraints gallery including at least 8
CREATE TABLEstatements that exercise every constraint type and every cascade mode - a migration log of at least 4 online-migration playbooks (add column, drop column, rename, backfill), each with rollback plan
- an
EXPLAINjournal with at least 10 before/after query-plan comparisons - a mistake log naming at least 10 recurring SQL or design errors (
GROUP BY wrong columns,NULL != NULL confusion,implicit cross join,forgotten cascade,1+N query, etc.) - one short memo connecting Module 1 habits to the indexing, transaction, and distributed-system modules ahead
Completion Standard
You have completed Module 1 when all of these are true:
- you can model a new domain as an ER diagram and defend the schema it produces
- you can write a non-trivial SQL query and predict its result shape and types before running it
- you can explain the logical evaluation order of any SQL statement you write
- you can normalize a table to 3NF and say, concretely, which anomalies each normal form eliminated
- you can add a constraint to a running system without corrupting existing data
- you can read an
EXPLAIN ANALYZEplan and propose at least one fix for a slow query - you can run a transaction and say, in words, what would happen to other sessions at
READ COMMITTEDversusSERIALIZABLE
If a query "returns what you expected on the sample data" but you cannot say why under different data, the module is not complete.
Reading Policy
- Concept pages are the main path.
- Local book chunks are selective reinforcement, not a second syllabus.
- PostgreSQL and SQLite official docs are authoritative for exact syntax and semantics; reach for them when the question is "what does this do" rather than "what does this mean".
Read only if stuckmeans try the concept page, self-check, and drill first.Optional deep divemeans additional nuance or exercise volume, not required progression.- Because this is the foundation for the rest of Semester 6, written modeling justifications and query explanations are required, not optional enrichment.
Suggested Weekly Flow
| Day | Work |
|---|---|
| 1 | Concepts 1-3, translate 5 English questions into relational algebra |
| 2 | Concepts 4-5, write 8 queries across JOINs and set operations |
| 3 | Concept 6, rewrite 4 subquery patterns as CTEs and then as window functions |
| 4 | Concepts 7-9, normalize one given schema to 3NF with written justification |
| 5 | Concepts 10-12, build one constraint gallery on your own schema |
| 6 | Concepts 13-15, run EXPLAIN ANALYZE on 5 queries and annotate the output |
| 7 | Practice 1 (lab) and Practice 2 (workshop) |
| 8 | Practice 3 (clinic), online-migration drill |
| 9 | Practice 4 (code katas), first quiz attempt |
| 10 | Quiz remediation and mistake-log cleanup |
Reference
If you need exact links into the local chunked books, use Reference and Selective Reading.
The Database (Relational) tutorial builds a SQLite-clone in C: parser, planner, executor, B+ tree storage. The natural deep-dive companion to this module. See Build Your Own X overview.
Rich Learning Pages
Worked Examples | Guided Labs | Case Studies | Mistake Clinic | Reading Guide | Capstone Thread
Model Artifact Calibration
For database performance evidence, compare your analysis to the query-plan review model artifact.