Skip to main content

Module 1: Relational Databases & SQL

Primary text: Database System Concepts (Silberschatz, Korth, Sudarshan)
Selective support: Designing Data-Intensive Applications (Kleppmann) for modern architectural framing; Database Internals (Petrov) for occasional implementation peeks; PostgreSQL and SQLite official docs for exact syntax and semantics

This guide is the primary teacher. You do not need to read the source books front-to-back to complete this module. You do need to become operationally strong at the relational model, the SQL pipeline, schema design, constraints, and reading basic query plans before moving into storage engines, replication, and transactions.


Scope of This Module

This module is not a SQL cheat sheet. It is where data modeling becomes something you can defend, and where "SQL works" becomes "I can predict what SQL will do and how the engine will run it."

What it covers in depth:

  • the relational model as tuples over domains, relations as sets, and why the model has lasted for fifty years
  • keys (primary, candidate, foreign, surrogate vs natural) and why referential integrity is a feature, not a ceremony
  • relational algebra as the semantic core behind SQL
  • the SQL logical pipeline (FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY) and why running order differs from reading order
  • JOINs (inner, outer, semi, anti) and set operations (UNION, INTERSECT, EXCEPT)
  • subqueries, CTEs, and window functions as the three ways SQL composes
  • functional dependencies and normal forms (1NF, 2NF, 3NF, BCNF), plus when denormalization is the right answer
  • ER modeling and translation to tables
  • constraints (NOT NULL, UNIQUE, CHECK, FOREIGN KEY, cascading) and how they protect data from your future self
  • schema migrations, including online migrations, reversibility, and blast-radius control
  • views, materialized views, and stored procedures
  • reading an EXPLAIN / EXPLAIN ANALYZE plan enough to diagnose a slow query
  • scan vs seek, the N+1 query pattern, and why a correct query can still be a bad query
  • transactions at the SQL surface: BEGIN, COMMIT, ROLLBACK, isolation levels preview

What it deliberately does not try to finish here:

  • B-tree and LSM-tree internals (Module 2)
  • replication, partitioning, and leader election (Modules 3 and 5)
  • isolation-level implementations and concurrency-control protocols (Module 4)
  • NoSQL, graph, and document databases beyond a bridging comparison

This is the relational foundation the rest of Semester 6 depends on.


Before You Start

Answer these closed-book before starting the main path:

  1. Why is a relation a set of tuples and not a list, and what consequence does that have for SQL result semantics?
  2. What is the difference between a primary key, a candidate key, and a surrogate key?
  3. If an application issues SELECT ... FROM orders JOIN customers ON ... WHERE ..., in what logical order does the engine evaluate the clauses?
  4. Given Orders(customer_id, total) and a query that returns the top spender per country, where would you reach for GROUP BY, a window function, or a subquery?
  5. If a query is slow, what is the first thing you ask the database to show you before changing any code?

Diagnostic Interpretation

4-5 solid answers

  • You are ready for the full path.

2-3 solid answers

  • Continue, but expect extra time in the SQL pipeline and normalization clusters.

0-1 solid answers

  • Revisit Semester 1 Module 1 (sets, predicate logic) and Semester 2 Module 1 (algorithmic reasoning) briefly. The relational model inherits its structure from both.

What This Module Is For

Relational databases are still the default store for correctness-critical data, and SQL is still the language in which most data questions are asked in industry. Throughout your career you will repeatedly be asked:

  • can this schema answer every question we have without contortions?
  • will this query return the right answer, and will it still return it in six months?
  • why is this query slow, and is it the query, the schema, or the index?
  • can two transactions step on each other, and what is the contract with the caller?
  • how do we add a column to a billion-row table without downtime?

This module builds the relational reasoning needed for:

  • storage engines and indexes (Module 2)
  • replication and partitioning (Module 3)
  • transactions and consistency (Module 4)
  • distributed systems tradeoffs, where "the database" often becomes several cooperating databases (Module 5)
  • backend engineering, where data correctness usually matters more than framework taste

You are learning to think about data without handwaving.


Concept Map


How To Use This Module

Work in order. The later clusters only make sense if the earlier modeling and query habits are stable.

Cluster 1: The Relational Model

OrderConceptTypeFocus
1Relations, Tuples, Attributes, DomainsPRIMARYCodd's model: tables as sets of tuples over domains
2Keys: Primary, Candidate, Foreign, Surrogate vs NaturalPRIMARYIdentity, uniqueness, and referential integrity
3Relational Algebra: Selection, Projection, Join, UnionPRIMARYThe closed algebra that gives SQL its semantics

Cluster mastery check: Can you take an English question, translate it into relational algebra, and then defend why your SQL matches that algebra?

Cluster 2: SQL as a Query Language

OrderConceptTypeFocus
4The SQL Logical Pipeline: SELECT/FROM/WHERE/GROUP BY/HAVING/ORDER BYPRIMARYClause evaluation order and why it differs from reading order
5JOINs and Set Operations: Inner, Outer, Semi, AntiPRIMARYEvery join as a filter on a Cartesian product, plus UNION/INTERSECT/EXCEPT
6Subqueries, CTEs, and Window FunctionsPRIMARYThe three ways SQL composes queries

Cluster mastery check: Given an English business question, can you write a correct SQL query, say in what order its clauses run, and name at least one equivalent rewrite?

Cluster 3: Schema Design and Normalization

OrderConceptTypeFocus
7Functional Dependencies and Normal Forms: 1NF, 2NF, 3NF, BCNFPRIMARYFDs, closure, decomposition, and lossless-join design
8Denormalization: When and WhySUPPORTINGRead-path tradeoffs, redundancy vs performance
9ER Modeling and Translation to TablesPRIMARYEntities, relationships, cardinalities, and disciplined translation

Cluster mastery check: Given a domain description, can you produce an ER diagram, translate it to 3NF tables, and explain every denormalization you chose to keep?

Cluster 4: Constraints, Integrity, and DDL

OrderConceptTypeFocus
10Constraints: NOT NULL, UNIQUE, CHECK, FOREIGN KEY, CascadingPRIMARYIntegrity enforced by the database rather than hoped for by the app
11Schema Migrations: Adding/Removing Columns, Online MigrationsPRIMARYReversible, forward-compatible, low-blast-radius DDL
12Views, Materialized Views, and Stored ProceduresSUPPORTINGAbstraction, precomputation, and server-side logic

Cluster mastery check: Can you add, remove, or backfill a column on a large table without downtime, and can you say what every FOREIGN KEY ... ON DELETE clause will do under stress?

Cluster 5: SQL Performance Basics

OrderConceptTypeFocus
13Reading an EXPLAIN / EXPLAIN ANALYZE PlanPRIMARYNodes, rows, costs, and the estimated-vs-actual discipline
14The Cost of a Bad Query: Scan vs Seek, N+1PRIMARYWhere queries go wrong and how you recognize it early
15Transactions at the SQL Level: BEGIN, COMMIT, ROLLBACK, Isolation PreviewSUPPORTINGThe SQL-surface view of ACID, leaving deep mechanics for Module 4

Cluster mastery check: Given a slow query and its EXPLAIN ANALYZE output, can you point to the expensive node, say why it is expensive, and propose either an index, a rewrite, or a schema change?

Then work these practice pages:

OrderPractice pathFocus
1Relational Algebra to SQL WorkshopPipeline order, joins, subqueries, CTEs, and window functions
2Schema Design LabSample domain design, keys, and ER-to-SQL translation
3Query Performance ClinicEXPLAIN, scan-vs-seek, and migration-side query drills
4SQL Code KatasTimed SQL and schema drills

Use Module Quiz after the concept and practice path. Use Reference and Selective Reading and Learning Resources only for targeted reinforcement.


Learning Objectives

By the end of this module you should be able to:

  1. Define a relation formally and explain why relational results are sets, not lists.
  2. Distinguish primary, candidate, foreign, surrogate, and natural keys, and justify a choice in a real schema.
  3. Translate an English question into relational algebra and then into SQL, and explain why both match.
  4. State the logical evaluation order of a SQL query and use it to predict results and errors.
  5. Write and read inner, outer, semi, and anti joins, plus UNION, INTERSECT, and EXCEPT, with correct semantics for NULLs and duplicates.
  6. Use subqueries, CTEs, and window functions, and choose the simplest composition that solves a given problem.
  7. Identify functional dependencies, decompose a schema to 3NF or BCNF, and justify any denormalization.
  8. Produce an ER diagram for a non-trivial domain and translate it to tables with complete constraints.
  9. Write DDL with NOT NULL, UNIQUE, CHECK, and FOREIGN KEY (including ON DELETE / ON UPDATE) that prevents invalid states.
  10. Read an EXPLAIN ANALYZE plan, diagnose scan-vs-seek and N+1 issues, and propose targeted fixes.

Outputs

  • a relational-modeling notebook with at least 3 end-to-end domain designs (ER diagram, schema DDL, written tradeoff notes)
  • a SQL catalog with at least 25 queries covering joins, subqueries, CTEs, window functions, and set operations, each with its own test data and expected output
  • a normalization sheet showing at least 3 schemas walked from 1NF through BCNF with decomposition steps
  • a constraints gallery including at least 8 CREATE TABLE statements that exercise every constraint type and every cascade mode
  • a migration log of at least 4 online-migration playbooks (add column, drop column, rename, backfill), each with rollback plan
  • an EXPLAIN journal with at least 10 before/after query-plan comparisons
  • a mistake log naming at least 10 recurring SQL or design errors (GROUP BY wrong columns, NULL != NULL confusion, implicit cross join, forgotten cascade, 1+N query, etc.)
  • one short memo connecting Module 1 habits to the indexing, transaction, and distributed-system modules ahead

Completion Standard

You have completed Module 1 when all of these are true:

  • you can model a new domain as an ER diagram and defend the schema it produces
  • you can write a non-trivial SQL query and predict its result shape and types before running it
  • you can explain the logical evaluation order of any SQL statement you write
  • you can normalize a table to 3NF and say, concretely, which anomalies each normal form eliminated
  • you can add a constraint to a running system without corrupting existing data
  • you can read an EXPLAIN ANALYZE plan and propose at least one fix for a slow query
  • you can run a transaction and say, in words, what would happen to other sessions at READ COMMITTED versus SERIALIZABLE

If a query "returns what you expected on the sample data" but you cannot say why under different data, the module is not complete.


Reading Policy

  • Concept pages are the main path.
  • Local book chunks are selective reinforcement, not a second syllabus.
  • PostgreSQL and SQLite official docs are authoritative for exact syntax and semantics; reach for them when the question is "what does this do" rather than "what does this mean".
  • Read only if stuck means try the concept page, self-check, and drill first.
  • Optional deep dive means additional nuance or exercise volume, not required progression.
  • Because this is the foundation for the rest of Semester 6, written modeling justifications and query explanations are required, not optional enrichment.

Suggested Weekly Flow

DayWork
1Concepts 1-3, translate 5 English questions into relational algebra
2Concepts 4-5, write 8 queries across JOINs and set operations
3Concept 6, rewrite 4 subquery patterns as CTEs and then as window functions
4Concepts 7-9, normalize one given schema to 3NF with written justification
5Concepts 10-12, build one constraint gallery on your own schema
6Concepts 13-15, run EXPLAIN ANALYZE on 5 queries and annotate the output
7Practice 1 (lab) and Practice 2 (workshop)
8Practice 3 (clinic), online-migration drill
9Practice 4 (code katas), first quiz attempt
10Quiz remediation and mistake-log cleanup

Reference

If you need exact links into the local chunked books, use Reference and Selective Reading.


Build Your Own X — elective

The Database (Relational) tutorial builds a SQLite-clone in C: parser, planner, executor, B+ tree storage. The natural deep-dive companion to this module. See Build Your Own X overview.

Rich Learning Pages

Worked Examples | Guided Labs | Case Studies | Mistake Clinic | Reading Guide | Capstone Thread


Model Artifact Calibration

For database performance evidence, compare your analysis to the query-plan review model artifact.