Module Quiz

Complete this quiz after finishing all concept and practice pages. Mix of short-answer, written-argument, and hands-on SQL / EXPLAIN questions.

Schema used throughout the quiz:

CREATE TABLE customer(
    id           BIGSERIAL PRIMARY KEY,
    email        TEXT NOT NULL UNIQUE,
    name         TEXT NOT NULL,
    region       TEXT NOT NULL,
    signed_up_on DATE NOT NULL
);
CREATE TABLE orders(
    id          BIGSERIAL PRIMARY KEY,
    customer_id BIGINT NOT NULL REFERENCES customer(id),
    total_cents INT    NOT NULL CHECK (total_cents >= 0),
    placed_at   TIMESTAMPTZ NOT NULL
);

Question 1: Relational model vs SQL table

State two concrete ways an SQL table differs from a Codd relation, and for each give a one-sentence practical consequence.

Answer: (1) SQL tables are multisets of tuples, not sets -- so SELECT can return duplicate rows and you must add DISTINCT or GROUP BY to recover set semantics. (2) SQL columns carry NULL -- "three-valued logic" -- so WHERE col = NULL is never true, you must use IS NULL.

Question 2: Keys

Given employee(id, email, tax_id, name, dept_id), suppose id, email, and tax_id are each individually unique. Identify all superkeys, candidate keys, and a reasonable choice of primary key; justify.

Answer: Superkeys include any set containing a candidate key: {id}, {email}, {tax_id}, {id, email}, {id, name}, {id, email, tax_id, name, dept_id}, etc. Candidate keys are the minimal superkeys: {id}, {email}, {tax_id}. Primary key: id -- a surrogate that is stable (email can change; tax_id is PII).

Question 3 (HANDS-ON SQL): Top-N-per-group

Write a single PostgreSQL query that returns, for each region, the three customers with the highest total order spend in 2025, with columns region, name, total_dollars, and rank.

Answer:

WITH spend AS (
    SELECT c.region,
           c.name,
           SUM(o.total_cents) / 100.0 AS total_dollars
    FROM customer c
    JOIN orders   o ON o.customer_id = c.id
    WHERE o.placed_at >= DATE '2025-01-01'
      AND o.placed_at <  DATE '2026-01-01'
    GROUP BY c.region, c.name
),
ranked AS (
    SELECT region, name, total_dollars,
           RANK() OVER (PARTITION BY region ORDER BY total_dollars DESC) AS rank
    FROM spend
)
SELECT region, name, total_dollars, rank
FROM ranked
WHERE rank <= 3
ORDER BY region, rank;

Solution Walkthrough: aggregate spend per (region, name); rank within region with a window function; filter rank <= 3 in an outer query because WHERE cannot reference a window alias in the same level.

Question 4 (HANDS-ON SQL): Anti-join

Write a query returning customers who have never placed an order, in two different ways (NOT EXISTS and LEFT JOIN ... IS NULL). State which you prefer and why.

Answer:

-- NOT EXISTS
SELECT c.id, c.name
FROM customer c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

-- LEFT JOIN ... IS NULL
SELECT c.id, c.name
FROM customer c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE o.id IS NULL;

Preference: NOT EXISTS -- clearer intent, the planner can short-circuit, no risk of row multiplication if the left side's predicate later changes.

Question 5: Normalization

Given enrollment(student_id, student_name, course_code, course_title, grade) and FDs student_id -> student_name, course_code -> course_title, decompose to 3NF. Identify which anomalies existed and which go away.

Answer: Decompose into student(student_id PK, student_name), course(course_code PK, course_title), enrollment(student_id FK, course_code FK, grade, PRIMARY KEY(student_id, course_code)). Anomalies eliminated: update anomaly (renaming a course had to touch every enrolled row), insertion anomaly (cannot insert a course with no students), deletion anomaly (deleting last student for a course loses the course).

Question 6: Constraints

Critique this DDL and propose improvements:

CREATE TABLE account(
    id INT,
    email TEXT,
    balance NUMERIC
);

Answer: Missing PRIMARY KEY (add id BIGSERIAL PRIMARY KEY). email should be NOT NULL UNIQUE. balance should be NOT NULL with a domain constraint (e.g., CHECK (balance >= 0) if business rule allows). Consider storing money as integer cents to avoid float / numeric precision pitfalls.

Question 7 (HANDS-ON EXPLAIN): Interpret a plan

Seq Scan on orders  (cost=0.00..234_567.89 rows=50 width=...)
                    (actual time=812..820 rows=45 loops=1)
  Filter: (customer_id = 42)
  Rows Removed by Filter: 9_999_955

Explain what is happening, why, and the single-change fix.

Answer: PostgreSQL is reading every row of orders (10 M) to find 45 matching customer_id = 42. The estimate (rows=50) was close to actual (45) so stats are fine; the real issue is there is no index on orders(customer_id). Fix: CREATE INDEX ON orders(customer_id);. Expected new plan: Index Scan using orders_customer_id_idx on orders with actual time < 1 ms.

Question 8: Online migration

You need to add a NOT NULL column country to a 200 M-row customer table. Describe the phases of the online migration and the state of the application at each phase.

Answer: (1) Expand -- ADD COLUMN country CHAR(2) NULL (metadata-only in modern PostgreSQL), deploy app version that dual-writes. (2) Backfill -- loop UPDATE ... WHERE country IS NULL LIMIT 10000 until 0 rows affected. (3) Cut-over -- add CHECK (country IS NOT NULL) NOT VALID, then VALIDATE CONSTRAINT; deploy app version that writes only the new column. (4) Contract -- in a later release, convert the CHECK to a real NOT NULL and remove the old write path.

Question 9: Isolation

Describe a "lost update" scenario at READ COMMITTED and the two standard fixes.

Answer: Session A reads balance = 100; Session B reads balance = 100, subtracts 20, writes 80. Session A, unaware, subtracts 30 and writes 70 -- B's update is lost. Fixes: (i) single-statement atomic update UPDATE account SET balance = balance - 30 WHERE id = ? AND balance >= 30; (ii) SELECT ... FOR UPDATE inside a transaction to lock the row; (iii) escalate to REPEATABLE READ / SERIALIZABLE and retry on serialization errors.

Question 10: Views vs materialized views

When is a materialized view the right answer? When is it the wrong answer?

Answer: Right answer: expensive aggregations over largely-append-only data whose readers tolerate staleness bounded by the refresh cadence (e.g., daily dashboards). Wrong answer: up-to-the-second correctness (use a plain view or an index), write-heavy data where staleness is unacceptable, or when the refresh itself cannot finish within the acceptable window.

Evidence Check

Before calling yourself done with the module:

All 10 questions answered from memory, with the hands-on queries executable against a seeded PostgreSQL database.
At least one written "why my first query was wrong" correction.
A personal mistake log organized by cluster.

Question 1: Relational model vs SQL table​

Question 2: Keys​

Question 3 (HANDS-ON SQL): Top-N-per-group​

Question 4 (HANDS-ON SQL): Anti-join​

Question 5: Normalization​

Question 6: Constraints​

Question 7 (HANDS-ON EXPLAIN): Interpret a plan​

Question 8: Online migration​

Question 9: Isolation​

Question 10: Views vs materialized views​

Evidence Check​