The SQL Logical Pipeline: SELECT/FROM/WHERE/GROUP BY/HAVING/ORDER BY
What This Concept Is
SQL is read top-down but evaluated in a different order. The logical evaluation pipeline is:
FROM/JOIN- build the input relation by combining tables.WHERE- filter row-by-row on predicates that do not need groups.GROUP BY- partition the input into groups.HAVING- filter groups using aggregates or constants.SELECT- project the columns, evaluating expressions and aggregates.DISTINCT- deduplicate the projection (if requested).ORDER BY- order the final result (on columns or expressions available after SELECT).LIMIT/OFFSET- take a slice of the ordered result.
Knowing this order explains errors that otherwise look arbitrary:
WHERE avg(salary) > 100- illegal: aggregates do not exist atWHEREtime.SELECT dept, SUM(salary) AS total ORDER BY total- legal:ORDER BYhappens afterSELECT, so the alias is visible.SELECT dept, COUNT(*) FROM emp GROUP BY dept HAVING COUNT(*) > 5- theHAVINGclause is the only place the count is computable.
Why It Matters Here
- half of all SQL errors and every "column does not exist" surprise traces back to reading
SELECT ... FROM ...in source order instead of evaluation order - the pipeline lines up one-to-one with relational algebra:
FROMis product/join,WHEREis selection,GROUP BY+HAVINGis grouping with selection,SELECTis projection,ORDER BYis a non-relational post-pass - you will read
EXPLAINplans in Cluster 5, and plan trees mirror the pipeline
Concrete Example
Setup:
CREATE TABLE sale (
id SERIAL PRIMARY KEY,
region TEXT,
amount NUMERIC,
sold_at DATE
);
INSERT INTO sale(region, amount, sold_at) VALUES
('EU', 100, '2025-01-05'),
('EU', 200, '2025-02-12'),
('US', 150, '2025-01-20'),
('US', 900, '2025-03-15'),
('APAC', 50, '2025-01-30');
Question: "For regions whose 2025 Q1 total is over 200, list the region and total, highest first."
SELECT region, SUM(amount) AS total_q1
FROM sale
WHERE sold_at BETWEEN DATE '2025-01-01' AND DATE '2025-03-31'
GROUP BY region
HAVING SUM(amount) > 200
ORDER BY total_q1 DESC;
Expected output:
region | total_q1
--------+----------
US | 1050
EU | 300
The WHERE filter drops APAC's February rows out of the running before grouping; HAVING drops any region whose post-group total is not over 200; ORDER BY runs last, so it can use the total_q1 alias defined in SELECT.
Common Confusion / Misconception
"WHERE and HAVING are interchangeable." Only sometimes. Use WHERE for row-level filters (no aggregates allowed). Use HAVING for group-level filters (aggregates allowed). Putting a row-level filter in HAVING is legal but inefficient; putting an aggregate in WHERE is an error.
"SELECT comes first because we read it first." It is evaluated almost last. That is why column aliases from SELECT are visible in ORDER BY but not in WHERE (in standard SQL; PostgreSQL allows it in GROUP BY and HAVING too, but you should not rely on that).
"GROUP BY auto-collapses any row." It collapses rows within each group to a single output row. Every column in SELECT must be either in GROUP BY or inside an aggregate function. Anything else is ambiguous.
How To Use It
For every SQL query you write:
- Mentally run the six steps in order, and pause at each step to describe the intermediate relation's schema and cardinality.
- If an error says "column does not exist" or "not a group column," trace back to the step where that attribute was supposed to be introduced.
- Prefer row-level filtering in
WHERE(earlier) overHAVING(later) whenever the predicate does not need aggregates; this hints the optimizer.
Check Yourself
- Why is
SELECT dept, AVG(salary) FROM emp WHERE AVG(salary) > 100000 GROUP BY deptan error, and what is the minimal fix? - In a query with both
WHEREandHAVING, which one fires first, and why does it matter for performance? - Can
ORDER BYreference a column not listed inSELECT? Under what conditions?
Mini Drill or Application
Using the sale table above, write one query (no subqueries) that returns:
- for each region,
- the number of sales and the average amount,
- only for regions with at least two sales,
- ordered by region.
Then annotate every clause with the pipeline step it corresponds to, and predict the output shape before running it.