The SQL Logical Pipeline: SELECT/FROM/WHERE/GROUP BY/HAVING/ORDER BY

What This Concept Is

SQL is read top-down but evaluated in a different order. The logical evaluation pipeline is:

FROM / JOIN - build the input relation by combining tables.
WHERE - filter row-by-row on predicates that do not need groups.
GROUP BY - partition the input into groups.
HAVING - filter groups using aggregates or constants.
SELECT - project the columns, evaluating expressions and aggregates.
DISTINCT - deduplicate the projection (if requested).
ORDER BY - order the final result (on columns or expressions available after SELECT).
LIMIT / OFFSET - take a slice of the ordered result.

Knowing this order explains errors that otherwise look arbitrary:

WHERE avg(salary) > 100 - illegal: aggregates do not exist at WHERE time.
SELECT dept, SUM(salary) AS total ORDER BY total - legal: ORDER BY happens after SELECT, so the alias is visible.
SELECT dept, COUNT(*) FROM emp GROUP BY dept HAVING COUNT(*) > 5 - the HAVING clause is the only place the count is computable.

Why It Matters Here

half of all SQL errors and every "column does not exist" surprise traces back to reading SELECT ... FROM ... in source order instead of evaluation order
the pipeline lines up one-to-one with relational algebra: FROM is product/join, WHERE is selection, GROUP BY+HAVING is grouping with selection, SELECT is projection, ORDER BY is a non-relational post-pass
you will read EXPLAIN plans in Cluster 5, and plan trees mirror the pipeline

Concrete Example

Setup:

CREATE TABLE sale (
    id         SERIAL PRIMARY KEY,
    region     TEXT,
    amount     NUMERIC,
    sold_at    DATE
);
INSERT INTO sale(region, amount, sold_at) VALUES
    ('EU',  100, '2025-01-05'),
    ('EU',  200, '2025-02-12'),
    ('US',  150, '2025-01-20'),
    ('US',  900, '2025-03-15'),
    ('APAC', 50, '2025-01-30');

Question: "For regions whose 2025 Q1 total is over 200, list the region and total, highest first."

SELECT region, SUM(amount) AS total_q1
FROM sale
WHERE sold_at BETWEEN DATE '2025-01-01' AND DATE '2025-03-31'
GROUP BY region
HAVING SUM(amount) > 200
ORDER BY total_q1 DESC;

Expected output:

 region | total_q1
--------+----------
 US     | 1050
 EU     |  300

The WHERE filter drops APAC's February rows out of the running before grouping; HAVING drops any region whose post-group total is not over 200; ORDER BY runs last, so it can use the total_q1 alias defined in SELECT.

Common Confusion / Misconception

"WHERE and HAVING are interchangeable." Only sometimes. Use WHERE for row-level filters (no aggregates allowed). Use HAVING for group-level filters (aggregates allowed). Putting a row-level filter in HAVING is legal but inefficient; putting an aggregate in WHERE is an error.

"SELECT comes first because we read it first." It is evaluated almost last. That is why column aliases from SELECT are visible in ORDER BY but not in WHERE (in standard SQL; PostgreSQL allows it in GROUP BY and HAVING too, but you should not rely on that).

"GROUP BY auto-collapses any row." It collapses rows within each group to a single output row. Every column in SELECT must be either in GROUP BY or inside an aggregate function. Anything else is ambiguous.

How To Use It

For every SQL query you write:

Mentally run the six steps in order, and pause at each step to describe the intermediate relation's schema and cardinality.
If an error says "column does not exist" or "not a group column," trace back to the step where that attribute was supposed to be introduced.
Prefer row-level filtering in WHERE (earlier) over HAVING (later) whenever the predicate does not need aggregates; this hints the optimizer.

Check Yourself

Why is SELECT dept, AVG(salary) FROM emp WHERE AVG(salary) > 100000 GROUP BY dept an error, and what is the minimal fix?
In a query with both WHERE and HAVING, which one fires first, and why does it matter for performance?
Can ORDER BY reference a column not listed in SELECT? Under what conditions?

Mini Drill or Application

Using the sale table above, write one query (no subqueries) that returns:

for each region,
the number of sales and the average amount,
only for regions with at least two sales,
ordered by region.

Then annotate every clause with the pipeline step it corresponds to, and predict the output shape before running it.

What This Concept Is​

Why It Matters Here​

Concrete Example​

Common Confusion / Misconception​

How To Use It​

Check Yourself​

Mini Drill or Application​

Read This Only If Stuck​