Subqueries, CTEs, and Window Functions
What This Concept Is
SQL composes queries in three ways. Each solves a class of problems the others cannot solve cleanly.
Subqueries - a SELECT inside another statement. Three flavors:
- Scalar subquery: returns one column, one row. Usable anywhere a value is expected.
SELECT name, (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.id) AS n FROM customer c. IN/EXISTSsubquery: returns a set for membership tests (semi/anti joins).- Derived table (subquery in
FROM): returns a relation, given a name withAS.SELECT x.region, x.total FROM (SELECT region, SUM(amount) AS total FROM sale GROUP BY region) x WHERE x.total > 1000.
CTEs (Common Table Expressions) - WITH name AS (SELECT ...) SELECT .... Named subqueries that:
- make long queries readable by pulling each step out;
- can chain:
WITH a AS (...), b AS (... FROM a) SELECT ... FROM b; - can be recursive (
WITH RECURSIVE) for hierarchies and graph walks; - can always be inlined into the main query (modulo recursion and materialization hints).
Window functions - aggregates that do not collapse rows. SELECT name, salary, AVG(salary) OVER (PARTITION BY dept) AS dept_avg FROM employee. For each row, the window specification defines a frame of other rows; the function is computed over that frame. Common functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM, AVG, NTILE.
Why It Matters Here
- any "top N per group" or "percentage of group total" question is a one-liner with window functions and a pyramid of pain without them
- CTEs are the only reasonable way to write a query more than 20 lines long
- recursive CTEs are how you walk trees and graphs inside the database, without an application round-trip per level
- scalar subqueries in
SELECTare easy to write and easy to make slow; knowing when they become correlated loops is part of the Cluster-5 performance story
Concrete Example
Schema:
CREATE TABLE employee(id INT, dept TEXT, salary NUMERIC);
INSERT INTO employee VALUES
(1,'Eng',120000), (2,'Eng',95000), (3,'Eng',150000),
(4,'Sales',110000), (5,'Sales',90000);
Question: "For each department, list employees ranked by salary and each one's share of the department total."
With a subquery per row (ugly):
SELECT e.id, e.dept, e.salary,
(SELECT SUM(salary) FROM employee e2 WHERE e2.dept = e.dept) AS dept_total
FROM employee e;
With a CTE + window functions (clean):
WITH ranked AS (
SELECT id, dept, salary,
RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS dept_rank,
SUM(salary) OVER (PARTITION BY dept) AS dept_total
FROM employee
)
SELECT id, dept, salary, dept_rank,
ROUND(100.0 * salary / dept_total, 1) AS pct_of_dept
FROM ranked
ORDER BY dept, dept_rank;
Expected output:
id | dept | salary | dept_rank | pct_of_dept
----+-------+--------+-----------+-------------
3 | Eng | 150000 | 1 | 41.1
1 | Eng | 120000 | 2 | 32.9
2 | Eng | 95000 | 3 | 26.0
4 | Sales | 110000 | 1 | 55.0
5 | Sales | 90000 | 2 | 45.0
Recursive CTE example - walk an org chart from the CEO down:
WITH RECURSIVE tree AS (
SELECT id, manager_id, name, 1 AS depth FROM employee WHERE manager_id IS NULL
UNION ALL
SELECT e.id, e.manager_id, e.name, t.depth + 1
FROM employee e JOIN tree t ON e.manager_id = t.id
)
SELECT * FROM tree ORDER BY depth, name;
Common Confusion / Misconception
"A CTE always materializes." In PostgreSQL 12+ the optimizer may inline non-recursive CTEs. If you need materialization as an optimization fence, write WITH x AS MATERIALIZED (...).
"Window functions replace GROUP BY." They are complementary. GROUP BY collapses rows into one per group. Windows add a column computed over a frame while keeping all rows.
"Correlated subqueries are always slow." Not always; they can become nested-loop joins or be transformed by the optimizer into semi-joins. But any correlated subquery inside a SELECT list deserves a second look (see Concept 14 on N+1).
How To Use It
When writing a non-trivial query:
- Write the algebra sketch first (Concept 3).
- If any step is naturally "compute X once per group but keep the rows," use a window function.
- If the query has more than two steps, use a CTE per step. Readability beats cleverness.
- Use correlated subqueries only when an equivalent join or semi-join would distort cardinality.
- Reach for recursive CTEs for tree/graph walks; use an iteration fence (
depth < N) to avoid runaways on cyclic data.
Check Yourself
- Rewrite "top 3 highest-paid employees per department" using
ROW_NUMBERand a CTE. - Give one case where a scalar subquery is cleaner than a join.
- What is the difference between
RANK,DENSE_RANK, andROW_NUMBERwhen salaries tie?
Mini Drill or Application
Given sale(region, amount, sold_at):
- Using window functions, produce a running year-to-date total per region by sale date.
- Using a CTE, produce the top-2 highest-grossing days per region.
- Using
WITH RECURSIVE, givenparent(child, parent), produce the full ancestor chain for a given node.
Read This Only If Stuck
- Silberschatz: Nested subqueries, part 1
- Silberschatz: Nested subqueries, part 2
- Silberschatz: Nested subqueries, part 3
- Silberschatz: Recursive queries, part 1
- Silberschatz: Advanced aggregation features, part 1 (windows)
- Silberschatz: Advanced aggregation features, part 2
- PostgreSQL docs: Window functions tutorial
- PostgreSQL docs: WITH queries (CTEs)