Subqueries, CTEs, and Window Functions

What This Concept Is

SQL composes queries in three ways. Each solves a class of problems the others cannot solve cleanly.

Subqueries - a SELECT inside another statement. Three flavors:

Scalar subquery: returns one column, one row. Usable anywhere a value is expected. SELECT name, (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.id) AS n FROM customer c.
IN / EXISTS subquery: returns a set for membership tests (semi/anti joins).
Derived table (subquery in FROM): returns a relation, given a name with AS. SELECT x.region, x.total FROM (SELECT region, SUM(amount) AS total FROM sale GROUP BY region) x WHERE x.total > 1000.

CTEs (Common Table Expressions) - WITH name AS (SELECT ...) SELECT .... Named subqueries that:

make long queries readable by pulling each step out;
can chain: WITH a AS (...), b AS (... FROM a) SELECT ... FROM b;
can be recursive (WITH RECURSIVE) for hierarchies and graph walks;
can always be inlined into the main query (modulo recursion and materialization hints).

Window functions - aggregates that do not collapse rows. SELECT name, salary, AVG(salary) OVER (PARTITION BY dept) AS dept_avg FROM employee. For each row, the window specification defines a frame of other rows; the function is computed over that frame. Common functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM, AVG, NTILE.

Why It Matters Here

any "top N per group" or "percentage of group total" question is a one-liner with window functions and a pyramid of pain without them
CTEs are the only reasonable way to write a query more than 20 lines long
recursive CTEs are how you walk trees and graphs inside the database, without an application round-trip per level
scalar subqueries in SELECT are easy to write and easy to make slow; knowing when they become correlated loops is part of the Cluster-5 performance story

Concrete Example

Schema:

CREATE TABLE employee(id INT, dept TEXT, salary NUMERIC);
INSERT INTO employee VALUES
  (1,'Eng',120000), (2,'Eng',95000), (3,'Eng',150000),
  (4,'Sales',110000), (5,'Sales',90000);

Question: "For each department, list employees ranked by salary and each one's share of the department total."

With a subquery per row (ugly):

SELECT e.id, e.dept, e.salary,
       (SELECT SUM(salary) FROM employee e2 WHERE e2.dept = e.dept) AS dept_total
FROM employee e;

With a CTE + window functions (clean):

WITH ranked AS (
    SELECT id, dept, salary,
           RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS dept_rank,
           SUM(salary) OVER (PARTITION BY dept) AS dept_total
    FROM employee
)
SELECT id, dept, salary, dept_rank,
       ROUND(100.0 * salary / dept_total, 1) AS pct_of_dept
FROM ranked
ORDER BY dept, dept_rank;

Expected output:

 id | dept  | salary | dept_rank | pct_of_dept
----+-------+--------+-----------+-------------
| Eng   | 150000 |         1 | 41.1
| Eng   | 120000 |         2 | 32.9
| Eng   |  95000 |         3 | 26.0
| Sales | 110000 |         1 | 55.0
| Sales |  90000 |         2 | 45.0

Recursive CTE example - walk an org chart from the CEO down:

WITH RECURSIVE tree AS (
    SELECT id, manager_id, name, 1 AS depth FROM employee WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id, e.manager_id, e.name, t.depth + 1
    FROM employee e JOIN tree t ON e.manager_id = t.id
)
SELECT * FROM tree ORDER BY depth, name;

Common Confusion / Misconception

"A CTE always materializes." In PostgreSQL 12+ the optimizer may inline non-recursive CTEs. If you need materialization as an optimization fence, write WITH x AS MATERIALIZED (...).

"Window functions replace GROUP BY." They are complementary. GROUP BY collapses rows into one per group. Windows add a column computed over a frame while keeping all rows.

"Correlated subqueries are always slow." Not always; they can become nested-loop joins or be transformed by the optimizer into semi-joins. But any correlated subquery inside a SELECT list deserves a second look (see Concept 14 on N+1).

How To Use It

When writing a non-trivial query:

Write the algebra sketch first (Concept 3).
If any step is naturally "compute X once per group but keep the rows," use a window function.
If the query has more than two steps, use a CTE per step. Readability beats cleverness.
Use correlated subqueries only when an equivalent join or semi-join would distort cardinality.
Reach for recursive CTEs for tree/graph walks; use an iteration fence (depth < N) to avoid runaways on cyclic data.

Check Yourself

Rewrite "top 3 highest-paid employees per department" using ROW_NUMBER and a CTE.
Give one case where a scalar subquery is cleaner than a join.
What is the difference between RANK, DENSE_RANK, and ROW_NUMBER when salaries tie?

Mini Drill or Application

Given sale(region, amount, sold_at):

Using window functions, produce a running year-to-date total per region by sale date.
Using a CTE, produce the top-2 highest-grossing days per region.
Using WITH RECURSIVE, given parent(child, parent), produce the full ancestor chain for a given node.

What This Concept Is​

Why It Matters Here​

Concrete Example​

Common Confusion / Misconception​

How To Use It​

Check Yourself​

Mini Drill or Application​

Read This Only If Stuck​