Skip to main content

Denormalization: When and Why

What This Concept Is

Denormalization is the deliberate introduction of redundancy into an otherwise-normalized schema, usually to reduce read cost. It trades write complexity and integrity risk for query simplicity or speed. It is not "schema design when you are lazy"; it is an engineering decision applied after normalization, with explicit justification.

Common forms:

  • Replicated columns: copy customer.name into orders.customer_name to avoid a join on every order display.
  • Precomputed aggregates / counter columns: keep post.comment_count updated by triggers or application logic so you do not COUNT(*) on every page load.
  • Materialized views (Concept 12): persist the result of a query; refresh on schedule or on write.
  • Pre-joined tables for analytics (star / snowflake schemas in a data warehouse).
  • Array / JSON columns: collapse a one-to-many into one row when the "many" side is small and rarely queried on its own.

Why It Matters Here (connection to primary)

Concept 7 taught how to normalize. This concept is the counterweight: why you would not do the cleanest thing, and how to stay safe when you do not.

The pattern is:

  1. Normalize first. Write the normalized schema. Write the queries. Measure.
  2. If reads are too expensive, try indexes and query rewrites (Clusters 4 and 5).
  3. Only then consider denormalization, and only with an explicit plan for keeping the redundant data consistent.

Concrete Example

Normalized:

CREATE TABLE post(id INT PRIMARY KEY, title TEXT, author_id INT);
CREATE TABLE comment(id INT PRIMARY KEY, post_id INT REFERENCES post(id));

SELECT p.id, p.title, COUNT(c.id) AS n_comments
FROM post p LEFT JOIN comment c ON c.post_id = p.id
GROUP BY p.id, p.title;

If your front page loads 100 posts, every visit runs a grouped outer join. On a million-comment table that may be slow.

Denormalized with a counter column:

ALTER TABLE post ADD COLUMN comment_count INT NOT NULL DEFAULT 0;

-- maintenance: increment on insert, decrement on delete
CREATE OR REPLACE FUNCTION bump_post_comment_count() RETURNS TRIGGER AS $$
BEGIN
IF (TG_OP = 'INSERT') THEN
UPDATE post SET comment_count = comment_count + 1 WHERE id = NEW.post_id;
ELSIF (TG_OP = 'DELETE') THEN
UPDATE post SET comment_count = comment_count - 1 WHERE id = OLD.post_id;
END IF;
RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER comment_count_trg
AFTER INSERT OR DELETE ON comment
FOR EACH ROW EXECUTE FUNCTION bump_post_comment_count();

Now the front page is SELECT id, title, comment_count FROM post, but you own the truth-keeping.

Common Confusion / Misconception

"Denormalize by default; joins are slow." Joins on indexed keys are usually cheap. Database engines are excellent at them. The expensive thing is usually not the join but the scan or the sort you did not notice. Fix those first.

"Once I denormalize, the redundancy is maintained automatically." Not unless you use a mechanism that enforces it (trigger, materialized view with automatic refresh, a GENERATED ... STORED column). Hand-maintained redundancy is a bug waiting to ship.

"Analytics tables should be denormalized by default." For offline analytical workloads (data warehouses), yes - star/snowflake schemas are the norm. For transactional systems (OLTP), no - denormalize only specific hot paths.

How To Use It

Denormalize only when all of these are true:

  1. You have measured the normalized path and identified a specific, reproducible bottleneck.
  2. The redundant value is derivable from other tables (never only in the denormalized column).
  3. You have a plan (trigger, generated column, scheduled job, dual-write, materialized view) to keep it consistent, and a recovery plan when it drifts.
  4. You have written a reconciliation query you run on a schedule, or at least during incident investigation.

Document every denormalization in a schema-notes file: what is duplicated, from where, by what mechanism, and how to rebuild if drift is detected.

Check Yourself

  1. Name one failure mode for a denormalized counter column and one safeguard.
  2. Why is a GENERATED ... STORED column safer than a trigger-maintained column in most cases?
  3. When is a materialized view a better choice than adding a column?

Mini Drill or Application

Take your 3NF decomposition of the booking schema from Concept 7. Identify two likely hot read paths (for example, "search bookings by hotel city over a date range"). For each:

  1. Propose one denormalization that would speed it up.
  2. Write the maintenance mechanism (trigger, generated column, or refresh job).
  3. Write the reconciliation query that would catch drift.

Read This Only If Stuck