Code Review Rubrics: Correctness, Clarity, Design, Tests
What This Concept Is
A code review is not "looking for things that feel wrong." It is walking a rubric across a change so that important categories are not missed and unimportant things do not dominate. A minimal, durable rubric has four dimensions:
- Correctness -- does this code actually do the thing? Are edge cases handled? Is concurrency / error handling correct?
- Clarity -- can the next reader understand it without a tour guide? Are names, structure, and comments honest?
- Design -- does it fit the codebase's shape? Does it introduce or relieve smells? Does it respect boundaries?
- Tests -- are there tests, at the right level, that pin down the new behavior and protect future refactors?
Google's engineering practices list a similar set (design, functionality, complexity, tests, naming, comments, style, docs) -- the four-bucket version above is the condensed shape you should carry in your head.
Why It Matters Here
Without a rubric, reviews drift toward the reviewer's pet concerns. One reviewer obsesses over naming, another over performance, another over "I would have used Strategy here." The author gets inconsistent signals, and real issues in other buckets go unchecked.
A rubric is also a humane tool. It tells the author what to expect, which prevents a review from feeling like an ambush. It also lets you triage: a PR with broken correctness should not receive fifteen nit comments about variable names.
Concrete Example
A 200-line PR adds a new InvoiceExporter. A rubric-driven review pass:
| Bucket | Question | Example finding |
|---|---|---|
| Correctness | Does it handle the null currency case we saw last quarter? | "Line 42 will NPE when currency is null; reproduced by test I pasted below." |
| Clarity | Are the names and structure honest? | "helperProcess2 does two things -- consider splitting into formatRows and writeRows." |
| Design | Does this belong in InvoiceExporter or in Invoice? | "Total-computing logic leaking into the exporter -- Invoice.total() would match the pattern used by Order." |
| Tests | Is the new behavior pinned down? | "No test for the multi-currency branch; please add one based on the fixture in test_invoice_multicurrency." |
Four comments, one per bucket, each actionable.
Common Confusion / Misconception
"Review = nitpick." A review dominated by style nits is a failed review. Style should be automated with a formatter; what the rubric protects is correctness, clarity, design, and tests.
"Review = gatekeeping." Review is a design conversation between peers, not an exam. The author is responsible for the change; the reviewer is an additional pair of eyes, not the final authority on taste.
"Review = blame." It is easy to slide into "this is wrong" rather than "here is a concrete problem I see." The rubric keeps you on evidence, not judgment.
"One rubric for every PR." Adjust depth to risk. A one-line config change does not need a design-pass interrogation. A new subsystem absolutely does.
How To Use It
Keep a checklist visible while reviewing:
- Correctness: I understand what this does; I mentally ran edge cases (empty, null, concurrency, failure).
- Clarity: names, structure, and comments are honest; no "helper" / "manager" / "util" that hides a real responsibility.
- Design: fits existing patterns; does not reintroduce known smells; boundaries are respected.
- Tests: every new branch has a test; tests are at the right level (unit vs integration); tests would catch a regression, not just a typo.
Label each comment with its bucket and whether it is nit (optional polish) or substantive (must discuss before merge). Example:
[design, substantive]
PaymentProcessornow throwsRuntimeExceptionon network errors -- this used to bePaymentError. Callers relied on the typed exception for retry. Either keep the typed exception or update the two call sites I linked.
Sample Review Artifact
A real rubric-driven pass on a 180-line PR adding InvoiceExporter:
Summary: Four comments total, one per bucket. No style nits -- those are handled by the formatter.
[correctness, substantive]
exportRowsiteratesitemsbutitemscan benullwhen the invoice has no lines (see theempty_invoice_testfixture). Suggested: default to an empty list at the top, or guard the loop. Happy to paste a repro if useful.[clarity, nit]
helperProcess2does two things -- formatting and writing. Splitting intoformatRowsandwriteRowswould make the call order obvious without reading both bodies. Not a blocker.[design, substantive] Total-computing logic now lives in
InvoiceExporter, butInvoice.total()already exists and is used byOrder. Pulling this back intoInvoicekeeps the "exporter only formats" boundary intact and avoids two sources of truth for totals.[tests, substantive] New multi-currency branch has no test. The existing
test_invoice_multicurrencyfixture covers the shape you need -- adding one assertion there would pin the behavior. Without a test this will regress silently.
Notice what is missing: no "I would have done it this way," no style debate, no ambush. Four labelled, evidence-based comments, each of which the author can act on in minutes.
Check Yourself
- Which bucket is most likely to be missed when a reviewer is in a hurry?
- Why is it better to label a comment "nit" than to leave it unlabelled?
- Give a PR size or risk where the rubric depth should decrease.
- Why should style comments almost never appear in a human review?
- A PR is well-tested and clearly written, but the design choice is "fine, not great." Substantive or nit?
Mini Drill or Application
Given a 150-line PR introducing a new report format, list one expected comment per bucket. Then list two cases where you would defer a bucket (and say why).
Then take a real PR from your own repo (or a public one) and re-read it with the rubric, one bucket at a time. Count how many of your instinctive comments collapse into nits and how many are genuinely substantive. Most reviewers discover that half their comments were style noise -- the formatter should have caught those, not you.