
Producing a Design Doc Worth Reviewing

What This Concept Is

A design document is the artifact that outlives the interview and the meeting. If the previous 14 concepts are executed well but never written down, the work evaporates. This concept defines the minimum sections a senior reviewer expects and what each one is for.

A review-ready design doc has these sections, in this order:

  1. Problem and context -- one paragraph. What are we designing and why now?
  2. Requirements -- functional, non-functional with numbers, constraints.
  3. Estimates -- back-of-envelope for QPS, storage, bandwidth, memory, latency budget.
  4. Hard parts -- the 2-3 aspects of the problem that drive the design.
  5. High-level design -- the box diagram + one paragraph walking the data flow.
  6. Component deep-dives -- per box: inputs, outputs, algorithm, state, SLO.
  7. Data model -- entity sketches, primary and partitioning keys, indexes.
  8. Consistency and transactions -- per mutable table: atomicity, concurrency, consistency.
  9. Scale and failure analysis -- 10×/100× walk, failure walk per box and at AZ/region.
  10. Trade-offs -- the log from Cluster 5 concept 14.
  11. Bottlenecks and SPOFs -- ranked, with fix-now vs accept-with-reason.
  12. Open questions and follow-ups -- what we did not solve; what comes next.

Missing any of sections 2, 3, 9, 10, or 11 is a deal-breaker for a senior reviewer.
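
The section list and the deal-breaker rule above are mechanical enough to check automatically. The following is a minimal sketch, not a prescribed tool: it assumes the doc is markdown and that top-level sections are headed `## <number>. <title>` (the convention used in the skeleton later in this concept). The function name `check_sections` is hypothetical.

```python
import re

# Section numbers and titles as listed above.
EXPECTED_SECTIONS = {
    1: "Problem and context",
    2: "Requirements",
    3: "Estimates",
    4: "Hard parts",
    5: "High-level design",
    6: "Component deep-dives",
    7: "Data model",
    8: "Consistency and transactions",
    9: "Scale and failure analysis",
    10: "Trade-offs",
    11: "Bottlenecks and SPOFs",
    12: "Open questions and follow-ups",
}

# Sections whose absence is a deal-breaker for a senior reviewer.
DEAL_BREAKERS = {2, 3, 9, 10, 11}

# Assumed heading convention: "## 9. Scale and Failure Analysis".
# Sub-headings such as "### 6.1 Redirect Service" do not match.
HEADING = re.compile(r"^##\s+(\d+)\.\s+\S", re.MULTILINE)


def check_sections(doc_text):
    """Report which of the twelve sections are missing and whether they are in order."""
    found = [int(n) for n in HEADING.findall(doc_text)]
    present = set(found)
    return {
        "missing": sorted(set(EXPECTED_SECTIONS) - present),
        "deal_breakers_missing": sorted(DEAL_BREAKERS - present),
        "in_order": found == sorted(found),
    }


if __name__ == "__main__":
    sample = "# Demo\n\n## 1. Problem and Context\n## 2. Requirements\n## 3. Estimates\n"
    print(check_sections(sample))
```

A check like this only proves the headings exist; whether section 9 is actually walkable still needs a human reader.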

In the framing of Fundamentals of Software Architecture, the design doc is how architectural characteristics become organizationally durable. The characteristics you committed to in section 2 (availability, latency, throughput, durability) get quantified in section 3, pressure-tested in section 9, and defended as trade-offs in section 10. Without this chain, "we built a reliable system" is an unverifiable claim. The doc is the scaffolding that turns the claim into an audit trail.

A review-ready doc is also calibrated to its audience. The same underlying work can produce a 3-page engineering RFC, a 10-page ADR bundle, or a 30-slide VP-review deck -- the difference is compression, not content. What you cannot do is compress away sections 9-11 and still call the result a design doc; they are the sections where the design is falsifiable. A doc without them reads as a proposal, not an engineering artifact.

Why It Matters Here

A design doc is how engineering decisions compound into organizational knowledge.

  • Onboarding engineers read it to understand "why this shape".
  • Future architects refer to it when a change is proposed.
  • Incidents cite it when a failure mode is realized.
  • Audits use it to verify controls.

In S8M5 (Technical Leadership & Strategy) this doc becomes the root artifact of an ADR-style decision record. Writing it well here is how that module becomes navigable later.

Why This Concept Is SUPPORTING Rather Than PRIMARY

This concept formalizes what concepts 13 and 14 already cover operationally. It is a supporting concept because the shape of the doc follows mechanically from running the method in the first four clusters. If you have the artifacts from every earlier cluster, the doc is an editing exercise, not a design exercise. It is listed explicitly because many candidates execute the method and then fail to write down sections 9-11.

Concrete Example

Skeleton for the URL shortener design doc:

# URL Shortener -- Design Doc

## 1. Problem and Context
Users want to turn long URLs into short ones; the short link must be SMS-safe and
the redirect must be global and fast. We're designing this as a greenfield service...

## 2. Requirements
Functional:
- create(long_url, [options]) -> short_url
- redirect(short_code) -> 301 to long_url
- list-by-owner, delete, optional custom alias

Non-functional:
- 100M new short URLs/month (~40 writes/s avg, ~120 peak)
- Read:write ratio ~100:1; redirect P99 < 100 ms globally
- Availability 99.9%; durability of the long URL is not negotiable

Constraints:
- Short code <= 7 chars; non-guessable; case-sensitive
- Must fit inside existing AWS account; budget <= $X/mo

## 3. Estimates
- Writes/day = 3.3M, avg 40/s, peak 120/s
- Reads/day = 333M, avg 4K/s, peak 12K/s
- Storage = 3.3M * (8 + 200 + 50) B = ~0.85 GB/day raw (round to 1 GB), ~2 TB over 5 years
- Cache = 20% of last-30-days URLs (100M) * 250 B = ~5 GB
- Latency budget for redirect P99 = 100 ms:
edge (TLS + geo) 20 | cdn lookup 5 | app 10 | cache 2 | misc 10 | return 30 -> 77; slack 23

## 4. Hard Parts
1. Unique, non-guessable 7-char codes at 40/s avg, 120/s peak, globally.
2. P99 < 100 ms globally for reads.
3. Keeping the read path cheap even as storage grows to multiple TB.

## 5. High-Level Design
[diagram]
Requests arrive at the edge, hit the CDN (redirect-cache) first...

## 6. Component Deep-Dives
### 6.1 Redirect Service
Inputs: GET /:code
Outputs: 301 to long_url; 404 if missing
Algorithm: cache lookup -> DB lookup -> 301; async log to analytics
State: stateless
SLO: P99 < 30 ms per instance

### 6.2 ID Generator
... (continue for 3-5 boxes)

## 7. Data Model
Table: shorts
partition: short_code
columns: long_url, owner_user_id, created_at, expires_at

Table: shorts_by_owner (index-as-table)
partition: owner_user_id
clustering: created_at desc

## 8. Consistency and Transactions
- shorts: strong consistency on creation (cannot double-allocate a code)
- shorts_by_owner: eventual consistency via event stream; max staleness ~seconds
- analytics events: at-least-once delivery; de-duped by (event_id)

## 9. Scale and Failure Analysis
### 10x walk: ...
### 100x walk: ...
### Per-box failure table: ...
### AZ and region failure: ...

## 10. Trade-offs
| # | Decision | Alternative | Reason | Cost |
|---|----------|-------------|--------|------|
| 1 | Wide-column KV for shorts | Sharded MySQL | Point-lookup workload; simpler to scale | Loss of ad-hoc SQL |
| 2 | CDN for hot redirects | Origin-only | 60% of traffic is to top 1% of URLs | Short staleness on edits |
| ... |

## 11. Bottlenecks and SPOFs
Bottlenecks: (ranked) ...
SPOFs: ... (each with fix-now or accept-with-reason)

## 12. Open Questions
- Custom alias quotas and abuse mitigation
- Geo-routing during partial regional outage
- Exact retention for deleted URLs under legal hold

A reviewer should be able to read sections 1-4 in two minutes and know whether the design is worth the rest of their time.
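
The section 3 arithmetic in the skeleton can be reproduced in a few lines, which is a cheap way to catch an estimate that does not follow from the requirements. This is a sketch under stated assumptions: a 30-day month, a 3x peak factor (implied by 40/s average versus 120/s peak), and a cache sized as 20% of the URLs created in the last 30 days at 250 B each, which is the reading that yields roughly 5 GB. The function name `estimate` is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def estimate(
    new_urls_per_month=100_000_000,  # from section 2 of the skeleton
    days_per_month=30,               # assumption
    read_write_ratio=100,            # from section 2 of the skeleton
    peak_factor=3,                   # assumption: 40/s avg vs 120/s peak
    record_bytes=8 + 200 + 50,       # per-record size used in the skeleton
    cache_fraction=0.20,             # assumption: share of last-30-days URLs kept hot
    cache_entry_bytes=250,
    years=5,
):
    """Back-of-envelope numbers for the URL shortener skeleton."""
    writes_per_day = new_urls_per_month / days_per_month
    reads_per_day = writes_per_day * read_write_ratio
    storage_per_day_gb = writes_per_day * record_bytes / 1e9
    return {
        "writes_per_day": writes_per_day,
        "avg_writes_per_s": writes_per_day / SECONDS_PER_DAY,
        "peak_writes_per_s": writes_per_day / SECONDS_PER_DAY * peak_factor,
        "reads_per_day": reads_per_day,
        "avg_reads_per_s": reads_per_day / SECONDS_PER_DAY,
        "peak_reads_per_s": reads_per_day / SECONDS_PER_DAY * peak_factor,
        "storage_per_day_gb": storage_per_day_gb,
        "storage_total_tb": storage_per_day_gb * 365 * years / 1e3,
        "cache_gb": cache_fraction * new_urls_per_month * cache_entry_bytes / 1e9,
    }


def latency_slack_ms(budget_ms, hops_ms):
    """Remaining slack after summing the per-hop latency allocations."""
    return budget_ms - sum(hops_ms.values())


# Per-hop allocations copied from the skeleton's redirect latency budget.
REDIRECT_HOPS_MS = {
    "edge (TLS + geo)": 20,
    "cdn lookup": 5,
    "app": 10,
    "cache": 2,
    "misc": 10,
    "return": 30,
}

if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.2f}")
    print("latency slack (ms):", latency_slack_ms(100, REDIRECT_HOPS_MS))
```

The unrounded results land slightly below the skeleton's figures (about 39 writes/s and about 1.6 TB over 5 years); rounding up is the usual back-of-envelope convention, as long as the doc says so.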

Concrete Example 2: Reviewer's Read-Through of a Real-World Chat Doc

Take a senior reviewer reading a chat-system design doc cold. Their mental loop, section by section:

  • Section 1 (context): "What problem, and is this the right moment?" Red flag if the doc starts with the solution instead of the problem.
  • Section 2 (requirements): "Are the non-functionals numeric?" Red flag if it says "highly available" without a percentage.
  • Section 3 (estimates): "Do the numbers support the architecture on page 5?" Red flag if estimates and diagram disagree on scale.
  • Section 4 (hard parts): "What actually drove the design?" Red flag if this is missing -- it means the author has not identified what is hard, so the design is probably not optimized for it.
  • Section 5 (high-level): They spend 60 seconds here, eyeballing the box diagram. Red flag if the diagram mixes levels of abstraction (a load balancer next to a raw thread pool) or has unlabelled boxes.
  • Section 6 (deep-dives): They skim-read, looking for specific gaps -- "what is the partition key?", "what is the timeout?", "what is the SLO for this service?". Red flag if every component reads at the same depth (i.e., all shallow or all deep).
  • Section 7 (data model): Red flag if it's ER diagrams without access patterns.
  • Section 8 (consistency): Red flag if every row says "eventual consistency" or every row says "strong consistency". Either is wrong by omission.
  • Section 9 (scale/failure): This is where senior reviewers spend 30% of their reading time. Red flag if the 10× walk is missing or if "what happens if X dies" is not a literal table.
  • Section 10 (trade-offs): Red flag if fewer than 5 entries, or if any entry is missing the rejected-alternative or cost-accepted slot (see Cluster 5 concept 14).
  • Section 11 (SPOFs/bottlenecks): Red flag if nothing is listed as "accept-with-reason" -- either you've fixed everything (unlikely) or you haven't looked.
  • Section 12 (open questions): Red flag if this is empty or waved away as "TBD". A design doc with no open questions is either trivial or dishonest.

Their final verdict is driven heavily by sections 9-11. A doc with strong sections 1-4 but empty 9-11 reads as a pitch; a doc with modest sections 1-4 but thorough 9-11 reads as engineering. The latter wins in review every time.
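
A few of the red flags in this read-through are mechanical enough to lint before a human reads the doc. The sketch below covers three of them: non-numeric non-functionals (section 2), uniform consistency claims (section 8), and thin or incomplete trade-off entries (section 10). Everything here is illustrative: the `TradeOff` record, the `VAGUE_TERMS` list, and the function names are hypothetical, and the checks assume the sections have already been parsed into simple Python values.

```python
from dataclasses import dataclass


@dataclass
class TradeOff:
    decision: str
    alternative: str  # the rejected alternative
    reason: str
    cost: str         # the cost accepted


# Hypothetical list of phrases that signal an unquantified requirement.
VAGUE_TERMS = ("highly available", "scalable", "fast", "low latency")


def vague_requirements(lines):
    """Return requirement lines that use a vague term and contain no number."""
    flagged = []
    for line in lines:
        has_number = any(ch.isdigit() for ch in line)
        if not has_number and any(term in line.lower() for term in VAGUE_TERMS):
            flagged.append(line)
    return flagged


def consistency_flag(models):
    """models maps each mutable table to its consistency label."""
    distinct = set(models.values())
    if len(models) > 1 and len(distinct) == 1:
        label = next(iter(distinct))
        return f"every table is '{label}'; per-table reasoning is probably missing"
    return None


def tradeoff_flags(entries, minimum=5):
    """Flag a short trade-off log and entries missing an alternative or a cost."""
    flags = []
    if len(entries) < minimum:
        flags.append(f"only {len(entries)} trade-off entries (expected at least {minimum})")
    for i, entry in enumerate(entries, start=1):
        if not entry.alternative.strip():
            flags.append(f"entry {i}: missing rejected alternative")
        if not entry.cost.strip():
            flags.append(f"entry {i}: missing accepted cost")
    return flags


if __name__ == "__main__":
    print(vague_requirements(["Must be highly available", "Availability 99.9%"]))
    print(consistency_flag({"shorts": "strong", "shorts_by_owner": "eventual"}))
    print(tradeoff_flags([TradeOff("CDN for hot redirects", "Origin-only", "hot keys", "")]))
```

Checks like these catch omissions, not bad judgment; a doc can pass all three and still have a weak section 9.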

Common Confusion / Misconceptions

"The diagram is the doc." It is not. A diagram without text cannot be reviewed asynchronously; reviewers need the reasoning.

"The doc is marketing." A design doc is not a sales pitch for the architecture. It is a critique-friendly artifact. If you are not willing to write down what your design is bad at (sections 10 and 11), the doc is incomplete.

"We don't need a doc; we'll align in a meeting." Meetings without a pre-read are either rubber-stamp or re-design. Either outcome is worse than handing out the doc 24 hours in advance.

"Doc length implies quality." It does not. A tight 5-page doc with the twelve sections above beats a 30-page wandering doc. Aim for depth in sections 9-11; compress 1-4.

"The doc is finished once the design is approved." A design doc is a living artifact. When the implementation diverges from the design, the doc gets updated -- or the divergence gets documented as an addendum. A frozen doc that no longer matches the system is worse than no doc, because it misleads.

How To Use It

Doc-production protocol:

  1. Treat sections 1-8 as transcripts of the earlier clusters' artifacts. Copy them in with light editing.
  2. Write section 9 by running the Cluster 4 walks again with the doc in front of you.
  3. Write section 10 by reading the trade-off log from Cluster 5 concept 14.
  4. Write section 11 by consolidating the stress-test output.
  5. Write section 12 last; this is where "I didn't solve X" goes, honestly.
  6. Have one peer read sections 2, 3, 9, 10, 11 and flag anything unclear. Iterate.

Transfer / Where This Shows Up Later

  • S7M5 (ADRs and reviews) uses these sections as the skeleton for linked ADRs -- major decisions in section 10 each spawn an ADR with deeper context.
  • S8M5 (technical leadership) elevates the doc from project artifact to strategic artifact: the same twelve sections map to RFCs, tech-strategy docs, and quarterly architecture reviews.
  • S9 (cloud + DevOps) turns sections 6-9 into operational automation -- deep-dive SLOs become dashboards, failure-walk cells become chaos experiments, bottleneck entries become capacity plans.
  • S9-S10 lifecycle: the doc lives past the implementation. Every postmortem links back to the section that should have predicted the failure; every change request links back to the section that will be invalidated.
  • S10 capstone: the capstone project's deliverable is exactly this doc for a non-trivial system, reviewed end-to-end by peers and a mentor. Practicing the form here is what makes that deliverable tractable.

Check Yourself

  1. Which five sections, if missing, are deal-breakers for a senior reviewer?
  2. What is the purpose of section 12 (open questions), and why should it be specific rather than vague?
  3. Why is "estimates" (section 3) a separate section from "requirements" (section 2)?
  4. How do you update a design doc after implementation reveals a different reality? What is preserved and what is revised?

Mini Drill or Application

Take one of your fully walked-through designs from earlier clusters and produce doc sections 1-12 in under 90 minutes. Focus on:

  • section 9 being walkable by a reader who never saw your whiteboard
  • section 10 reading as decisions, not essays
  • section 11 being ranked and specific

Swap with a peer. If they cannot summarize your design in 60 seconds from the doc, sections 1-4 are too long or section 5 is unclear. Iterate.
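
For the drill's goal that section 11 be "ranked and specific", one possible way to keep yourself honest is to hold the entries as records and let a script do the ranking and the completeness check. This is only a sketch: the `Risk` record, the 1-5 impact and likelihood scales, and the impact-times-likelihood score are assumptions made for illustration, not a method prescribed by this course.

```python
from dataclasses import dataclass

DISPOSITIONS = {"fix-now", "accept-with-reason"}


@dataclass
class Risk:
    name: str
    kind: str         # "bottleneck" or "spof"
    impact: int       # hypothetical scale: 1 (minor) to 5 (full outage)
    likelihood: int   # hypothetical scale: 1 (rare) to 5 (expected)
    disposition: str  # one of DISPOSITIONS
    reason: str = ""  # required when the risk is accepted


def rank(risks):
    """Order risks by impact * likelihood, highest first (ties keep input order)."""
    return sorted(risks, key=lambda r: r.impact * r.likelihood, reverse=True)


def problems(risks):
    """Return the gaps a reviewer would flag in a section 11 list."""
    out = []
    for r in risks:
        if r.disposition not in DISPOSITIONS:
            out.append(f"{r.name}: disposition must be fix-now or accept-with-reason")
        elif r.disposition == "accept-with-reason" and not r.reason.strip():
            out.append(f"{r.name}: accepted without a reason")
    if risks and not any(r.disposition == "accept-with-reason" for r in risks):
        out.append("nothing is accept-with-reason; the list is probably incomplete")
    return out


if __name__ == "__main__":
    # Hypothetical entries for illustration only.
    risks = [
        Risk("Hot partition on a popular short code", "bottleneck", 3, 4, "fix-now"),
        Risk("Single-region analytics pipeline", "spof", 2, 2,
             "accept-with-reason", "Analytics loss does not affect redirects"),
    ]
    for r in rank(risks):
        print(r.impact * r.likelihood, r.name, r.disposition)
    print(problems(risks))
```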

Read This Only If Stuck