Identifying the Hard Parts

What This Concept Is

In a typical system design, roughly 80% of the surface area is boilerplate: CRUD endpoints, auth, request logging, rate limits, stateless app servers. The remaining 20% is where the interesting design lives. Identifying the hard parts means naming, before you draw anything, the two or three specific aspects of this problem that do not follow from a template.

Hard parts usually fall into one of these categories:

Data shape and volume: huge hot keys, multi-PB storage, high-fanout writes, high-fanout reads.
Latency budget: sub-100 ms globally, tight real-time deadlines, or long critical paths through multiple services.
Consistency and ordering: strong consistency on a partitioned dataset, causal ordering under concurrent edits, exactly-once semantics.
Skew and fairness: hot celebrities, top-K leaderboards, uneven shard load, abuse and spam.
Failure and durability: cannot lose writes, must survive AZ loss, ordering must hold across failovers.
Cost structure: what would bankrupt a naïve implementation.

Your job in this step is to produce a ranked list of 2-3 of these, before you start designing.

Another lens: the hard parts are where trade-offs stop cancelling out. On the easy 80%, you can pick any reasonable option (Postgres or MySQL; any of three caching libraries) and get comparable results. On the hard 20%, choosing A over B has compound consequences -- it rules out later options, forces specific failure modes, and locks in a cost curve. Richards & Ford call these architecturally significant decisions; the rest are implementation decisions. The goal of this step is to route scarce review time to the first kind.

Why It Matters Here

Hard parts are where the trade-offs live. The rest of the design can be templated.

If you do not name the hard parts first:

You spend the deep dive on the easy parts and run out of time on the hard ones.
You pick a storage model that cannot accommodate the hard part (e.g., a strongly consistent store that must absorb a celebrity fan-out).
The interviewer asks "what's hard about this for you?" and you have no answer.
Your stress test in Cluster 4 re-discovers them under pressure instead of before.

Naming the hard parts is how you control the rest of the session.

Concrete Example: News Feed

Prompt: "Design a news feed." (Twitter/Facebook style.)

Boilerplate parts (easy, <10 min total across the whole design):

Users, posts, follows: standard entities in any OLTP store.
Auth: OIDC or session cookies; assume solved.
Stateless app servers behind a load balancer: obvious.
Image and video storage: object store (S3) with a CDN; solved.

Hard parts (your ranked list):

Fan-out on write vs fan-out on read under skew. Fanning out every post from a celebrity with 50 M followers synchronously at write time is not viable, while a pure fan-out-on-read approach makes every feed load slow. The design must handle both ends.
Feed ranking on a 1 B-user graph with freshness and engagement signals, at sub-200 ms latency, on a dataset that is constantly changing.
Write amplification and storage cost from fan-out: 1 post × 500 followers means 500 timeline entries; across 10 B posts/day that is 5 trillion rows/day.
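
The write-amplification figure in the last item can be checked with a few lines of arithmetic. This is a back-of-envelope sketch: the post volume and follower count come from the list above, while the 16-byte row size is an assumption made only to get a storage number.

```python
# Back-of-envelope write amplification for fan-out on write.
POSTS_PER_DAY = 10_000_000_000   # 10 B posts/day (figure from the text)
AVG_FOLLOWERS = 500              # timeline entries created per post (from the text)
BYTES_PER_TIMELINE_ROW = 16      # ASSUMPTION: 8-byte user id + 8-byte post id

rows_per_day = POSTS_PER_DAY * AVG_FOLLOWERS
rows_per_second = rows_per_day / 86_400
bytes_per_day = rows_per_day * BYTES_PER_TIMELINE_ROW

print(f"timeline rows/day: {rows_per_day:.1e}")              # 5.0e+12
print(f"timeline rows/sec: {rows_per_second:.1e}")           # 5.8e+07
print(f"raw storage/day:   {bytes_per_day / 1e12:.0f} TB")   # 80 TB
```

Tens of millions of timeline writes per second, before replication or indexes, is what makes this a cost-structure hard part rather than a schema detail.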

Now you have a design problem to solve, not a vocabulary test. Cluster 3 (deep dive), Cluster 4 (stress test), and Cluster 5 (trade-offs) will all hinge on these three.

Compare: "Design a URL shortener." The hard parts are different:

Generating a short, non-guessable identifier that is unique at 10⁸+ scale without coordination.
Reads dominate writes by 100:1; latency target is under 100 ms globally; cache is the architecture.
Storage is small (a few TB over years), but the redirect hot path must never block on a cache miss.
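
To see why the first item is hard, consider the simplest coordination-free approach: draw random codes from a CSPRNG. The sketch below is illustrative only; the 62-symbol alphabet, the code lengths, and the function names are assumptions, not part of the prompt.

```python
import math
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 URL-safe symbols
CODE_LENGTH = 8                                   # ASSUMPTION: default code length

def new_short_code(length: int = CODE_LENGTH) -> str:
    """Return a random, non-sequential code drawn from the OS CSPRNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

def collision_probability(n_codes: int, length: int = CODE_LENGTH) -> float:
    """Birthday-bound approximation: P(at least one collision among n_codes)."""
    keyspace = len(ALPHABET) ** length
    return 1.0 - math.exp(-n_codes * (n_codes - 1) / (2 * keyspace))

if __name__ == "__main__":
    print(new_short_code())
    for length in (8, 11):
        p = collision_probability(10**8, length)
        print(f"length={length}: P(collision among 1e8 codes) ~ {p:.2g}")
```

Under these assumptions, 8-character random codes are almost certain to collide somewhere among 10⁸ entries, while 11 characters bring the probability down to roughly one in ten thousand. Either way the design still needs a uniqueness check at insert time or a longer code, which is exactly the trade-off this hard part forces you to discuss.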

Same methodology, different hard parts.

Concrete Example 2: Ride-Sharing Backend

Prompt: "Design Uber/Lyft matching and trip backend."

Boilerplate: user accounts, card-on-file, trip history CRUD, review storage, standard mobile notifications. All boring.

Hard parts:

Geospatial proximity search at 10 K writes/sec of driver locations, with 1-second freshness, across a global user base. A naïve (lat, lng) index collapses; you need geo-sharding (H3 hexagons or S2 cells), driver-location streams in an in-memory grid, and a fast top-K by distance per request.
Matching consistency: two riders must not be promised the same driver; a driver cannot be double-booked. This is a real-time concurrency problem on a partitioned, continuously changing dataset.
Surge pricing correctness under bursty demand: price must reflect current supply/demand in a region, update within seconds, and be auditable (post-trip disputes, regulatory review).
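
The in-memory grid from the first item can be sketched as follows. This is a minimal single-process illustration: a square latitude/longitude grid stands in for H3 or S2 cells, and the class name, cell size, and 3×3 neighbour search are all assumptions. It ignores longitude wrap-around and cell distortion near the poles.

```python
import heapq
import math
from collections import defaultdict

CELL_DEG = 0.01  # ASSUMPTION: cell size in degrees (about 1.1 km of latitude)

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

class DriverGrid:
    """Hypothetical grid index: cell -> {driver_id: (lat, lng)}."""

    def __init__(self):
        self.cells = defaultdict(dict)
        self.driver_cell = {}

    @staticmethod
    def cell_of(lat, lng):
        return (math.floor(lat / CELL_DEG), math.floor(lng / CELL_DEG))

    def update(self, driver_id, lat, lng):
        """Apply one location write, moving the driver between cells if needed."""
        new_cell = self.cell_of(lat, lng)
        old_cell = self.driver_cell.get(driver_id)
        if old_cell is not None and old_cell != new_cell:
            self.cells[old_cell].pop(driver_id, None)
        self.cells[new_cell][driver_id] = (lat, lng)
        self.driver_cell[driver_id] = new_cell

    def nearest(self, lat, lng, k=3):
        """Top-K closest drivers, scanning only the rider's cell and its 8 neighbours."""
        cx, cy = self.cell_of(lat, lng)
        candidates = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for driver_id, (dlat, dlng) in self.cells.get((cx + dx, cy + dy), {}).items():
                    candidates.append((haversine_km(lat, lng, dlat, dlng), driver_id))
        return heapq.nsmallest(k, candidates)

grid = DriverGrid()
grid.update("d1", 37.7750, -122.4190)
grid.update("d2", 37.7805, -122.4105)
grid.update("d3", 40.0000, -100.0000)   # far away, never scanned for this rider
print([d for _, d in grid.nearest(37.7749, -122.4194, k=2)])  # ['d1', 'd2']
```

The point of the sketch is the cost shape: each request touches a bounded number of cells regardless of how many drivers exist globally, which is what a plain (lat, lng) index cannot offer.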

Notice how the hard parts route directly to specific later decisions: #1 drives the data model in Cluster 3; #2 drives the consistency contract; #3 drives the event-processing pipeline and the durability tier. Without this list, each of those decisions would be guessed in isolation.
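
For hard part #2, the consistency contract boils down to an atomic "claim if unclaimed" operation on each driver. The sketch below shows that contract in a single process; the `DriverClaims` class is hypothetical, and the `threading.Lock` is only a stand-in for whatever conditional write or compare-and-set the real datastore provides.

```python
import threading

class DriverClaims:
    """Hypothetical claim table: a driver can be held by at most one rider."""

    def __init__(self):
        self._lock = threading.Lock()   # stand-in for an atomic conditional write
        self._claimed_by = {}           # driver_id -> rider_id

    def try_claim(self, driver_id, rider_id):
        """Atomically claim the driver; return False if someone else holds it."""
        with self._lock:
            if driver_id in self._claimed_by:
                return False
            self._claimed_by[driver_id] = rider_id
            return True

    def release(self, driver_id, rider_id):
        """Release the claim, but only if this rider is the current holder."""
        with self._lock:
            if self._claimed_by.get(driver_id) == rider_id:
                del self._claimed_by[driver_id]

def race(num_riders=50):
    """Let many riders race for one driver and collect each attempt's outcome."""
    claims = DriverClaims()
    outcomes = []

    def attempt(i):
        outcomes.append(claims.try_claim("driver-7", f"rider-{i}"))

    threads = [threading.Thread(target=attempt, args=(i,)) for i in range(num_riders)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes

print(race().count(True))  # 1: exactly one rider wins the driver
```

In a real system the interesting questions are where this atomic step lives when driver state is partitioned, and what happens to a claim when its holder crashes; the sketch only pins down the invariant being protected.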

Common Confusion / Misconceptions

"Everything is hard." No. If you cannot rank which part is hardest, you have not framed the problem. Interviewers mark this as "spent time everywhere, solved nothing deeply".

"The hard part is the technology." The hard part is almost never the technology choice. "Should we use Kafka or Kinesis" is not a hard part; "is the stream ordered per key, and how do we recover if the consumer falls behind" is a hard part.

"I will find the hard parts as I go." You will find some. You will miss the others until the reviewer asks. Fifteen minutes of up-front analysis saves forty minutes of flailing later.

"Hard parts are always performance-related." Often not. Data-privacy constraints, regulatory audit trails, multi-tenancy isolation, and cross-region data-residency can be the hardest part of a system that has modest QPS. "Per-tenant encryption-at-rest with per-tenant key rotation" is architecturally significant regardless of load.

How To Use It

After framing and estimation, spend five minutes on this exercise:

Write down the numbers from Cluster 1. Which of them cross an order-of-magnitude threshold that normal templates do not handle?
Ask: where is the data skewed? Any hot key, hot user, hot region, or hot tenant?
Ask: where is the fan-out? 1 write causes how many effective writes? 1 read causes how many effective reads?
Ask: where is the ordering / consistency contract subtle? What happens if two updates race?
Ask: where is the durability bar highest, and where is it lowest? (Not all data is equal.)
Pick 2-3. Rank them. State why each is hard in one sentence.
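
The skew question in this checklist can be made concrete with a quick measurement over whatever per-key counts are available. The function below is a rough sketch; the name `skew_report` and the sample counts are made up for illustration.

```python
from collections import Counter

def skew_report(requests_by_key):
    """Return (hottest key, its share of traffic, ratio to a uniform share)."""
    total = sum(requests_by_key.values())
    uniform_share = 1 / len(requests_by_key)
    key, count = requests_by_key.most_common(1)[0]
    share = count / total
    return key, share, share / uniform_share

# Hypothetical per-user read counts: one celebrity and two ordinary users.
sample = Counter({"celebrity": 9000, "u1": 500, "u2": 500})
key, share, ratio = skew_report(sample)
print(f"{key}: {share:.0%} of traffic, {ratio:.1f}x a uniform share")
```

A ratio near 1 suggests the template (hash-shard and move on) will hold; a ratio far above 1 is a candidate for the ranked list.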

This becomes the agenda for your deep dive.

Transfer / Where This Shows Up Later

Cluster 3 (deep dive) is agenda-driven by this list: the components you zoom into are the ones where the hard parts live.
Cluster 4 (stress test) targets the hard parts first -- that is where 10×/100× failures will surface.
S8M4 (scale/reliability/performance) works exclusively on hard parts. Performance tuning on easy parts is wasted effort.
S8M5 (technical leadership) uses "what is hard?" as the opening question of every architecture review; it is the single highest-signal prompt you can give a team.
S9/S10 interviews and capstone reviews grade candidates heavily on whether they named the hard parts before drawing. A design that solves the easy parts elegantly while ignoring the hard ones is the archetypal mid-level failure mode.

Check Yourself

In a rate limiter, what is the hard part: the counter store, the distribution of counters, or the exact-limit semantics under concurrent requests? Why?
In a distributed cache, what is the hard part: eviction, consistency with the source-of-truth, or hot keys?
In a chat system, what is the hard part: storing messages, delivering them in order, or fan-out to offline devices?
Give an example of an architecturally significant hard part that has nothing to do with QPS.

For each, argue the ranking in one sentence.

Mini Drill or Application

For each prompt, produce in five minutes a ranked list of 2-3 hard parts, each with a one-sentence reason:

Design a global leaderboard for a mobile game with 100 M DAU.
Design a multi-region write-replicated key-value store.
Design a real-time collaborative document editor.
Design a webhook delivery service for 100 K tenants with per-tenant retry policies.

Then identify which of the hard parts is the one you would deep-dive first and why.

Read This Only If Stuck

System Design Primer: How to approach a system design interview -- the "Step 1" framing prompt maps onto this concept.
System Design Primer: System design interview questions -- prompt catalog for practising hard-part identification.
Fundamentals: Identifying architectural characteristics -- ranking characteristics is the same move at the whole-system level.
Fundamentals: Analyzing trade-offs -- why hard parts are exactly where trade-offs concentrate.
Fundamentals: Architectural thinking -- the mindset of scanning for architectural significance.
ByteByteGo: A framework for system design interviews (Alex Xu) -- the Step 3 "design deep dive" section describes exactly the hard-part zoom.
Martin Fowler -- Catalog of Patterns of Distributed Systems -- recognizing known patterns is how senior engineers quickly isolate what is actually hard versus what has a named solution.
High Scalability -- case studies are a library of "what turned out to be the hard part" stories.
