Vertical vs Horizontal Scaling: The When and Why

What This Concept Is

Two ways to make a system serve more load:

Vertical scaling ("scale up"): give the existing machine more resources - more CPU cores, more RAM, faster disk, bigger NIC. One bigger box.
Horizontal scaling ("scale out"): run more machines in parallel behind a load balancer. Many smaller boxes.

A service is scalable when adding resources produces proportional capacity. Werner Vogels's 2006 definition (still the industry reference) is exact: "a service is scalable if when we increase the resources in a system, it results in increased performance in a manner proportional to resources added." Vogels adds two properties most teams forget: (1) scaling can mean serving more units of work or larger units of work (data size, not just request rate); (2) "adding resources to facilitate redundancy does not result in a loss of performance" is itself a scalability requirement - if replication slows you down, you are not scalable in the full sense.
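A minimal sketch of how "proportional to resources added" can be turned into a number. The function name and the sample measurements are illustrative assumptions, not part of Vogels's text: the result is the fraction of the ideal proportional gain that was actually achieved, so 1.0 means perfectly proportional scaling.

```python
def scaling_efficiency(base_resources, base_throughput, new_resources, new_throughput):
    """Fraction of the ideal (proportional) gain actually achieved."""
    resource_ratio = new_resources / base_resources
    throughput_ratio = new_throughput / base_throughput
    return throughput_ratio / resource_ratio


# Illustrative measurements (assumed): doubling cores from 8 to 16
# raises throughput from 1,000 to 1,800 req/s.
efficiency = scaling_efficiency(8, 1000, 16, 1800)
print(f"scaling efficiency: {efficiency:.2f}")
```

An efficiency that keeps falling as resources are added is the practical signal that a ceiling is approaching.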

Vertical scaling is what you do first; horizontal scaling is what you do when the first stops working. The two are not interchangeable; they have different cost curves, different failure modes, different operational costs, and different software preconditions.

Vogels's second warning applies directly: "scalability cannot be an after-thought. It requires applications and platforms to be designed with scaling in mind." Bolting horizontal scaling onto an architecture that was not designed for it is usually the single largest engineering cost in growing infrastructure.

Why It Matters Here

Almost every architecture conversation at scale is really a conversation about which parts of the system can be scaled horizontally and which are stuck with vertical scaling (or worse: can't be scaled at all without a redesign).

The practical rule: the stateless tier scales horizontally; the stateful tier scales vertically until you shard. Rewriting a stateful system to partition cleanly is the costliest form of the retrofit Vogels warns about, so early choices about statelessness (next concept) have outsized effect.

This concept also connects directly to Amdahl/USL: horizontal scaling's ceiling is dictated by the serial fraction and the coherence penalty. Vertical scaling's ceiling is dictated by the largest machine your cloud sells you. And both ceilings arrive sooner than most plans assume - on AWS in 2025 one of the largest EC2 instances (u7i-12tb.224xlarge, a memory-optimized High Memory type rather than a general-purpose one) has 896 vCPUs and 12 TiB RAM; you will pay tens of thousands of dollars per month for it; and when it reboots, your service goes with it.
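The horizontal ceiling can be sketched with the Universal Scalability Law, C(N) = N / (1 + alpha(N-1) + beta*N(N-1)), where alpha is the contention (serial) fraction and beta is the coherence penalty; with beta = 0 it reduces to Amdahl's law. The alpha and beta values below are assumptions chosen for illustration, not measurements of any real system.

```python
import math


def usl_capacity(n, alpha, beta):
    """Relative capacity of n nodes (1 node = 1.0) under the USL.

    alpha: contention / serial fraction; beta: coherence penalty.
    beta = 0 gives Amdahl's law.
    """
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))


def usl_peak_nodes(alpha, beta):
    """Node count at which USL capacity peaks (requires beta > 0)."""
    return math.sqrt((1 - alpha) / beta)


# Assumed parameters: 5% contention, 0.1% coherence penalty.
ALPHA, BETA = 0.05, 0.001
for nodes in (1, 2, 10, 30, 60, 100):
    print(nodes, round(usl_capacity(nodes, ALPHA, BETA), 2))
print("capacity peaks near", round(usl_peak_nodes(ALPHA, BETA)), "nodes")
```

Past the peak, adding nodes reduces total capacity, which is why the coherence term matters more than the serial fraction for large fleets.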

Vogels also highlights heterogeneity as a scaling hazard: "resources in the system increase in diversity as next generations of hardware come on line." A horizontally scaled fleet drifts over time - new instances, new CPU families, new kernels. Load balancers that assume homogeneous backends (e.g., round-robin with identical timeouts) silently underperform when half the fleet is on a faster generation.
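A rough illustration of the heterogeneity problem, assuming a hypothetical four-box fleet in which the newer generation handles 1.5x the load of the older one. The capacities and the traffic figure are invented for the sketch; the point is only the difference between an even split and a capacity-weighted split.

```python
def utilization(total_rps, capacities, weights):
    """Per-backend utilization when traffic is split in proportion to weights."""
    total_weight = sum(weights)
    return [total_rps * w / total_weight / c for w, c in zip(weights, capacities)]


# Hypothetical fleet: two older boxes (1,000 req/s each), two newer (1,500 req/s each).
capacities = [1000, 1000, 1500, 1500]
total_rps = 4000

round_robin = utilization(total_rps, capacities, [1, 1, 1, 1])
weighted = utilization(total_rps, capacities, capacities)

print("round-robin:", [round(u, 2) for u in round_robin])
print("weighted:   ", [round(u, 2) for u in weighted])
```

Under the even split the older boxes are saturated while the newer ones still have headroom; weighting by capacity brings every box to the same utilization.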

Concrete Example

A SaaS application is running on one 8-core / 32GB virtual machine and handling 1,000 req/s at 60% CPU.

Vertical path.

Step 1: upgrade to 16-core / 64GB. Probably get ~1,800 req/s (not 2,000, because of Amdahl - some of the work is single-threaded).
Step 2: upgrade to 64-core / 256GB. Maybe 4,000 req/s. Cost has grown roughly 8x; capacity 4x.
Step 3: you have now bought the biggest machine the cloud offers. You are at the ceiling.
Failure mode: that one box is a single point of failure. When it reboots, you are down. Maintenance windows require the whole service to be taken offline.
Deploy mode: every deploy is a restart of the one box. There is no rolling deploy; there is a small outage every release.

Horizontal path.

Step 1: put the 8-core / 32GB box behind a load balancer. Add a second identical box. Now ~1,900 req/s (small coordination overhead).
Step 2: add eight more. ~8,000 req/s assuming the service is stateless and the DB can keep up. The Amdahl/USL analysis from the previous concept governs where this stops.
Cost: 10 * $small is usually cheaper than 1 * $big. Cloud pricing is roughly linear in vCPUs for small/medium instances and superlinear for the largest ones.
Failure mode: one box dying removes 10% of capacity, not 100%. If latency p99 is OK at 90% capacity, nobody gets paged.
Deploy mode: rolling deploy is free - deploy 1 of N, verify, deploy the rest. A bad deploy affects 1/N of traffic for the duration of the canary.
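The two paths can be compared with a few lines of arithmetic. This sketch uses the example's own numbers and assumes, for simplicity, that cost is expressed in units of one 8-core box and grows linearly with core count (so the 64-core box costs 8 units and ten small boxes cost 10 units). The function names are hypothetical.

```python
def cost_per_krps(relative_cost, req_per_s):
    """Cost units paid per 1,000 req/s of capacity."""
    return relative_cost / (req_per_s / 1000)


def capacity_after_one_failure(instances, total_req_per_s):
    """Capacity left when one of `instances` identical boxes dies."""
    return total_req_per_s * (instances - 1) / instances


# Vertical end state: one 64-core box, ~4,000 req/s, 8 cost units.
# Horizontal end state: ten 8-core boxes, ~8,000 req/s, 10 cost units.
print("vertical cost per 1k req/s:  ", cost_per_krps(8, 4000))
print("horizontal cost per 1k req/s:", cost_per_krps(10, 8000))
print("vertical after one failure:  ", capacity_after_one_failure(1, 4000))
print("horizontal after one failure:", capacity_after_one_failure(10, 8000))
```

Under these assumptions the horizontal fleet pays less per unit of capacity and keeps 90% of it after a single failure, while the vertical box keeps none.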

Data-tier corollary. In the same app, the database went from 1 primary to 1 primary + 2 replicas to sharded by customer_id across 4 shards. Each step reduced one form of contention: read load first (replicas absorb reads), then write contention (shards split the write path). Sharding is horizontal scaling for data; it imports its own cost (re-sharding is expensive, cross-shard queries are expensive).
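A minimal sketch of routing by customer_id, assuming the simplest possible scheme: a stable hash taken modulo the shard count. The function names are hypothetical, and real systems often use consistent hashing or a lookup table instead, precisely because of the re-sharding cost that the second function measures.

```python
import hashlib

NUM_SHARDS = 4  # matches the four shards in the example


def shard_for(customer_id, num_shards=NUM_SHARDS):
    """Map a customer to a shard using a hash that is stable across processes."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_fraction(customer_ids, old_shards, new_shards):
    """Fraction of customers whose shard changes when the shard count changes."""
    moved = sum(
        1 for c in customer_ids
        if shard_for(c, old_shards) != shard_for(c, new_shards)
    )
    return moved / len(customer_ids)


customers = [f"customer-{i}" for i in range(10_000)]
print("shard for customer-42:", shard_for("customer-42"))
print("moved when going 4 -> 5 shards:", moved_fraction(customers, 4, 5))
```

With plain modulo hashing, going from 4 to 5 shards relocates about four in five customers in expectation, which is one concrete reason re-sharding is expensive.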

Cost-curve arithmetic. On AWS (2025 on-demand), a c7i.large (2 vCPU, 4 GiB) is roughly $0.085/hr; a c7i.48xlarge (192 vCPU, 384 GiB) is roughly $8.16/hr. For the same vCPU-hour, the big box costs about 1.0x the small box ($0.0425 per vCPU-hour in both cases) - essentially linear in vCPU for on-demand until you reach the metal instances (c7i.metal-48xl, $8.16/hr), after which you pay for reserved capacity premiums. Horizontal is not always cheaper per core; the savings come from packing density, spot instances, and the ability to mix instance families. The right analysis is "total cost of SLO delivery over 6 months," not "price per vCPU today."
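The per-vCPU comparison is easy to reproduce. This sketch uses only the two approximate hourly prices quoted above; they are the text's rough figures, not live pricing, and the function name is hypothetical.

```python
def price_per_vcpu_hour(hourly_price, vcpus):
    """Hourly price divided by vCPU count."""
    return hourly_price / vcpus


small = price_per_vcpu_hour(0.085, 2)    # c7i.large figure from the text
big = price_per_vcpu_hour(8.16, 192)     # c7i.48xlarge figure from the text
ratio = big / small

print(f"small: ${small:.4f}/vCPU-hr, big: ${big:.4f}/vCPU-hr, ratio: {ratio:.2f}x")
```

At these figures the ratio comes out at roughly 1.0, which is the sense in which on-demand pricing within one family is close to linear in vCPU. Running the same calculation across the whole instance ladder is a quick way to check where, if anywhere, the curve bends.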

Common Confusion / Misconception

"Horizontal scaling is always better." Not always. Vertical scaling is simpler to operate (one machine, no load-balancing, no data distribution, no cache coherency), and for small/medium services it is cheaper per request. The right answer for a 100 req/s internal tool may be "one big box, nightly backup." The right answer for a 100,000 req/s public API is "many small boxes across three AZs with anycast DNS."

"We'll just scale horizontally when we need to." That works only if the service is stateless. If the service holds in-memory session data, local caches, or a local database, "just add more" means "silently produce incorrect behavior." The precondition for cheap horizontal scaling is statelessness (next concept) and a data layer that was already designed to partition.
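A toy reproduction of the "silently incorrect" failure, under loud assumptions: the Instance class, the round-robin balancer, and the shared dictionary standing in for an external session store are all hypothetical simplifications, not a real framework.

```python
import itertools


class Instance:
    """An app server that keeps sessions locally unless a shared store is injected."""

    def __init__(self, name, session_store=None):
        self.name = name
        self.sessions = session_store if session_store is not None else {}

    def login(self, session_id, user):
        self.sessions[session_id] = user

    def whoami(self, session_id):
        return self.sessions.get(session_id)


def login_then_request(instances):
    """Send a login and a follow-up request through a naive round-robin balancer."""
    balancer = itertools.cycle(instances)
    next(balancer).login("s1", "alice")
    return next(balancer).whoami("s1")


# Each instance holds its own sessions: the second request lands on a box
# that has never seen the login.
local_state = login_then_request([Instance("a"), Instance("b")])

# Both instances use one external store (a dict here, standing in for a
# session service): the second request finds the session.
shared_store = {}
external_state = login_then_request(
    [Instance("a", shared_store), Instance("b", shared_store)]
)

print("local sessions: ", local_state)
print("shared sessions:", external_state)
```

No error is raised in the first case; the user simply appears logged out, which is why this class of bug tends to surface only after the second instance is added.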

"The cloud makes this trivial - auto-scaling handles it." Auto-scaling handles the mechanics of adding/removing identical stateless instances. It does not handle the architectural preconditions. An auto-scaler pointed at a stateful tier is an outage generator.

"Vertical scaling has no architectural cost - just pick a bigger instance." False. Larger machines span more NUMA nodes, so memory-locality effects get more complex. Larger heap sizes tax your GC. Larger network interfaces can expose assumptions about single-threaded packet processing. A 96-core box is not a drop-in replacement for a 16-core box.

"Horizontal scaling needs 3 AZs, end of story." Nearly true, but not the whole story. Three AZs plus shuffle-sharding (randomly assigning customers to subsets of your fleet) gives much stronger blast-radius guarantees than a flat 3-AZ fleet. AWS uses this pattern for Route 53 explicitly. The AKF Scale Cube's Z-axis (split by customer) is the generalization.
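A minimal sketch of the shuffle-sharding idea, not AWS's implementation: each customer is deterministically assigned a small subset of workers, and the number of distinct subsets determines how unlikely it is that two customers share exactly the same one. The worker names, fleet size, and shard size are illustrative assumptions.

```python
import hashlib
import math
import random


def shuffle_shard(customer_id, workers, shard_size):
    """Deterministically pick `shard_size` workers for a customer."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return tuple(sorted(rng.sample(workers, shard_size)))


def distinct_shards(num_workers, shard_size):
    """Number of different worker subsets of the given size."""
    return math.comb(num_workers, shard_size)


workers = [f"worker-{i}" for i in range(8)]  # hypothetical fleet of 8
print("customer-a ->", shuffle_shard("customer-a", workers, 2))
print("customer-b ->", shuffle_shard("customer-b", workers, 2))
print("distinct shards:", distinct_shards(len(workers), 2))
```

With 8 workers and shards of 2 there are 28 possible subsets, so a customer whose traffic takes down both of its workers fully overlaps with only a small fraction of other customers; a flat fleet would expose everyone.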

"Auto-scaling removes the need to plan for peaks." Auto-scalers have lag (30s-5min to bring capacity online), quotas (cloud account limits, instance-type availability), and a cold-cache penalty on every new instance. Plan for the peak anyway; auto-scaling is how you recover from the mispredicted peak, not how you avoid planning.
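The lag can be turned into a headroom estimate with simple arithmetic: whatever traffic arrives during the lag has to be absorbed by capacity that is already running. The growth rate and per-instance capacity below are assumed values, and the sketch ignores the cold-cache penalty and quota limits mentioned above.

```python
import math


def headroom_instances(growth_rps_per_s, lag_s, per_instance_rps):
    """Extra instances that must already be running to absorb growth during scaler lag."""
    return math.ceil(growth_rps_per_s * lag_s / per_instance_rps)


# Assumed: traffic ramps by 20 req/s every second; each instance serves 800 req/s.
for lag in (30, 300):  # the 30 s and 5 min ends of the lag range
    print(f"lag {lag:>3}s -> {headroom_instances(20, lag, 800)} spare instance(s)")
```

Under these assumptions a 30-second lag needs one spare instance and a 5-minute lag needs eight, which is the difference between a rounding error and a line item in the capacity plan.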

How To Use It

For every tier of your system, ask:

Is this tier stateless? If yes, horizontal is almost certainly the right answer. If no, why not? Can the state be moved out (to a session store, cache, or DB)?
What does the cost curve look like at 2x, 10x, 100x load? Vertical scaling costs jump superlinearly near the top of the instance-type ladder. Plot actual cloud prices against vCPU count - the curve is not a straight line.
What is the blast radius of one failed instance? Vertical: the whole service. Horizontal: 1/N of capacity, briefly.
What is the operational cost? Horizontal requires a load balancer, health checks, auto-scaling, and a deploy pipeline that rolls. Vertical requires careful capacity planning and large-VM reboot choreography.
How does redundancy interact with scaling? Per Vogels, adding replicas for reliability should not hurt performance. If your plan is "scale vertically and keep a cold standby," you have not horizontally scaled - you have paid for two giant boxes and can only use one at a time.
Identify the "cannot be scaled further" component. Usually the primary DB, a shared cache cluster, a queue broker, or a single-writer index. Plan the redesign (sharding, read replicas, eventual consistency) before you hit the wall, not after.

Check Yourself

Why is the cost per request usually lower on many medium instances than on one large one?
Name two concrete operational costs horizontal scaling adds that vertical scaling does not.
A database primary is the classic "cannot scale horizontally without effort" example. What specific property makes that true?
Vogels says adding redundancy should not cost performance. Give one concrete pattern where redundancy does cost performance and explain how to avoid it.
Why does heterogeneity (different instance generations in the same fleet) break naive load-balancing assumptions?
Give one workload where vertical scaling is the correct long-term answer and explain why horizontal is not worth the complexity.
What does "shuffle-sharding" add to a flat 3-AZ horizontally-scaled fleet, and why does AWS Route 53 use it?

Mini Drill or Application

Take a service you know. For each tier (web, API, cache, database, queue), write one sentence of the form "This tier is currently scaled _, can be scaled _, and the thing blocking further horizontal scaling is _." If the third blank is vague, you have not yet identified the real constraint. Now write the "cost curve" for one tier: what does 2x, 10x, 100x load cost in the current architecture vs the redesigned one? The delta is the engineering budget you can justify.

Extension drill. For the single most stateful tier in the system you picked (usually the primary DB), walk through three horizontal-scaling options in order of effort: (1) read replicas, (2) functional decomposition (split into two DBs along a bounded context), (3) sharding by a stable key. For each, write the one-line cost, the one-line benefit, and the one-line failure mode you introduce. Three paragraphs; 30 minutes. The output is a defensible roadmap for the next 12 months of data-tier work.

Transfer / Where This Shows Up Later

This choice shapes everything downstream. The "stateless tier horizontal, stateful tier vertical until you shard" rule is one of the most durable pieces of advice in the module, but its implications appear repeatedly:

This module, concept 05 (statelessness): the precondition that makes horizontal scaling actually work. If your tier is not stateless, "scale horizontally" is not a plan.
This module, concept 06 (caching): caches are how stateful data tiers survive horizontal scaling on the read path; sharding is how they survive on the write path.
This module, concept 12 (capacity planning): your cost curve depends on which path you chose. A 10x vertical plan and a 10x horizontal plan have different slopes and different break-even points.
S8 M2 (microservices): decomposition is the architectural move that lets you scale different parts of the business differently - cart horizontally, ledger vertically, search sharded.
S8 M3 (event-driven): async messaging is the lubricant that keeps horizontal scaling honest across the stateful boundary; without it, every write is an N-to-1 synchronous call on the one shared DB.
S9 M3 (Kubernetes / container orchestration): HPA is auto-horizontal, VPA is auto-vertical; both are mechanisms, not strategies, and running both on the same pod without thought is a classic self-inflicted wound.
S10 M3 (capstone deploy): "how do we handle 10x traffic?" is a capstone-review question. The credible answer names the tier, the axis, and the constraint.
S10 M4 (operational readiness): the reviewer will ask "what is the largest single failure domain," and the answer that earns approval is bounded by the horizontal-scaling decisions you made here.

The underlying pattern: scaling decisions set the shape of everything downstream. A team that defaults to "make it bigger" without asking "can we split it" is a team that will hit the ceiling suddenly; a team that over-splits without cause is a team that pays for distribution on a workload that did not need it. Calibrate.

One leadership move worth naming: when presenting a scaling plan, always include the cheapest next step (often a vertical bump plus a targeted cache) as the comparison point against the architecturally-pure redesign. The cheap step buys time; the redesign buys ceiling. Choose knowingly.

A final calibration: the teams that scale best almost always do the cheap step first while starting the redesign in parallel so the redesign lands before the next ceiling. This is a roadmap shape, not a single ticket, and it belongs in the module's leadership cluster as much as here.

Read This Only If Stuck

Local chunks (book anchors)

System Design Primer: Performance vs Scalability -- the short form; keep it in mind when someone says "performance" and means "scalability" (or vice versa).
System Design Primer: Load Balancer -- the mechanism that makes horizontal scaling observable to clients.
System Design Primer: Reverse Proxy -- the layer above the load balancer; shapes how horizontal fleets present a single URL.
System Design Primer: Application Layer and Microservices -- what "stateless app tier" actually looks like when decomposed.
System Design Primer: Database federation and sharding -- horizontal scaling for stateful tiers; read the "joins across shards" section in full before proposing this.
System Design Primer: Database RDBMS Replication -- read scaling without sharding; a near-universal first step.
FoSA: Cross-Cutting Architecture Characteristics -- scalability and elasticity as cross-cutting; the vertical-vs-horizontal choice lives here.
FoSA: Space-Based Architecture Style -- a style whose entire point is to make every tier horizontally scalable by pushing data into replicated in-memory grids.

External canonical references

Werner Vogels, A Word on Scalability (2006) -- the definitional piece; still the most precise two pages on the topic.
Martin Abbott and Michael Fisher, The Scale Cube (AKF Partners) -- the X/Y/Z axis decomposition of scaling (replicate, split by function, split by customer). Every sharding plan is a Z-axis move.
Marc Brooker, Scaling Laws and Vertical Scaling -- why the "just use a bigger machine" instinct is often right for small/medium systems and wrong for large ones, from an AWS principal engineer.
AWS Builders' Library, Workload isolation using shuffle-sharding -- the horizontal-scale-plus-isolation pattern AWS uses inside Route 53 and other tier-1 services.
Martin Kleppmann, Designing Data-Intensive Applications, ch. 5-6 (replication and partitioning) -- the book chapters to read before you sign off on any sharded-database plan.
Google SRE Workbook, Managing Load -- the operational contract for serving load across a horizontal fleet.
