Regions, Availability Zones, and Failure Domains

What This Concept Is

A region is a geographic area (for example, us-east-1 in Virginia, europe-west1 in Belgium, australiaeast in Sydney). Regions are isolated from each other by design: a failure in us-east-1 should not propagate to eu-west-1. Each region has its own control plane and its own blast radius, and sits in its own legal jurisdiction.

An availability zone (AZ) is a logically separate datacenter or cluster of datacenters inside a region, with its own power, cooling, and network. A region typically has 3-6 AZs. Two AZs in the same region are close enough for low-latency synchronous replication (usually <2 ms) but far enough that a single flood, power event, or hardware batch should not take both down.

The failure-domain hierarchy is roughly:

  • host - a single physical server (often your smallest implicit failure domain for a VM)
  • rack - a cabinet of hosts sharing power and a top-of-rack switch
  • AZ - an isolated datacenter unit
  • region - a whole geographic area
  • provider - the rarest but not zero case (a control-plane bug, a global routing incident, a bad DNS push)

Cross-provider names differ but the shape matches:

  • AWS: Region -> Availability Zone (3-6 per region) -> rack/host
  • GCP: Region -> Zone (usually 3 per region) -> cluster/rack
  • Azure: Region -> Availability Zone (3 per "zonal" region) -> rack

Designing in the cloud means deciding, for each workload, how many of these failure domains it must survive.
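
A minimal Python sketch of that decision, assuming each replica of a workload is described by a (region, az, host) tuple. The Replica type and the largest_survivable_domain function are illustrative names, not any provider's API, and the rack level is left out because providers rarely expose it.

```python
from typing import NamedTuple, Optional, Sequence

class Replica(NamedTuple):
    region: str
    az: str
    host: str

# Failure domains ordered from smallest to largest blast radius.
LEVELS = ("host", "az", "region")

def _domain_key(replica: Replica, level: str) -> tuple:
    """Identify which failure domain of the given level a replica sits in."""
    if level == "host":
        return (replica.region, replica.az, replica.host)
    if level == "az":
        return (replica.region, replica.az)
    return (replica.region,)

def largest_survivable_domain(replicas: Sequence[Replica]) -> Optional[str]:
    """Return the largest level whose single failure leaves a replica running.

    Returns None when one host failure takes the whole workload down.
    """
    survivable = None
    for level in LEVELS:
        domains = {_domain_key(r, level) for r in replicas}
        # Losing one domain is survivable only if replicas span at least two.
        if len(domains) >= 2:
            survivable = level
        else:
            break
    return survivable

single = [Replica("us-east-1", "us-east-1a", "h1")]
multi_az = [Replica("us-east-1", az, "h1")
            for az in ("us-east-1a", "us-east-1b", "us-east-1c")]
print(largest_survivable_domain(single))    # None
print(largest_survivable_domain(multi_az))  # az
```

The sketch only counts placement. It says nothing about whether the surviving replicas have enough capacity or up-to-date data.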

Why It Matters Here

Every later decision in this semester assumes you know what a region and an AZ are:

  • a VPC is scoped to a region; each of its subnets is scoped to a single AZ (Cluster 3)
  • a load balancer can distribute traffic across AZs; a bad config can strand you in one (Cluster 3)
  • S3 is region-scoped; EBS volumes are AZ-scoped; you cannot attach an EBS volume across AZs (Cluster 4)
  • RDS Multi-AZ is a different thing from RDS cross-region replication, and they cost different amounts (Cluster 4)
  • data egress between regions is expensive; between AZs it is still metered (Cluster 4)
  • IAM, Route 53, and CloudFront are global services, but their control-plane writes anchor in us-east-1 - a bad day there affects "global" services (Cluster 5)

If you cannot name your region and AZs, you cannot reason about failure, latency, cost, or compliance.

Concrete Example

You deploy a single EC2 instance in us-east-1a. A storm knocks out the power in that one AZ.

  • the instance is gone until power returns
  • any EBS volume attached to it is inaccessible, because EBS is AZ-scoped
  • the other AZs in the region (us-east-1b, us-east-1c, and so on) are unaffected
  • regional services such as S3, and global services such as Route 53 and IAM, are unaffected
  • no other region even notices

If you instead ran an autoscaling group across 1a, 1b, and 1c behind an Application Load Balancer, the ALB's health checks would fail the 1a target, and traffic would continue flowing through 1b and 1c while 1a recovers. This is the minimum "AZ-resilient" design.
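
The behavior described above can be sketched as a toy model in Python. This is not the ALB's actual routing algorithm: the Target class, the mark_az_down and route functions, and the instance IDs are all hypothetical, and real load balancers apply their own health-check and routing logic.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Dict, List

@dataclass
class Target:
    instance_id: str
    az: str
    healthy: bool = True

def mark_az_down(targets: List[Target], az: str) -> None:
    """Simulate an AZ outage: every target in that AZ fails its health check."""
    for t in targets:
        if t.az == az:
            t.healthy = False

def route(targets: List[Target], n_requests: int) -> Dict[str, int]:
    """Round-robin n_requests across healthy targets; count requests per AZ."""
    healthy = [t for t in targets if t.healthy]
    if not healthy:
        raise RuntimeError("no healthy targets: the workload is down")
    counts: Dict[str, int] = {}
    for _, t in zip(range(n_requests), cycle(healthy)):
        counts[t.az] = counts.get(t.az, 0) + 1
    return counts

targets = [
    Target("i-a", "us-east-1a"),
    Target("i-b", "us-east-1b"),
    Target("i-c", "us-east-1c"),
]
mark_az_down(targets, "us-east-1a")
print(route(targets, 100))  # {'us-east-1b': 50, 'us-east-1c': 50}
```

With a single target in us-east-1a, the same outage would leave no healthy targets and route would raise, which is the single-instance scenario above.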

For "region-resilient," you need a second region (different datacenters, different power, different legal jurisdiction) and some way to route traffic there - typically Route 53 failover routing, Global Accelerator, or another DNS-based scheme. The cost is real: data replication across regions is billed per GB, and running active-active ("hot-hot") in two regions roughly doubles your infrastructure bill.

Cross-provider sanity check. GCP offers regional resources (load balancers, Cloud SQL HA with two zones) and multi-regional (Cloud Storage "multi-region buckets" stored across several regions transparently). Azure exposes three flavors: zonal services (pinned to a single AZ), zone-redundant services (replicated across AZs transparently), and regional services (no zonal awareness). The mental model - "which failure does this specific resource survive?" - generalizes even when the labels don't.

Before you commit, run aws ec2 describe-availability-zones --region us-east-1 --all-availability-zones (or the GCP equivalent, gcloud compute zones list) and save the output. Treat the AZ list as you would a subnet allocation - a design artifact, not a runtime surprise.
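
One way to turn that saved output into a compact artifact, sketched in Python. It assumes the output was saved as JSON and relies on the AvailabilityZones, ZoneName, ZoneId, ZoneType, and State fields of the response. The summarize_zones name is illustrative, and the sample data is made up for the example.

```python
import json
from typing import Dict, List

def summarize_zones(raw_json: str) -> List[Dict[str, str]]:
    """Keep only standard AZs and the fields worth recording in a design doc."""
    zones = json.loads(raw_json)["AvailabilityZones"]
    return sorted(
        (
            {"name": z["ZoneName"], "id": z["ZoneId"], "state": z["State"]}
            for z in zones
            # --all-availability-zones also lists Local Zones and Wavelength Zones.
            if z.get("ZoneType", "availability-zone") == "availability-zone"
        ),
        key=lambda z: z["name"],
    )

# Illustrative sample only; real name-to-ID pairs differ per account.
sample = json.dumps({
    "AvailabilityZones": [
        {"ZoneName": "us-east-1b", "ZoneId": "use1-az1",
         "ZoneType": "availability-zone", "State": "available"},
        {"ZoneName": "us-east-1a", "ZoneId": "use1-az4",
         "ZoneType": "availability-zone", "State": "available"},
        {"ZoneName": "us-east-1-bos-1a", "ZoneId": "use1-bos1-az1",
         "ZoneType": "local-zone", "State": "available"},
    ]
})
for zone in summarize_zones(sample):
    print(zone["name"], zone["id"], zone["state"])
```

Recording the zone ID next to the zone name keeps the artifact meaningful when it is shared with someone working in a different account.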

Common Confusion / Misconception

"Multi-AZ means multi-region." No. Multi-AZ is cheap, low-latency, and survives a datacenter outage. Multi-region is expensive, higher-latency, and survives a whole regional outage. They solve different problems.

"An AZ letter is stable across accounts." No. AWS maps AZ names (us-east-1a) to physical AZs per account to balance load. Your 1a is not necessarily the same physical AZ as another account's 1a. When you need to pin (for example, to match another account's zone on a shared VPC), use AZ IDs (e.g., use1-az1).
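
A small Python sketch of matching zones across accounts by AZ ID. The two name-to-ID mappings are invented for illustration; in practice each would come from running describe-availability-zones in the respective account.

```python
from typing import Dict

def same_physical_az(name_in_a: str, map_a: Dict[str, str], map_b: Dict[str, str]) -> str:
    """Return account B's AZ name for the physical AZ account A calls name_in_a."""
    zone_id = map_a[name_in_a]
    # Invert account B's name -> ID mapping so we can look up by AZ ID.
    id_to_name_b = {zid: name for name, zid in map_b.items()}
    return id_to_name_b[zone_id]

# Hypothetical mappings: the same letters point at different physical AZs.
account_a = {"us-east-1a": "use1-az1", "us-east-1b": "use1-az2"}
account_b = {"us-east-1a": "use1-az2", "us-east-1b": "use1-az1"}
print(same_physical_az("us-east-1a", account_a, account_b))  # us-east-1b
```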

"Global services are free of region concerns." No. Services like IAM, Route 53, CloudFront, and Cloud DNS serve requests from a globally distributed data plane, but their control plane still lives somewhere. IAM's writes on AWS anchor in us-east-1 - if that region is sick, new policy edits anywhere may be slow or fail. S3 is a related trap: bucket names are globally unique, but the data and its durability are bound to one region.

"More AZs is always better." Not always. Adding a third AZ after two clearly improves the availability of stateless services. Adding a fourth rarely pays back, because the provider engineers for roughly N=3 failure independence - and your per-AZ costs keep growing roughly linearly.
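
One way to see the diminishing return is simple capacity arithmetic, sketched here in Python. The sketch assumes load is spread evenly across AZs and that full capacity must remain after losing any one AZ. The function name is illustrative, and per-AZ fixed costs are ignored.

```python
def overprovision_factor(n_azs: int) -> float:
    """Fleet size, as a multiple of peak load, needed to survive one AZ loss.

    With load spread evenly over n AZs, losing one leaves (n - 1) / n of the
    fleet, so the fleet must be sized at n / (n - 1) times peak load.
    """
    if n_azs < 2:
        raise ValueError("a single-AZ deployment cannot survive an AZ loss")
    return n_azs / (n_azs - 1)

for n in (2, 3, 4):
    print(n, round(overprovision_factor(n), 2))
# 2 2.0
# 3 1.5
# 4 1.33
```

Going from two AZs to three cuts the required spare capacity from 100% to 50% of peak; going from three to four only cuts it from 50% to about 33%.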

Gotcha: AZ capacity is finite. During the opening hours of a regional incident, other AZs in the same region see a stampede and may temporarily refuse new instance launches. Your ASG with MinHealthyPercentage=100 may silently fail to recover. Test for this by forcing a scale-out with a capacity-constrained instance type; the failure mode is important to know before the real incident.

How To Use It

For every workload, before you write any config:

  1. Name the region and write down why (latency to users, compliance, service availability, cost, feature completeness - some services launch in us-east-1 months before they land elsewhere).
  2. Decide the failure domain the workload must survive: single-AZ, multi-AZ, or multi-region. Write the RTO/RPO targets next to that decision.
  3. For each stateful component (database, cache, object store), confirm whether its replication story matches that goal.
  4. For each stateless component, confirm it can be scheduled in any AZ (no hidden AZ affinity in storage).
  5. Write down what happens during a single-AZ failure, and run a tabletop test of it.
  6. For multi-region designs, write down the promotion procedure and test it - cross-region failover that has never been rehearsed usually fails when used.
  7. Audit the region list against data-residency rules: which regions are you allowed to put this data in?
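
Steps 2 to 4 can be captured as a small check, sketched here in Python. The Component record, the scope labels, and the example design are hypothetical. The scope attached to each component is the designer's own claim and still needs to be verified against provider documentation.

```python
from dataclasses import dataclass
from typing import List

# Replication scopes, ordered from narrowest to widest.
SCOPE_RANK = {"single-az": 0, "multi-az": 1, "multi-region": 2}

@dataclass(frozen=True)
class Component:
    name: str
    scope: str  # one of the SCOPE_RANK keys

def gaps(components: List[Component], target: str) -> List[str]:
    """Names of components whose replication scope falls short of the target."""
    needed = SCOPE_RANK[target]
    return [c.name for c in components if SCOPE_RANK[c.scope] < needed]

# Hypothetical design for a simple web app.
design = [
    Component("app servers (ASG across 3 AZs)", scope="multi-az"),
    Component("database (single instance)", scope="single-az"),
    Component("object store", scope="multi-az"),
]
print(gaps(design, "multi-az"))  # ['database (single instance)']
```

Anything the check returns is a component to revisit before the tabletop test in step 5.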

Check Yourself

  1. Why can you not attach one EBS volume to instances in two different AZs?
  2. An RDS instance is set to "Multi-AZ." Does that protect you from the region-wide failure of us-east-1?
  3. Your app is deployed in us-east-1a only. Your RTO is 30 minutes. What is wrong?
  4. Name a global AWS service whose control-plane writes historically depend on us-east-1. Why does that matter to a multi-region design?
  5. You have a 50 ms RTT user budget. Synchronous replication of writes from us-east-1 to eu-west-1 is ~80 ms RTT. What does this imply for "active-active multi-region"?

Mini Drill or Application

Pick a region near you and one far away. In ten minutes:

  1. List all the AZs in each region, typically 3-6 (use the provider CLI, not guesswork).
  2. For a simple web app (ALB + app servers + database + object store), decide which AZs each component lives in.
  3. Mark which components survive a single-AZ failure and which need extra work.
  4. Estimate, in words only, how much more expensive cross-region replication would be versus multi-AZ (order of magnitude).
  5. Draft a one-line entry for your ADR log: "We picked region X because Y; we tolerate Z failures; we do not tolerate W."

Read This Only If Stuck