Backup and Recovery: the Forgotten Basics

What This Concept Is

Two numbers define every recovery story:

RPO (Recovery Point Objective): how much data, measured in time, you are allowed to lose. "Our RPO is 15 minutes" means after a catastrophe you accept losing the last 15 minutes of writes.
RTO (Recovery Time Objective): how long you have to be back up. "Our RTO is 2 hours" means from "oh no" to "serving traffic again" in two hours.
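
The two definitions above reduce to two subtractions and two comparisons. A minimal sketch, assuming the 15-minute and 2-hour targets used as examples here; the function names and the incident timestamps are hypothetical:

```python
from datetime import datetime, timedelta

# Targets taken from the examples above; all names here are illustrative.
RPO = timedelta(minutes=15)
RTO = timedelta(hours=2)


def data_loss(last_recoverable_write: datetime, failure_time: datetime) -> timedelta:
    """Span of writes that cannot be recovered after the failure."""
    return failure_time - last_recoverable_write


def downtime(declared_at: datetime, serving_again_at: datetime) -> timedelta:
    """Time from declaring the incident to serving traffic again."""
    return serving_again_at - declared_at


def meets_objectives(loss: timedelta, outage: timedelta) -> dict:
    """Compare measured loss and outage with the two budgets."""
    return {"rpo_met": loss <= RPO, "rto_met": outage <= RTO}


# Hypothetical incident: last recoverable write 11:52, failure 12:00, service back 13:10.
failure = datetime(2026, 4, 22, 12, 0)
result = meets_objectives(
    data_loss(datetime(2026, 4, 22, 11, 52), failure),
    downtime(failure, datetime(2026, 4, 22, 13, 10)),
)
print(result)  # {'rpo_met': True, 'rto_met': True}
```

Keeping the two checks separate mirrors the definitions: one budget is about data, the other about time, and either can fail on its own.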

A backup is a tool for meeting your RPO. A restore drill is the only evidence that you can meet your RTO. Untested backups are rumors: bit rot, permission drift, and a missing schema all surface only during a restore. You must execute one restore, end-to-end, before the PRR (production readiness review).

Scope for a capstone:

identify the one data store whose loss is catastrophic (usually the primary DB)
define RPO and RTO in one sentence each
verify backups exist, are encrypted, and are retained
restore into a separate instance, time it, and write down what broke

The framing generalises: NIST SP 800-34 (Contingency Planning Guide for Federal Information Systems) calls these the "recovery objectives," and they apply equally to a two-person capstone and a Fortune-500 disaster recovery plan. The difference is not the framework; it is the scale of the numbers and the breadth of the scope.

Why It Matters Here (In the Capstone)

"We have backups" is something every team says until a restore is needed, at which point something is always wrong: the backup is corrupt, the IAM principal cannot read it, the restore target has the wrong encryption key, the schema has drifted. Finding that out at 3 a.m. under load is the worst possible time. Finding it out in a scheduled 90-minute drill this week is free.

The PRR has a row that says "backup tested within the last 30 days." Untested = not credible. A single drill with a written log is enough for a capstone -- but the log must exist, and it must record both the time taken and the issues found.
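
The 30-day rule and the "log must exist" rule can both be checked mechanically. The following is a sketch only; the dictionary keys `elapsed_minutes` and `issues` are hypothetical field names, not a format the PRR prescribes:

```python
from datetime import date, timedelta

# From the PRR row "backup tested within the last 30 days".
MAX_DRILL_AGE = timedelta(days=30)


def drill_is_current(last_drill: date, review_date: date) -> bool:
    """True if the most recent logged drill falls inside the 30-day window."""
    age = review_date - last_drill
    return timedelta(0) <= age <= MAX_DRILL_AGE


def log_is_complete(log: dict) -> bool:
    """A log counts only if it records elapsed time and an issues list (possibly empty)."""
    return log.get("elapsed_minutes") is not None and "issues" in log


print(drill_is_current(date(2026, 4, 22), date(2026, 5, 22)))  # True (exactly 30 days)
print(drill_is_current(date(2026, 4, 22), date(2026, 5, 23)))  # False (31 days)
print(log_is_complete({"elapsed_minutes": 44, "issues": ["parameter group"]}))  # True
```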

Concrete Example -- from a real capstone

For the webhook-handler capstone:

Primary data store: Postgres (RDS), one instance, one AZ, holds events and notification states.

RPO and RTO (documented in library/raw/recovery.md):

RPO: 15 minutes. We accept losing up to the last 15 minutes of events if the DB is lost.
RTO: 2 hours. We commit to restoring service within two hours of declaration.

Backup posture:

RDS automated backups enabled, 7-day retention.
Point-in-time recovery enabled (latest restorable time is typically within the last 5 minutes; meets RPO).
One weekly manual snapshot pushed to a separate account for ransomware / account-compromise protection.
Encryption at rest with a KMS key whose rotation is managed by AWS.
Backups defined in Terraform alongside the RDS instance, so schedule and retention cannot drift.
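
A posture like the one listed above can be checked against a description of the live instance. This is a rough sketch, not the capstone's tooling: the dictionary keys are modeled on fields of the RDS DescribeDBInstances response (`BackupRetentionPeriod`, `StorageEncrypted`, `LatestRestorableTime`), the dict is hand-built so the example runs without AWS access, and the function name and gap messages are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List

RPO = timedelta(minutes=15)
MIN_RETENTION_DAYS = 7  # the retention stated in the posture above


def posture_gaps(instance: dict, now: datetime) -> List[str]:
    """Return human-readable gaps between the instance description and the posture."""
    gaps = []
    # A retention period of 0 means automated backups are disabled.
    if instance.get("BackupRetentionPeriod", 0) < MIN_RETENTION_DAYS:
        gaps.append("retention below 7 days")
    if not instance.get("StorageEncrypted", False):
        gaps.append("storage not encrypted")
    latest = instance.get("LatestRestorableTime")
    if latest is None:
        gaps.append("point-in-time recovery unavailable")
    elif now - latest > RPO:
        gaps.append("latest restorable time is older than the RPO")
    return gaps


now = datetime(2026, 4, 22, 12, 0, tzinfo=timezone.utc)
healthy = {
    "BackupRetentionPeriod": 7,
    "StorageEncrypted": True,
    "LatestRestorableTime": now - timedelta(minutes=4),
}
print(posture_gaps(healthy, now))                                  # []
print(posture_gaps({"BackupRetentionPeriod": 0, "StorageEncrypted": True}, now))
```

The check does not cover the cross-account copy; that needs a look at the destination account.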

Restore drill log (executed once before PRR):

Date:        2026-04-22
Operator:    self
Goal:        restore yesterday's 14:00 snapshot into a new RDS instance
             and verify a known event is present.

Timeline:
  T+0     start drill
  T+2m    initiated restore of snapshot `capstone-2026-04-21-1400` into new instance
  T+31m   instance available, connection tested
  T+33m   ran `SELECT count(*) FROM events WHERE created_at = '2026-04-21 13:55:17'`
          row count: 1, matches production record
  T+35m   applied latest migrations (two migrations newer than snapshot)
  T+41m   migrations applied; schema matches current prod
  T+44m   pointed staging API at restored DB; smoke tests passed
  T+48m   tore down restored instance

Total restore time: 44 minutes. Under RTO (2h).
Data loss at restore point: snapshot was 24h old; RPO test using PITR to 2026-04-22 12:50:00 gave < 10 min loss. Within RPO (15m).

Issues found:
  1. Default parameter group on new instance lacked our `shared_preload_libraries`.
     Added that to the restore runbook.
  2. IAM user used for connectivity test did not have `rds-db:connect` on the new
     instance. Added a "grant connect to the operator" step to the runbook.

Next drill scheduled: 2026-05-22.

That log -- with the date, timeline, issues found, and next drill date -- is the artifact that passes PRR. Not a policy; a record. A clean first-run drill is suspicious; the two issues above are typical, expected, and what the drill exists to surface.
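
Because the timeline uses a regular `T+Nm` prefix, the elapsed time can be derived from the log rather than retyped. A minimal sketch, assuming an abbreviated copy of the timeline above and treating the "smoke tests passed" step as the moment service is restored; the parsing helper names are hypothetical:

```python
import re
from datetime import timedelta

RTO = timedelta(hours=2)

# Abbreviated copy of the drill timeline above.
TIMELINE = """\
T+0     start drill
T+2m    initiated restore of snapshot
T+31m   instance available, connection tested
T+33m   ran verification query
T+35m   applied latest migrations
T+41m   migrations applied; schema matches current prod
T+44m   pointed staging API at restored DB; smoke tests passed
T+48m   tore down restored instance
"""

STEP = re.compile(r"^\s*T\+(\d+)m?\s+(.*)$")


def parse_timeline(text: str) -> list:
    """Turn 'T+31m  description' lines into (offset, description) pairs."""
    steps = []
    for line in text.splitlines():
        match = STEP.match(line)
        if match:
            steps.append((timedelta(minutes=int(match.group(1))), match.group(2)))
    return steps


def time_to_service(steps: list, marker: str = "smoke tests passed") -> timedelta:
    """Offset of the first step whose description contains the marker."""
    for offset, description in steps:
        if marker in description:
            return offset
    raise ValueError(f"no step mentions {marker!r}")


steps = parse_timeline(TIMELINE)
restore_time = time_to_service(steps)
# Pair each step with its successor and find the largest gap between them.
slowest = max(zip(steps, steps[1:]), key=lambda pair: pair[1][0] - pair[0][0])
print(restore_time, restore_time <= RTO)       # 0:44:00 True
print("longest gap ends at:", slowest[1][1])   # instance available, connection tested
```

The longest gap is the wait for the restored instance to become available, which is where any effort to shorten the restore would have to start.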

Common Confusion / Misconceptions

"We enable backups, therefore we are backed up." Enabling automated backups is step one of a four-step job. Steps two through four are: verify the backup artifact exists, verify you can read it with the credentials you would use in an incident, and verify a restore actually produces a usable database.

"RPO and RTO are the same thing." They are complementary. RPO is a data-loss budget; RTO is a downtime budget. You can have a tight RPO with a loose RTO (continuous replication but slow failover), or a loose RPO with a tight RTO (hourly snapshots + fast restore). They trade off against cost and complexity.
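
The two combinations named above can be put side by side with rough arithmetic. This is a simplified sketch: worst-case loss is approximated as the backup interval plus any shipping lag, worst-case outage as detection plus restore plus verification, and every duration in the example is hypothetical.

```python
from datetime import timedelta

RPO = timedelta(minutes=15)
RTO = timedelta(hours=2)


def worst_case_loss(backup_interval, shipping_lag=timedelta(0)):
    """Upper bound on lost writes: the failure lands just before the next backup."""
    return backup_interval + shipping_lag


def worst_case_outage(detect, restore, verify):
    """Upper bound on downtime: the three phases run back to back."""
    return detect + restore + verify


# All durations below are hypothetical, chosen only to illustrate the two cases.
strategies = {
    "hourly snapshots, fast restore": (
        worst_case_loss(timedelta(hours=1)),
        worst_case_outage(timedelta(minutes=5), timedelta(minutes=20), timedelta(minutes=5)),
    ),
    "continuous replication, slow failover": (
        worst_case_loss(timedelta(0), shipping_lag=timedelta(seconds=5)),
        worst_case_outage(timedelta(minutes=10), timedelta(hours=3), timedelta(minutes=10)),
    ),
}

for name, (loss, outage) in strategies.items():
    print(f"{name}: RPO met={loss <= RPO}, RTO met={outage <= RTO}")
# hourly snapshots, fast restore: RPO met=False, RTO met=True
# continuous replication, slow failover: RPO met=True, RTO met=False
```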

"Backups are the only recovery strategy." For durability, yes. For availability, no. A hot replica, a multi-AZ deployment, or a read-replica that can be promoted are availability tools. Capstones often only need backups plus periodic restore drills; multi-AZ is a nice-to-have, not a backup substitute.

"Snapshots in the same account are enough." They protect against accidental delete but not against account compromise or ransomware-on-the-cloud-console. Cross-account or cross-region snapshots are cheap insurance and align with defense-in-depth thinking.

"The restore is too risky to practice." A drill into a separate instance is not risky. A drill into the production instance is insane. Always restore into a new target; diff the results; then decide whether to cut over.

"Our RPO/RTO are whatever the cloud defaults give us." No. RPO and RTO are business numbers, not infrastructure numbers. Write them first; then pick the backup mechanism that meets them. Doing it the other way around produces "our RPO is whatever we happen to have," which is not a commitment.

How To Use It (In Your Capstone)

Name your one catastrophic-loss data store. One.
Write RPO and RTO as one sentence each. Both must be numbers.
Document the backup posture: schedule, retention, encryption, cross-account copy. Express it in IaC.
Execute one restore into a new instance. Time it. Keep a log.
Log issues found and fix the runbook. Schedule the next drill.
Put the restore log in library/raw/recovery.md and link to it from the PRR.
Re-run a drill whenever the schema, engine version, or backup configuration changes materially.

Check Yourself

Why is an untested backup effectively a rumor from a PRR perspective?
Give one failure mode that a same-account snapshot does not protect against.
What is the smallest change that would let you test restore weekly without burning a weekend?
Why should RPO and RTO be written before the backup mechanism is chosen, not after?
How does a cross-account snapshot copy defend against a failure mode that a multi-AZ replica does not?
What single artifact do you hand to an examiner to claim your recovery story is real?

Mini Drill or Application (Capstone-scoped)

Write library/raw/recovery.md with RPO, RTO, backup schedule, retention, and encryption details.
Execute one restore from yesterday's snapshot into a new database instance.
Run one canary query against the restored instance that you also run against prod; diff results.
Record total elapsed time. If it exceeds your RTO, either invest in faster recovery (e.g., read-replica promotion or a hot replica) or loosen the RTO honestly.
Log all issues found and close them as follow-up tickets. Put the log in the runbook.
Add a cross-account or cross-region snapshot copy step if it does not already exist. Confirm that the source account cannot modify or delete the copy in the destination.

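
The canary step in the drill above can be sketched as one query run against both databases and a set difference on the results. In this example `sqlite3` in-memory databases stand in for prod and the restored instance so the code runs anywhere; a real drill would use a Postgres driver and its placeholder syntax. The table layout, the query, and all rows except the 13:55:17 event from the drill log are hypothetical.

```python
import sqlite3

# Only rows at or before the snapshot time are comparable, hence the cutoff.
CANARY = "SELECT id, created_at FROM events WHERE created_at <= ? ORDER BY id"


def run_canary(conn, cutoff: str) -> list:
    """Run the same canary query on whichever connection is passed in."""
    return conn.execute(CANARY, (cutoff,)).fetchall()


def diff_rows(prod_rows: list, restored_rows: list) -> dict:
    """Report rows present on one side only."""
    prod, restored = set(prod_rows), set(restored_rows)
    return {
        "missing_in_restore": sorted(prod - restored),
        "unexpected_in_restore": sorted(restored - prod),
    }


def make_db(rows: list):
    """Stand-in database with a minimal events table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    return conn


prod = make_db([
    (1, "2026-04-21 13:55:17"),
    (2, "2026-04-21 13:58:02"),
    (3, "2026-04-21 14:20:00"),  # written after the snapshot; excluded by the cutoff
])
restored = make_db([(1, "2026-04-21 13:55:17"), (2, "2026-04-21 13:58:02")])

cutoff = "2026-04-21 14:00:00"
print(diff_rows(run_canary(prod, cutoff), run_canary(restored, cutoff)))
# {'missing_in_restore': [], 'unexpected_in_restore': []}
```

An empty diff on both sides is the pass condition; anything under `missing_in_restore` belongs in the issues section of the drill log.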
Source Backbone

Capstone operations applies security, reliability, and distributed-systems material. These books are the source backbone for readiness review.

Building Secure and Reliable Systems - primary security and reliability backbone.
Software Engineering at Google - operational process and engineering discipline.
Designing Distributed Systems - service and reliability pattern support.
