Backup and Recovery: the Forgotten Basics

What This Concept Is

Two numbers define every recovery story:

  • RPO (Recovery Point Objective): how much data, measured in time, you are allowed to lose. "Our RPO is 15 minutes" means after a catastrophe you accept losing the last 15 minutes of writes.
  • RTO (Recovery Time Objective): how long you have to be back up. "Our RTO is 2 hours" means from "oh no" to "serving traffic again" in two hours.

A backup is a tool for meeting your RPO. A restore drill is the only evidence that you can meet your RTO. Untested backups are rumors: bit rot, permission drift, and missing schema are only discovered during a restore. You must execute one restore, end-to-end, before the production readiness review (PRR).
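The two budgets can be written down as a pair of comparisons. The sketch below is a minimal illustration, not a prescribed tool: the class `RecoveryObjectives` and its method names are hypothetical, and the 15-minute and 2-hour values are simply the examples quoted above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Hypothetical container for the two recovery numbers."""
    rpo: timedelta  # maximum acceptable data loss, measured in time
    rto: timedelta  # maximum acceptable time to restore service

    def meets_rpo(self, measured_data_loss: timedelta) -> bool:
        # Data loss is the gap between the failure and the last restorable point.
        return measured_data_loss <= self.rpo

    def meets_rto(self, measured_restore_time: timedelta) -> bool:
        # Restore time runs from declaration to serving traffic again.
        return measured_restore_time <= self.rto


# Example values taken from the definitions above.
objectives = RecoveryObjectives(rpo=timedelta(minutes=15), rto=timedelta(hours=2))

if __name__ == "__main__":
    print(objectives.meets_rpo(timedelta(minutes=10)))   # True: 10 min <= 15 min
    print(objectives.meets_rto(timedelta(minutes=150)))  # False: 150 min > 120 min
```

The measured values can only come from a restore drill; the comparison itself is trivial, which is the point.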

Scope for a capstone:

  • identify the one data store whose loss is catastrophic (usually the primary DB)
  • define RPO and RTO in one sentence each
  • verify backups exist, are encrypted, and are retained
  • restore into a separate instance, time it, and write down what broke

The framing generalises: NIST SP 800-34 (Contingency Planning Guide for Federal Information Systems) defines the same two recovery objectives, and they apply equally to a two-person capstone and a Fortune-500 disaster recovery plan. The difference is not the framework; it is the scale of the numbers and the breadth of the scope.

Why It Matters Here (In the Capstone)

"We have backups" is something every team says until a restore is needed, at which point something is always wrong: the backup is corrupt, the IAM principal cannot read it, the restore target has the wrong encryption key, the schema has drifted. Finding that out at 3 a.m. under load is the worst possible time. Finding it out in a scheduled 90-minute drill this week is free.

The PRR has a row that says "backup tested within the last 30 days." Untested = not credible. A single drill with a written log is enough for a capstone -- but the log must exist, and it must record both the time taken and the issues found.
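As a minimal sketch, the 30-day row reduces to a date comparison. The function name `drill_is_current` and the constant `MAX_DRILL_AGE` are assumptions made for illustration; they are not part of any PRR tooling.

```python
from datetime import date, timedelta

# Taken from the PRR row: "backup tested within the last 30 days".
MAX_DRILL_AGE = timedelta(days=30)


def drill_is_current(last_drill: date, today: date,
                     max_age: timedelta = MAX_DRILL_AGE) -> bool:
    """Return True if the most recent restore drill is still recent enough."""
    if last_drill > today:
        raise ValueError("last_drill is in the future")
    return today - last_drill <= max_age


if __name__ == "__main__":
    # A drill 10 days ago is current; one 45 days ago is not.
    print(drill_is_current(date(2026, 1, 1), date(2026, 1, 11)))
    print(drill_is_current(date(2026, 1, 1), date(2026, 2, 15)))
```

A check like this could run in CI against the date recorded in the drill log, so a stale drill is flagged before the review rather than during it.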

Concrete Example -- from a real capstone

For the webhook-handler capstone:

Primary data store: Postgres (RDS), one instance, one AZ, holds events and notification states.

RPO and RTO (documented in library/raw/recovery.md):

  • RPO: 15 minutes. We accept losing up to the last 15 minutes of events if the DB is lost.
  • RTO: 2 hours. We commit to restoring service within two hours of declaration.

Backup posture:

  • RDS automated backups enabled, 7-day retention.
  • Point-in-time recovery enabled (latest restorable time is typically within the last 5 minutes; meets RPO).
  • One weekly manual snapshot pushed to a separate account for ransomware / account-compromise protection.
  • Encryption at rest with a KMS key whose rotation is managed by AWS.
  • Backups defined in Terraform alongside the RDS instance, so schedule and retention cannot drift.
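The posture above can also be checked mechanically. The following is a sketch under stated assumptions: `BackupPosture`, its field names, and `posture_gaps` are all hypothetical, and the values would have to be filled in from your own Terraform state or cloud API rather than typed by hand.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass(frozen=True)
class BackupPosture:
    # Hypothetical field names; populate them from your own IaC or cloud API.
    automated_backups_enabled: bool
    retention_days: int
    pitr_granularity: Optional[timedelta]  # None means PITR is disabled
    encrypted_at_rest: bool
    cross_account_copy: bool


def posture_gaps(posture: BackupPosture, rpo: timedelta,
                 min_retention_days: int = 7) -> List[str]:
    """Return human-readable gaps; an empty list means none were detected."""
    gaps = []
    if not posture.automated_backups_enabled:
        gaps.append("automated backups are disabled")
    if posture.retention_days < min_retention_days:
        gaps.append(
            f"retention is {posture.retention_days} days, below {min_retention_days}"
        )
    if posture.pitr_granularity is None or posture.pitr_granularity > rpo:
        gaps.append("no restore point is fine-grained enough to meet the RPO")
    if not posture.encrypted_at_rest:
        gaps.append("backups are not encrypted at rest")
    if not posture.cross_account_copy:
        gaps.append("no copy exists outside the source account")
    return gaps


# The capstone posture listed above, expressed in the hypothetical structure.
capstone = BackupPosture(
    automated_backups_enabled=True,
    retention_days=7,
    pitr_granularity=timedelta(minutes=5),
    encrypted_at_rest=True,
    cross_account_copy=True,
)

if __name__ == "__main__":
    print(posture_gaps(capstone, rpo=timedelta(minutes=15)))
```

An empty result only says the configuration looks right on paper; it is not a substitute for the restore drill.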

Restore drill log (executed once before PRR):

Date:     2026-04-22
Operator: self
Goal:     restore yesterday's 14:00 snapshot into a new RDS instance
          and verify a known event is present.

Timeline:
  T+0    start drill
  T+2m   initiated restore of snapshot `capstone-2026-04-21-1400` into new instance
  T+31m  instance available, connection tested
  T+33m  ran `SELECT count(*) FROM events WHERE created_at = '2026-04-21 13:55:17'`
         row count: 1, matches production record
  T+35m  applied latest migrations (two migrations newer than snapshot)
  T+41m  migrations applied; schema matches current prod
  T+44m  pointed staging API at restored DB; smoke tests passed
  T+48m  tore down restored instance

Total restore time: 44 minutes (T+0 to smoke tests passing). Under RTO (2h).
Data loss at restore point: the snapshot was about a day old, which alone would exceed RPO; the RPO test using PITR to 2026-04-22 12:50:00 gave < 10 min loss. Within RPO (15m).

Issues found:
  1. Default parameter group on new instance lacked our `shared_preload_libraries`.
     Added that to the restore runbook.
  2. IAM user used for connectivity test did not have `rds-db:connect` on the new
     instance. Added a "grant connect to the operator" step to the runbook.

Next drill scheduled: 2026-05-22.

That log -- with the date, timeline, issues found, and next drill date -- is the artifact that passes PRR. Not a policy; a record. A clean first-run drill is suspicious; the two issues above are typical, expected, and what the drill exists to surface.
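Because the log uses a regular `T+<minutes>m` prefix, the restore time can be derived from the log rather than asserted separately. The sketch below is one way to do that; the helper names `parse_timeline` and `time_to_milestone` are hypothetical, and the parser assumes every timeline entry starts with `T+` followed by whole minutes.

```python
import re
from datetime import timedelta
from typing import List, Tuple

# Matches "T+0 start drill" and "T+31m instance available"; the "m" is optional.
TIMELINE_ENTRY = re.compile(r"^T\+(\d+)m?\s+(.*)$")

DRILL_LOG = """\
T+0 start drill
T+2m initiated restore of snapshot `capstone-2026-04-21-1400` into new instance
T+31m instance available, connection tested
T+33m ran `SELECT count(*) FROM events WHERE created_at = '2026-04-21 13:55:17'`
row count: 1, matches production record
T+35m applied latest migrations (two migrations newer than snapshot)
T+41m migrations applied; schema matches current prod
T+44m pointed staging API at restored DB; smoke tests passed
T+48m tore down restored instance
"""


def parse_timeline(log_text: str) -> List[Tuple[timedelta, str]]:
    """Return (offset, description) pairs; continuation lines are skipped."""
    entries = []
    for line in log_text.splitlines():
        match = TIMELINE_ENTRY.match(line.strip())
        if match:
            offset = timedelta(minutes=int(match.group(1)))
            entries.append((offset, match.group(2)))
    return entries


def time_to_milestone(entries: List[Tuple[timedelta, str]], keyword: str) -> timedelta:
    """Return the offset of the first entry whose description contains keyword."""
    for offset, description in entries:
        if keyword in description:
            return offset
    raise LookupError(f"no timeline entry mentions {keyword!r}")


if __name__ == "__main__":
    restore_time = time_to_milestone(parse_timeline(DRILL_LOG), "smoke tests passed")
    print(restore_time <= timedelta(hours=2))  # True: 44 minutes is under the 2h RTO
```

Choosing "smoke tests passed" as the milestone is a judgment call: it marks the point where the restored database was shown to serve traffic, which is what the RTO measures.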

Common Confusion / Misconceptions

"We enable backups, therefore we are backed up." Enabling automated backups is step one of a four-step job. Steps two through four are: verify the backup artifact exists, verify you can read it with the credentials you would use in an incident, and verify a restore actually produces a usable database.

"RPO and RTO are the same thing." They are complementary. RPO is a data-loss budget; RTO is a downtime budget. You can have a tight RPO with a loose RTO (continuous replication but slow failover), or a loose RPO with a tight RTO (hourly snapshots + fast restore). They trade off against cost and complexity.

"Backups are the only recovery strategy." For durability, yes. For availability, no. A hot replica, a multi-AZ deployment, or a read-replica that can be promoted are availability tools. Capstones often only need backups plus periodic restore drills; multi-AZ is a nice-to-have, not a backup substitute.

"Snapshots in the same account are enough." They protect against accidental delete but not against account compromise or ransomware-on-the-cloud-console. Cross-account or cross-region snapshots are cheap insurance and align with SLSA / defense-in-depth thinking.

"The restore is too risky to practice." A drill into a separate instance is not risky. A drill into the production instance is insane. Always restore into a new target; diff the results; then decide whether to cut over.

"Our RPO/RTO are whatever the cloud defaults give us." No. RPO and RTO are business numbers, not infrastructure numbers. Write them first; then pick the backup mechanism that meets them. Doing it the other way around produces "our RPO is whatever we happen to have," which is not a commitment.

How To Use It (In Your Capstone)

  1. Name your one catastrophic-loss data store. One.
  2. Write RPO and RTO as one sentence each. Both must be numbers.
  3. Document the backup posture: schedule, retention, encryption, cross-account copy. Express it in IaC.
  4. Execute one restore into a new instance. Time it. Keep a log.
  5. Log issues found and fix the runbook. Schedule the next drill.
  6. Put the restore log in library/raw/recovery.md and link to it from the PRR.
  7. Re-run a drill whenever the schema, engine version, or backup configuration changes materially.

Check Yourself

  1. Why is an untested backup effectively a rumor from a PRR perspective?
  2. Give one failure mode that a same-account snapshot does not protect against.
  3. What is the smallest change that would let you test restore weekly without burning a weekend?
  4. Why should RPO and RTO be written before the backup mechanism is chosen, not after?
  5. How does a cross-account snapshot copy defend against a failure mode that a multi-AZ replica does not?
  6. What single artifact do you hand to an examiner to claim your recovery story is real?

Mini Drill or Application (Capstone-scoped)

  1. Write library/raw/recovery.md with RPO, RTO, backup schedule, retention, and encryption details.
  2. Execute one restore from yesterday's snapshot into a new database instance.
  3. Run one canary query against the restored instance that you also run against prod; diff results.
  4. Record total elapsed time. If it exceeds your RTO, either invest in faster recovery (e.g., a promotable read replica or multi-AZ failover) or loosen the RTO honestly.
  5. Log all issues found and track each one as a follow-up ticket until it is closed. Put the log in the runbook.
  6. Add a cross-account or cross-region snapshot copy step if it does not already exist. Confirm the destination is read-only from the source account.
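The canary comparison in step 3 can be sketched as follows. This is an illustration under assumptions: in-memory `sqlite3` databases stand in for the production and restored Postgres instances so the example runs anywhere, the helper names are hypothetical, and a Postgres driver such as psycopg would use `%s` placeholders instead of `?`.

```python
import sqlite3
from typing import Iterable

# Same shape as the canary used in the drill log; "?" is sqlite3's placeholder.
CANARY_QUERY = "SELECT count(*) FROM events WHERE created_at = ?"


def make_demo_db(created_at_values: Iterable[str]) -> sqlite3.Connection:
    """Build an in-memory stand-in for a database holding an events table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
    conn.executemany(
        "INSERT INTO events (created_at) VALUES (?)",
        [(value,) for value in created_at_values],
    )
    return conn


def run_canary(conn: sqlite3.Connection, created_at: str) -> list:
    """Run the canary query and return all rows for comparison."""
    return conn.execute(CANARY_QUERY, (created_at,)).fetchall()


def canary_matches(prod: sqlite3.Connection, restored: sqlite3.Connection,
                   created_at: str) -> bool:
    """True when both databases return identical rows for the canary."""
    return run_canary(prod, created_at) == run_canary(restored, created_at)


if __name__ == "__main__":
    known_event = "2026-04-21 13:55:17"
    prod = make_demo_db([known_event, "2026-04-21 15:30:00"])
    restored = make_demo_db([known_event])  # snapshot predates the later event
    print(canary_matches(prod, restored, known_event))
```

Pick a canary row that predates the snapshot; a row written after the snapshot will legitimately differ and tells you nothing about restore integrity.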

Source Backbone

Capstone operations applies security, reliability, and distributed-systems material. These books are the source backbone for readiness review.