Backup and Recovery: the Forgotten Basics
What This Concept Is
Two numbers define every recovery story:
- RPO (Recovery Point Objective): how much data, measured in time, you are allowed to lose. "Our RPO is 15 minutes" means after a catastrophe you accept losing the last 15 minutes of writes.
- RTO (Recovery Time Objective): how long you have to be back up. "Our RTO is 2 hours" means from "oh no" to "serving traffic again" in two hours.
A backup is a tool for meeting your RPO. A restore drill is the only evidence that you can meet your RTO. Untested backups are rumors: bit rot, permission drift, and a missing schema are only discovered during a restore. You must execute one restore, end-to-end, before the production readiness review (PRR).
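The two definitions above reduce to two subtractions and two comparisons. A minimal sketch, assuming you have four timestamps from an incident or drill; the class name, function name, and example timestamps are illustrative, not part of any library or of the capstone itself:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerated data loss, measured in time
    rto: timedelta  # maximum tolerated time from declaration to serving traffic


def evaluate_recovery(objectives, failure_at, last_recoverable_write,
                      declared_at, serving_again_at):
    """Compare measured data loss and downtime against the stated objectives."""
    data_loss = failure_at - last_recoverable_write
    downtime = serving_again_at - declared_at
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "meets_rpo": data_loss <= objectives.rpo,
        "meets_rto": downtime <= objectives.rto,
    }


# Hypothetical numbers: 15-minute RPO, 2-hour RTO.
objectives = RecoveryObjectives(rpo=timedelta(minutes=15), rto=timedelta(hours=2))
result = evaluate_recovery(
    objectives,
    failure_at=datetime(2026, 4, 22, 13, 0),
    last_recoverable_write=datetime(2026, 4, 22, 12, 50),
    declared_at=datetime(2026, 4, 22, 13, 0),
    serving_again_at=datetime(2026, 4, 22, 13, 44),
)
print(result["meets_rpo"], result["meets_rto"])
```

The point of keeping the two checks separate is that a drill can pass one and fail the other; the log should record both.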
Scope for a capstone:
- identify the one data store whose loss is catastrophic (usually the primary DB)
- define RPO and RTO in one sentence each
- verify backups exist, are encrypted, and are retained
- restore into a separate instance, time it, and write down what broke
The framing generalises: NIST SP 800-34 (Contingency Planning Guide for Federal Information Systems) calls these the "recovery objectives," and they apply equally to a two-person capstone and a Fortune-500 disaster recovery plan. The difference is not the framework; it is the scale of the numbers and the breadth of the scope.
Why It Matters Here (In the Capstone)
"We have backups" is something every team says until a restore is needed, at which point something is always wrong: the backup is corrupt, the IAM principal cannot read it, the restore target has the wrong encryption key, the schema has drifted. Finding that out at 3 a.m. under load is the worst possible time. Finding it out in a scheduled 90-minute drill this week is free.
The PRR has a row that says "backup tested within the last 30 days." Untested = not credible. A single drill with a written log is enough for a capstone -- but the log must exist, and it must record both the time taken and the issues found.
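The "tested within the last 30 days" row is a date comparison that can be checked mechanically. A small sketch, assuming the last drill date is copied from the drill log; the function name and the 30-day default are illustrative:

```python
from datetime import date, timedelta


def drill_is_current(last_drill, today, max_age_days=30):
    """True if the last restore drill falls within the allowed window."""
    age = today - last_drill
    # A drill dated in the future is treated as invalid rather than current.
    return timedelta(0) <= age <= timedelta(days=max_age_days)


print(drill_is_current(date(2026, 4, 22), date(2026, 5, 22)))
print(drill_is_current(date(2026, 4, 22), date(2026, 5, 23)))
```

A check like this could run in CI against the date recorded in the log, so that the PRR row goes stale visibly instead of silently.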
Concrete Example -- from a real capstone
For the webhook-handler capstone:
Primary data store: Postgres (RDS), one instance, one AZ, holds events and notification states.
RPO and RTO (documented in library/raw/recovery.md):
- RPO: 15 minutes. We accept losing up to the last 15 minutes of events if the DB is lost.
- RTO: 2 hours. We commit to restoring service within two hours of declaration.
Backup posture:
- RDS automated backups enabled, 7-day retention.
- Point-in-time recovery enabled (latest restorable time is typically within the last 5 minutes; meets RPO).
- One weekly manual snapshot pushed to a separate account for ransomware / account-compromise protection.
- Encryption at rest with a KMS key whose rotation is managed by AWS.
- Backups defined in Terraform alongside the RDS instance, so schedule and retention cannot drift.
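The posture above can also be asserted in code rather than read off a console. The sketch below checks a plain dict shaped like one entry of the `DBInstances` list that boto3's `describe_db_instances` returns; the field names `BackupRetentionPeriod`, `StorageEncrypted`, and `LatestRestorableTime` are taken from that response shape as I understand it and should be confirmed against the current boto3 documentation. The function name and thresholds are assumptions, and the cross-account copy is not covered here:

```python
from datetime import datetime, timedelta, timezone


def check_backup_posture(instance, rpo, now, min_retention_days=7):
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    retention = instance.get("BackupRetentionPeriod", 0)
    if retention < min_retention_days:
        problems.append(
            f"retention is {retention} days, expected at least {min_retention_days}"
        )
    if not instance.get("StorageEncrypted", False):
        problems.append("storage is not encrypted at rest")
    latest = instance.get("LatestRestorableTime")
    if latest is None:
        problems.append("no LatestRestorableTime reported; PITR may be unavailable")
    elif now - latest > rpo:
        problems.append("latest restorable time lags behind now by more than the RPO")
    return problems


# Hypothetical instance description for illustration only.
now = datetime(2026, 4, 22, 13, 0, tzinfo=timezone.utc)
healthy = {
    "BackupRetentionPeriod": 7,
    "StorageEncrypted": True,
    "LatestRestorableTime": now - timedelta(minutes=4),
}
print(check_backup_posture(healthy, rpo=timedelta(minutes=15), now=now))
```

Keeping the check as a pure function over a dict means it can be tested without AWS credentials; fetching the real description is a separate, thin step.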
Restore drill log (executed once before PRR):
Date: 2026-04-22
Operator: self
Goal: restore yesterday's 14:00 snapshot into a new RDS instance
and verify a known event is present.
Timeline:
T+0 start drill
T+2m initiated restore of snapshot `capstone-2026-04-21-1400` into new instance
T+31m instance available, connection tested
T+33m ran `SELECT count(*) FROM events WHERE created_at = '2026-04-21 13:55:17'`
row count: 1, matches production record
T+35m applied latest migrations (two migrations newer than snapshot)
T+41m migrations applied; schema matches current prod
T+44m pointed staging API at restored DB; smoke tests passed
T+48m tore down restored instance
Total restore time: 44 minutes. Under RTO (2h).
Data loss at restore point: snapshot was 24h old; RPO test using PITR to 2026-04-22 12:50:00 gave < 10 min loss. Within RPO (15m).
Issues found:
1. Default parameter group on new instance lacked our `shared_preload_libraries`.
Added that to the restore runbook.
2. IAM user used for connectivity test did not have `rds-db:connect` on the new
instance. Added a "grant connect to the operator" step to the runbook.
Next drill scheduled: 2026-05-22.
That log -- with the date, timeline, issues found, and next drill date -- is the artifact that passes PRR. Not a policy; a record. A clean first-run drill is suspicious; the two issues above are typical, expected, and what the drill exists to surface.
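Because the timeline uses a regular `T+<minutes>m` prefix, it can be parsed to show where the time went. A minimal sketch, assuming the log keeps exactly that prefix format; the function names are illustrative, and the sample text is copied from the timeline above:

```python
import re

# Matches lines such as "T+0 start drill" or "T+31m instance available".
TIMELINE_ENTRY = re.compile(r"^T\+(\d+)m?\s+(.+)$")


def parse_timeline(log_text):
    """Return (minutes_since_start, description) for each timeline line."""
    entries = []
    for line in log_text.splitlines():
        match = TIMELINE_ENTRY.match(line.strip())
        if match:  # continuation lines without a T+ prefix are skipped
            entries.append((int(match.group(1)), match.group(2)))
    return entries


def slowest_step(entries):
    """Return (minutes_taken, description) of the step with the largest gap."""
    gaps = [
        (later[0] - earlier[0], later[1])
        for earlier, later in zip(entries, entries[1:])
    ]
    return max(gaps, key=lambda gap: gap[0])


DRILL_LOG = """
T+0 start drill
T+2m initiated restore of snapshot `capstone-2026-04-21-1400` into new instance
T+31m instance available, connection tested
T+33m ran `SELECT count(*) FROM events WHERE created_at = '2026-04-21 13:55:17'`
row count: 1, matches production record
T+35m applied latest migrations (two migrations newer than snapshot)
T+41m migrations applied; schema matches current prod
T+44m pointed staging API at restored DB; smoke tests passed
T+48m tore down restored instance
"""

entries = parse_timeline(DRILL_LOG)
print(slowest_step(entries))
```

For this log the largest gap is the wait for the instance to become available, which is where any effort to shorten the restore would have to start.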
Common Confusion / Misconceptions
"We enable backups, therefore we are backed up." Enabling automated backups is step one of a four-step job. Steps two through four are: verify the backup artifact exists, verify you can read it with the credentials you would use in an incident, and verify a restore actually produces a usable database.
"RPO and RTO are the same thing." They are complementary. RPO is a data-loss budget; RTO is a downtime budget. You can have a tight RPO with a loose RTO (continuous replication but slow failover), or a loose RPO with a tight RTO (hourly snapshots + fast restore). They trade off against cost and complexity.
"Backups are the only recovery strategy." For durability, yes. For availability, no. A hot replica, a multi-AZ deployment, or a read-replica that can be promoted are availability tools. Capstones often only need backups plus periodic restore drills; multi-AZ is a nice-to-have, not a backup substitute.
"Snapshots in the same account are enough." They protect against accidental delete but not against account compromise or ransomware-on-the-cloud-console. Cross-account or cross-region snapshots are cheap insurance and align with defense-in-depth thinking.
"The restore is too risky to practice." A drill into a separate instance is not risky. A drill into the production instance is insane. Always restore into a new target; diff the results; then decide whether to cut over.
"Our RPO/RTO are whatever the cloud defaults give us." No. RPO and RTO are business numbers, not infrastructure numbers. Write them first; then pick the backup mechanism that meets them. Doing it the other way around produces "our RPO is whatever we happen to have," which is not a commitment.
How To Use It (In Your Capstone)
- Name your one catastrophic-loss data store. One.
- Write RPO and RTO as one sentence each. Both must be numbers.
- Document the backup posture: schedule, retention, encryption, cross-account copy. Express it in IaC.
- Execute one restore into a new instance. Time it. Keep a log.
- Log issues found and fix the runbook. Schedule the next drill.
- Put the restore log in `library/raw/recovery.md` and link to it from the PRR.
- Re-run a drill whenever the schema, engine version, or backup configuration changes materially.
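Once RPO and RTO are written as numbers, choosing a backup mechanism becomes a filter rather than a debate. A sketch under stated assumptions: the candidate names and every duration below are hypothetical placeholders, and the restore times in particular should come from your own drill rather than from this table:

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Mechanism:
    name: str
    worst_case_data_loss: timedelta   # for periodic snapshots, the snapshot interval
    expected_restore_time: timedelta  # measured in a drill, not guessed


def mechanisms_meeting(rpo, rto, candidates):
    """Return names of candidates whose worst case fits both budgets."""
    return [
        m.name
        for m in candidates
        if m.worst_case_data_loss <= rpo and m.expected_restore_time <= rto
    ]


# Hypothetical candidates and numbers, for illustration only.
CANDIDATES = [
    Mechanism("daily snapshot only", timedelta(hours=24), timedelta(minutes=45)),
    Mechanism("hourly snapshot", timedelta(hours=1), timedelta(minutes=45)),
    Mechanism("point-in-time recovery", timedelta(minutes=5), timedelta(minutes=60)),
]

chosen = mechanisms_meeting(timedelta(minutes=15), timedelta(hours=2), CANDIDATES)
print(chosen)
```

If the filter returns nothing, the honest options are to pay for a stronger mechanism or to loosen the objective in writing.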
See also (integrative)
- S8 M04 Cluster 3 -- failure modes (cascading, correlated, gray): the taxonomy that motivates cross-account copies as defense against correlated provider-account failure: `../../../../semester-08-system-design-leadership/module-04-scale-reliability-performance/concepts/cluster-03-reliability-engineering/08-failure-modes-cascading-correlated-gray-primary.md`.
- S9 M02 Infrastructure as Code -- your backup schedule, retention, and cross-account copy should all be IaC-defined so they cannot drift: `../../../../semester-09-cloud-devops/module-02-infrastructure-as-code/`.
- S9 M05 Cluster 2 -- encryption at rest, KMS, envelope encryption: the crypto baseline your backups must satisfy: `../../../../semester-09-cloud-devops/module-05-cloud-security-observability/concepts/cluster-02-secrets-keys-and-data/05-encryption-at-rest-in-transit-kms-envelope-primary.md`.
- S8 M04 Cluster 5 -- incident lifecycle: the restore drill is an in-miniature incident and should use the same declaration / operator / post-review shape: `../../../../semester-08-system-design-leadership/module-04-scale-reliability-performance/concepts/cluster-05-incident-and-observability/14-incident-lifecycle-detect-triage-mitigate-resolve-review-primary.md`.
- Google SRE Workbook -- Incident Response -- role separation and declaration process that the restore drill rehearses in miniature.
- NIST SP 800-34 Rev. 1 -- Contingency Planning Guide -- the canonical public-sector reference for RPO, RTO, and backup strategy.
- AWS -- Restoring a DB Instance to a Specified Time -- the vendor-specific mechanics for the example above.
Check Yourself
- Why is an untested backup effectively a rumor from a PRR perspective?
- Give one failure mode that a same-account snapshot does not protect against.
- What is the smallest change that would let you test restore weekly without burning a weekend?
- Why should RPO and RTO be written before the backup mechanism is chosen, not after?
- How does a cross-account snapshot copy defend against a failure mode that a multi-AZ replica does not?
- What single artifact do you hand to an examiner to claim your recovery story is real?
Mini Drill or Application (Capstone-scoped)
- Write `library/raw/recovery.md` with RPO, RTO, backup schedule, retention, and encryption details.
- Execute one restore from yesterday's snapshot into a new database instance.
- Run one canary query against the restored instance that you also run against prod; diff results.
- Record total elapsed time. If it exceeds your RTO, either invest in a faster recovery path (e.g., read-replica promotion) or loosen the RTO honestly. Note that point-in-time recovery tightens RPO; it does not by itself make the restore faster.
- Log all issues found and close them as follow-up tickets. Put the log in the runbook.
- Add a cross-account or cross-region snapshot copy step if it does not already exist. Confirm the destination is read-only from the source account.
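The canary-query step in the list above can be sketched as a set difference between the two result sets. The example uses in-memory SQLite databases as stand-ins for prod and the restored instance so that it runs anywhere; the table layout, helper names, and rows are assumptions, and against Postgres you would swap in your own driver and connection strings:

```python
import sqlite3

# Only rows at or before the restore point are expected in the restored copy.
CANARY_QUERY = "SELECT id, created_at FROM events WHERE created_at <= ? ORDER BY id"


def run_canary(conn, restore_point):
    """Run the same bounded query against whichever database is passed in."""
    return conn.execute(CANARY_QUERY, (restore_point,)).fetchall()


def diff_rows(prod_rows, restored_rows):
    """Report rows present on one side only."""
    prod_set, restored_set = set(prod_rows), set(restored_rows)
    return {
        "missing_in_restore": sorted(prod_set - restored_set),
        "unexpected_in_restore": sorted(restored_set - prod_set),
    }


def make_demo_db(rows):
    """Build a throwaway in-memory database (stand-in for a real instance)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    return conn


# Hypothetical data: the restored copy is missing one row it should contain.
prod = make_demo_db([
    (1, "2026-04-21 13:55:17"),
    (2, "2026-04-21 13:58:40"),
    (3, "2026-04-21 14:20:00"),  # written after the restore point; excluded
])
restored = make_demo_db([(1, "2026-04-21 13:55:17")])

restore_point = "2026-04-21 14:00:00"
report = diff_rows(run_canary(prod, restore_point), run_canary(restored, restore_point))
print(report)
```

Bounding the query by the restore point matters: without it, every write made to prod after the snapshot would show up as a false "missing" row.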
Source Backbone
Capstone operations applies security, reliability, and distributed-systems material. These books are the source backbone for readiness review.
- Building Secure and Reliable Systems - primary security and reliability backbone.
- Software Engineering at Google - operational process and engineering discipline.
- Designing Distributed Systems - service and reliability pattern support.