“We can recover in an hour” is not a plan — it’s a vibe. Real disaster recovery is about turning RPO (how much data you can lose) and RTO (how long you can be down) into concrete engineering choices: backups, replication, failover, runbooks, and drills. This guide shows how to set targets you can defend, build a recovery design that matches them, and test it until you trust it.
Quickstart
If you only have 60–90 minutes, do this. These steps create clarity fast and prevent the two classic failure modes: (1) “we don’t know what we’re protecting” and (2) “we’ve never tried restoring.”
1) Write RPO/RTO per service (not “for the whole company”)
- List your top 5–10 services (API, DB, auth, queue, object storage, etc.)
- For each, write RPO and RTO as a number (minutes/hours)
- Add a one-liner: “If this is down, users can/can’t…”
- Mark the “tier 0” dependencies (auth, DNS, DB, secrets)
2) Map one recovery path end-to-end
- Pick one service + its data store (e.g., web app + Postgres)
- Decide your recovery mode: restore-from-backup, warm standby, or active-active
- Write a 10-step runbook with owners and expected durations
- Time-box uncertainty: anything “we’ll figure it out” becomes a task
3) Prove you can restore (today)
Backups aren’t real until you can restore them quickly and safely.
- Restore to a separate environment (never over prod)
- Verify data integrity: can the app query key tables / run a smoke test?
- Record the actual time: this becomes your first measured RTO
- Capture “gotchas” and turn them into automation
4) Add four “boring” hardening moves
- Immutability: protect backups from deletion (accidental or malicious)
- Separation: keep recovery credentials separate from day-to-day ops
- Monitoring: alert on missed backups and stale replicas
- Documentation: store runbooks where you can access them during an outage
You have a usable DR baseline when: (1) every critical service has an RPO/RTO, (2) backups are monitored and protected, (3) you’ve completed at least one restore drill, and (4) you can explain the recovery path without guessing.
Overview
Disaster recovery (DR) is the set of practices that let you restore service after a serious outage: cloud region failure, ransomware, a broken deploy, corrupted data, a human mistake, or “we deleted the wrong thing.” The goal is not heroics — it’s repeatability.
What this post covers
- How to set RPO/RTO targets based on business impact (not optimism)
- How to choose a recovery strategy: backup/restore vs warm standby vs active-active
- How to build runbooks and automation that actually work under stress
- How to test recoveries and measure the real numbers (your first “true” RTO)
- Common mistakes that create false confidence (and how to avoid them)
| Approach | Typical RPO/RTO range | Best for | Trade-off |
|---|---|---|---|
| Backup & restore | RPO: hours → minutes; RTO: hours | Most teams, cost-sensitive systems | Restore time + operational steps are the bottleneck |
| Warm standby | RPO: minutes; RTO: minutes → hour | Revenue-critical services | Higher cost + more moving parts (replication, failover) |
| Active-active | RPO: near-zero; RTO: near-zero | Very high availability requirements | Hardest to build correctly (consistency, split-brain, traffic) |
Backups answer “can we get data back?” DR answers “can we bring the system back within our target time?” A backup with a six-hour restore process is still a six-hour RTO.
Core concepts
DR becomes simple when you separate targets (what must be true) from mechanisms (how you achieve it), and you treat recovery as an engineered workflow you can run repeatedly.
RPO: Recovery Point Objective
How much data you can lose, measured in time. If the incident hits at 10:30 and your last good copy of data is from 10:15, you lose 15 minutes of data — acceptable only if your RPO target is 15 minutes or more.
- RPO is mostly about data freshness
- It’s controlled by backup frequency or replication lag
- Near-zero RPO implies continuous replication and hard operational discipline
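As a back-of-the-envelope check, the worst-case loss window for scheduled backups is larger than the schedule alone suggests: it includes the time the backup takes to run and to land in durable storage. A sketch with illustrative numbers:

```python
# Worst-case data-loss window for scheduled backups (illustrative numbers).
# An incident just before the next snapshot can lose roughly the interval
# plus the time the backup takes to run and reach durable storage.

def worst_case_rpo_minutes(interval: float, duration: float, upload: float) -> float:
    return interval + duration + upload

# 15-minute schedule, 3-minute dump, 2-minute upload:
print(worst_case_rpo_minutes(15, 3, 2))  # -> 20: worse than a naive "15 min"
```

If the result exceeds the target, tighten the schedule or streamline the pipeline — the target doesn’t bend.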
RTO: Recovery Time Objective
How long you can be down before unacceptable impact. RTO is the time from “incident declared” to “service is usable again” (not “we started restoring”).
- RTO is mostly about process + automation
- It includes detection, decision, restore/failover, validation, and traffic cutover
- If it’s not practiced, the real RTO is always worse than the guessed one
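One way to make an RTO target concrete is to budget it across those phases before the first drill. The phase names and minutes below are illustrative assumptions:

```python
# RTO budget as a sum of phases (illustrative estimates in minutes).
# If the phases already sum past the target, the plan is fiction before
# you even run a drill.

rto_target_min = 60
phases = {
    "detect": 10,
    "decide/declare": 10,
    "restore or failover": 25,
    "validate (smoke tests)": 10,
    "traffic cutover": 10,
}

total = sum(phases.values())
print(f"planned: {total} min vs target: {rto_target_min} min")
if total > rto_target_min:
    print(f"over budget by {total - rto_target_min} min -- shorten a phase or change the target")
```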
RPO/RTO are not the only numbers (but they’re the ones everyone remembers)
| Term | Meaning | Why it matters in practice |
|---|---|---|
| MTD / MAO | Maximum tolerable downtime (upper bound) | When exceeded, impact becomes existential (legal, revenue, trust) |
| RTA | Recovery time actual (measured) | Your drill result; the only number you should trust |
| RLA | Recovery level (what “restored” means) | Defines minimum viable function (read-only? reduced features?) |
| Blast radius | How wide a failure can spread | Controls correlated failures (same account, same region, same keys) |
Two mental models that prevent bad DR designs
Model 1 — Dependencies are the real “service”
Your API isn’t recoverable if its dependencies aren’t: database, object store, queue, identity, secrets, DNS, certificates, CI/CD, and the people who have access.
- Identify your tier-0 dependencies (auth, DB, secrets, DNS)
- Decide what you must restore first to make progress
- Design “minimal viable service” during recovery (reduced features)
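The “recover dependencies first” idea can be made mechanical with a topological sort over a dependency map. A sketch with a hypothetical service graph, using the standard-library `graphlib` (Python 3.9+):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical dependency map: service -> what must be up before it can recover.
deps = {
    "dns": set(),
    "secrets": set(),
    "database": {"secrets"},
    "auth": {"dns", "database"},
    "api": {"database", "auth"},
    "web": {"api", "dns"},
}

# static_order() emits dependencies before dependents: a defensible recovery order.
order = list(TopologicalSorter(deps).static_order())
print(order)  # e.g. ['dns', 'secrets', 'database', 'auth', 'api', 'web']
```

The same structure also catches cycles (`graphlib` raises `CycleError`), which in DR terms means “these two services can’t bootstrap each other — one needs a break-glass path.”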
Model 2 — DR is a workflow, not a feature
The fastest recovery is usually the one with fewer manual steps. Every undocumented or manual action adds minutes — and minutes compound under stress.
- Write runbooks with explicit commands and expected outputs
- Automate the repeatable parts (restore, validate, cut over)
- Practice on a schedule so “muscle memory” exists
“We have backups” is not the same as “we can recover.” The most expensive outages happen when a team assumes restore will be easy — and discovers missing credentials, stale data, broken scripts, or undocumented dependencies during the incident.
Step-by-step
This is a practical DR build path you can apply to most stacks (VMs, containers, Kubernetes, managed databases, self-hosted databases). The key is to start with a baseline that works, then iterate toward tighter RPO/RTO where it actually matters.
Step 1 — Inventory what you must recover (and in what order)
Make a one-page “recovery inventory”. If it isn’t written down, it won’t exist during an outage.
| Component | Owner | Dependencies | RPO | RTO |
|---|---|---|---|---|
| Primary database | DB / Platform | Storage, KMS/keys, network | 15 min | 60 min |
| API / backend | App team | DB, secrets, auth | 15 min | 90 min |
| Auth / identity | Platform | DNS, certificates | 60 min | 60 min |
| Object storage | Platform | IAM, keys | 60 min | 120 min |
Don’t overthink the numbers on day one. Write your best current estimate, then improve it after your first drill.
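Keeping the inventory as data makes it checkable — for example, a service’s RTO is fiction if a dependency’s RTO is larger. A sketch using the RTO minutes from the table above (service names simplified, dependency lists illustrative):

```python
# Recovery inventory as data (RTO minutes mirror the table above).
inventory = {
    "database": {"rto": 60, "deps": []},
    "auth":     {"rto": 60, "deps": []},
    "api":      {"rto": 90, "deps": ["database", "auth"]},
    "storage":  {"rto": 120, "deps": []},
}

def rto_violations(inv: dict) -> list:
    """Pairs (service, dep) where a dependency's RTO exceeds the service's own."""
    return [
        (name, dep)
        for name, svc in inv.items()
        for dep in svc["deps"]
        if inv[dep]["rto"] > svc["rto"]
    ]

print(rto_violations(inventory))  # [] -> this inventory is internally consistent
```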
Step 2 — Choose a recovery strategy per tier
Not everything needs a premium DR strategy. Use tiers to spend effort where it reduces real risk.
A simple tiering approach
- Tier 0: auth, DB, secrets, DNS (recover first)
- Tier 1: revenue-critical services
- Tier 2: internal tools, dashboards, batch jobs
- Tier 3: dev/test environments
Match tier to mechanism
- Tier 0/1 often needs warm standby or very fast restore automation
- Tier 2 can usually be backup/restore with good runbooks
- Tier 3 is “best effort” (and that’s okay if explicit)
If your current budget and team size can’t support 5-minute RTO, don’t promise it. Commit to an achievable baseline, measure it, and improve deliberately.
Step 3 — Implement backups you can restore (not just store)
Backup design is about three things: frequency (RPO), restore speed (RTO), and survivability (protection from deletion and compromise).
Backup essentials
- Automate backups (no “run it manually”)
- Encrypt at rest and in transit
- Use retention rules (daily/weekly/monthly)
- Keep backups in a separate failure domain (account/project/credentials)
- Enable immutability or write-once protection where possible
Restore essentials
- Document the restore command(s) with expected outputs
- Restore into an isolated environment first
- Run integrity checks + app smoke tests
- Measure and record real restore duration (RTA)
- Practice regularly so it stays current
Below is a concrete example for backing up and restoring a Postgres database with a fast, repeatable flow. Adapt the storage backend and secrets handling to your environment (the pattern matters more than the tools).
#!/usr/bin/env bash
set -euo pipefail
# Example: Postgres logical backup to a timestamped file, then integrity check.
# Assumes: PGPASSWORD is set securely (env var, secret manager, injected at runtime).
# Tip: keep backup credentials separate from day-to-day admin credentials.
BACKUP_DIR="${BACKUP_DIR:-/backups}"
DB_HOST="${DB_HOST:-127.0.0.1}"
DB_PORT="${DB_PORT:-5432}"
DB_NAME="${DB_NAME:-appdb}"
DB_USER="${DB_USER:-appuser}"
ts="$(date -u +%Y%m%dT%H%M%SZ)"
file="${BACKUP_DIR}/${DB_NAME}_${ts}.sql.gz"
mkdir -p "${BACKUP_DIR}"
echo "[backup] starting: ${file}"
pg_dump --host "${DB_HOST}" --port "${DB_PORT}" --username "${DB_USER}" --format=p "${DB_NAME}" \
| gzip -9 > "${file}"
echo "[backup] verifying gzip stream..."
gzip -t "${file}"
echo "[backup] done: $(du -h "${file}" | awk '{print $1}')"
# --- restore example (run in a recovery environment) ---
# gunzip -c "${file}" | psql --host "${DB_HOST}" --port "${DB_PORT}" --username "${DB_USER}" --dbname "${DB_NAME}"
# psql --host "${DB_HOST}" --port "${DB_PORT}" --username "${DB_USER}" --dbname "${DB_NAME}" -c "SELECT 1;"
Never “test restore” by overwriting production. Always restore into a separate environment or a new instance, validate there, and only then cut over.
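A cheap extra integrity signal, beyond `gzip -t`, is a checksum manifest written at backup time and verified in the recovery environment before restoring. A minimal sketch — the `manifest.json` name and layout are assumptions, adapt them to your store:

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file so large backups don't need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(backup_dir: Path) -> None:
    """At backup time: record a checksum for every artifact in the directory."""
    manifest = {
        p.name: sha256_of(p)
        for p in backup_dir.iterdir()
        if p.is_file() and p.suffix != ".json"
    }
    (backup_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))

def verify_manifest(backup_dir: Path) -> list:
    """Before restore: return artifacts that are missing or changed."""
    manifest = json.loads((backup_dir / "manifest.json").read_text())
    return [
        name for name, digest in manifest.items()
        if not (backup_dir / name).is_file()
        or sha256_of(backup_dir / name) != digest
    ]
```

A non-empty list from `verify_manifest` means stop and investigate — a tampered or truncated backup restored into staging wastes the drill; restored toward prod, it’s a second incident.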
Step 4 — Automate the repeatable parts (backup jobs + validation)
You don’t need to automate everything on day one. Start with what you run often: backups, restores, and validation checks. Automation reduces RTO because it removes decision points and manual errors.
Here’s a minimal Kubernetes CronJob pattern (conceptually useful even if you don’t run Kubernetes): a scheduled backup, stored externally, with clear separation of config (env) and execution (container).
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-backup
spec:
  schedule: "*/15 * * * *" # every 15 minutes (sets your theoretical RPO ceiling)
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 2
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      backoffLimit: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: alpine:3.20
              command: ["/bin/sh", "-lc"]
              args:
                - |
                  set -euo pipefail
                  apk add --no-cache postgresql-client gzip
                  ts="$(date -u +%Y%m%dT%H%M%SZ)"
                  file="/tmp/${DB_NAME}_${ts}.sql.gz"
                  echo "[backup] writing ${file}"
                  pg_dump -h "${DB_HOST}" -U "${DB_USER}" "${DB_NAME}" | gzip -9 > "${file}"
                  gzip -t "${file}"
                  # Upload step omitted (S3/GCS/Azure/NFS) - keep it external + durable.
                  echo "[backup] done"
              env:
                - name: DB_HOST
                  valueFrom: { secretKeyRef: { name: db, key: host } }
                - name: DB_USER
                  valueFrom: { secretKeyRef: { name: db, key: user } }
                - name: DB_NAME
                  valueFrom: { secretKeyRef: { name: db, key: name } }
                - name: PGPASSWORD
                  valueFrom: { secretKeyRef: { name: db, key: password } }
Notice two important details: concurrencyPolicy: Forbid (avoids overlapping backups) and a clear schedule (your theoretical RPO ceiling). Your actual RPO depends on whether jobs succeed and whether uploads are durable — so monitor both.
Step 5 — Write runbooks that reduce decision fatigue
During an incident, people are tired, stressed, and context-switched. A good runbook eliminates ambiguity. Keep it short, command-oriented, and validated in drills.
Runbook “minimum viable” template
- Trigger: what events activate this runbook?
- Goal: what does “restored” mean (RLA)?
- Owners: who can approve and who executes?
- Dependencies: secrets/keys/DNS/certificates required
- Steps: numbered actions with commands and expected results
- Validation: smoke tests + data integrity checks
- Rollback: if the restore is wrong, how do we revert?
- Post-incident: capture timings, gaps, and action items
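If runbook steps live as data, a small wrapper can execute them and record a per-step timing breakdown — the RTA falls out of the drill for free. A sketch; the step labels and commands are placeholders for your documented restore, validate, and cutover commands:

```python
import subprocess
import time

# Illustrative runbook: (label, command) pairs. Real steps would be your
# documented restore / validate / cutover commands.
steps = [
    ("check backup exists", "true"),
    ("restore into staging", "sleep 1"),
    ("smoke test", "true"),
]

timings = {}
for label, cmd in steps:
    start = time.monotonic()
    subprocess.run(cmd, shell=True, check=True)  # stop the drill on first failure
    timings[label] = time.monotonic() - start

for label, secs in timings.items():
    print(f"{label}: {secs:.2f}s")
print(f"total (this drill's RTA): {sum(timings.values()):.2f}s")
```

Even if you never fully automate the steps, the per-step timings tell you exactly which phase to attack first.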
Step 6 — Test DR like a product: drills, metrics, and iteration
DR testing is where guessing ends. Start small and repeat. A monthly “tiny drill” beats a yearly “big drill” that nobody remembers.
Drill types (start here)
- Tabletop: talk through a scenario and walk the runbook
- Restore drill: restore from backup into staging and validate
- Failover drill: switch to standby (if you have one)
- Game day: simulate incident conditions (access, latency, limited people)
What to measure
- Time to detect (TTD) + time to declare incident
- Time to restore data and service (your RTA)
- Data loss window observed (actual RPO)
- Number of manual steps and “unknowns” discovered
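Those measurements are simple arithmetic over the timestamps your scribe captures; a sketch with illustrative drill times:

```python
from datetime import datetime

# Timestamps captured during a drill (illustrative values).
events = {
    "last_good_backup": datetime(2024, 5, 1, 10, 15),
    "incident_start":   datetime(2024, 5, 1, 10, 30),
    "detected":         datetime(2024, 5, 1, 10, 42),
    "declared":         datetime(2024, 5, 1, 10, 50),
    "service_restored": datetime(2024, 5, 1, 12, 5),
}

def minutes(a: str, b: str) -> float:
    return (events[b] - events[a]).total_seconds() / 60

print(f"TTD:          {minutes('incident_start', 'detected'):.0f} min")
print(f"RTA:          {minutes('declared', 'service_restored'):.0f} min")
print(f"observed RPO: {minutes('last_good_backup', 'incident_start'):.0f} min")
```

Note the RTA clock starts at “declared”, matching the RTO definition above — burying slow detection inside the restore number hides where the time actually went.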
If you store backups as timestamped artifacts, you can compute a simple “backup staleness” indicator and alert when it exceeds your RPO. Here’s a small script pattern that checks the newest backup age in a folder (swap in your storage API if needed).
#!/usr/bin/env python3
import sys
import time
from pathlib import Path

# Compute "backup staleness" (minutes since newest backup file).
# Usage:
#   python backup_staleness.py /path/to/backup_dir 15
# where 15 is your RPO target in minutes.

def newest_mtime_seconds(path: Path) -> float:
    mtimes = []
    for p in path.iterdir():
        if p.is_file():
            mtimes.append(p.stat().st_mtime)
    if not mtimes:
        raise RuntimeError(f"No backup files found in {path}")
    return max(mtimes)

def main() -> int:
    if len(sys.argv) != 3:
        print("Usage: backup_staleness.py <backup_dir> <rpo_minutes>")
        return 2
    backup_dir = Path(sys.argv[1])
    rpo_minutes = float(sys.argv[2])
    if not backup_dir.exists():
        print(f"ERROR: {backup_dir} does not exist")
        return 2
    newest = newest_mtime_seconds(backup_dir)
    age_minutes = (time.time() - newest) / 60.0
    status = "OK" if age_minutes <= rpo_minutes else "ALERT"
    print(f"{status}: newest backup age = {age_minutes:.1f} min (RPO target = {rpo_minutes:.1f} min)")
    return 0 if status == "OK" else 1

if __name__ == "__main__":
    raise SystemExit(main())
The fastest RTO wins come from removing steps: pre-provision infrastructure (IaC), keep recovery configs and secrets ready, automate restore and validation, and practice the cutover path. Teams often cut RTO in half just by removing ambiguity.
Step 7 — Operate DR continuously (so it doesn’t rot)
DR fails silently when it isn’t maintained. Your goal is to make “drift” visible and fix it before an incident.
- Monitor backups (success, duration, size) and replicas (lag)
- Alert on stale backups relative to RPO targets
- Rotate recovery credentials and test access paths
- Update runbooks after every architecture change
- Schedule drills (monthly small, quarterly deeper)
Common mistakes
Most DR failures are not “we didn’t buy enough infrastructure.” They’re process and assumptions. Here are the pitfalls that create false confidence — and the fixes that restore sanity.
Mistake 1 — One RPO/RTO for everything
A single number hides critical differences. Your DB and your marketing site do not need the same targets.
- Fix: set RPO/RTO per service tier, starting with tier-0 dependencies.
- Fix: define “minimal viable service” during recovery (what can be offline temporarily?).
Mistake 2 — Backups exist, but restores are untested
Many teams discover missing keys, corrupted archives, or incomplete data during the incident.
- Fix: schedule restore drills and validate with a smoke test.
- Fix: measure RTA and use it as your real baseline RTO.
Mistake 3 — Ignoring identity, secrets, and DNS
You can’t restore what you can’t access. And you can’t cut over traffic without DNS/certs.
- Fix: include IAM, KMS/keys, secrets, certs, and DNS in your recovery inventory.
- Fix: keep break-glass access documented and tested (with strict audit).
Mistake 4 — Correlated failure domains
Backups in the same account/region with the same credentials can fail together.
- Fix: separate failure domains (account/project/credentials; ideally region too).
- Fix: use immutability / retention protections to resist deletion.
Mistake 5 — “Warm standby” without real failover practice
Having replicas is not the same as having a repeatable switchover and validation path.
- Fix: document the cutover steps, including traffic routing and health checks.
- Fix: test failover under realistic constraints (reduced staff, limited access).
Mistake 6 — No definition of “restored”
Teams argue during incidents because nobody agreed on what “up” means.
- Fix: define RLA: read-only acceptable? partial features? degraded mode?
- Fix: add explicit validation checks and an “acceptance” owner.
Mistake 7 — Documentation that depends on the thing that’s down
If your runbook lives inside the affected system, it won’t be available when you need it.
- Fix: store runbooks in a separate, highly available place (and cache the essentials offline).
- Fix: keep a short “break-glass” checklist printable or easily accessible.
If production is down right now: do you know who declares the incident, where the runbook is, what you restore first, and how you validate? If any answer is fuzzy, that’s your next DR improvement.
FAQ
What’s the difference between disaster recovery and high availability?
High availability (HA) aims to prevent downtime during common failures (instance crashes, rolling deploys). Disaster recovery (DR) assumes a serious event already happened (region outage, data corruption, compromise) and focuses on restoring service and data within agreed targets. HA reduces the number of incidents; DR limits the damage when the big ones hit.
How do I choose RPO and RTO if stakeholders don’t know?
Start with impact framing: “If we lose 1 hour of data, what breaks?” and “If we’re down for 2 hours, what’s the cost?” If answers are vague, pick a conservative baseline (e.g., RPO 60 min, RTO 4 hours for tier 2), run a drill, measure, and then decide where tighter targets are worth the cost and complexity.
Is “daily backups” enough for most systems?
It depends on how much data loss you can tolerate. Daily backups imply a worst-case RPO close to 24 hours. If your system changes frequently (orders, messages, writes), daily backups often fail the reality test. Many teams move to 15–60 minute backups (or continuous replication) for critical data, and keep daily/weekly snapshots for longer retention.
What’s the fastest way to reduce RTO without major architecture changes?
Reduce manual steps. Pre-provision infrastructure with IaC, automate restore and validation, keep recovery credentials ready (and tested), and practice small drills monthly. RTO is usually dominated by human coordination and “what do we do next?” decisions.
How often should we run DR tests?
A good cadence is monthly small drills (restore into staging + validate) and quarterly deeper drills (failover, access constraints, realistic scenarios). The right cadence is the one that prevents drift and keeps runbooks accurate.
What should we restore first during an incident?
Restore what enables everything else: tier-0 dependencies (identity/secrets, database, DNS/certificates, network access), then the services that provide the minimal viable user experience, then everything else. Your recovery inventory should encode this order.
Cheatsheet
A scan-fast DR checklist you can keep open during planning and drills. If you’re starting from zero, aim to complete the “Baseline” column first.
| Area | Baseline (start here) | Stronger (when needed) |
|---|---|---|
| Targets | RPO/RTO per service tier | RLA defined + worst-slice targets |
| Data protection | Automated backups + retention + encryption | Cross-domain copies + immutability + continuous replication |
| Recovery process | Runbooks with owners + validation steps | Automation for restore + cutover + smoke tests |
| Testing | Monthly restore drill (staging) + measure RTA | Quarterly failover/game day + access constraints |
| Operations | Alert on backup failures and stale backups | Metrics dashboard + drill schedule + postmortem actions |
Pre-drill checklist
- Pick a scenario (data corruption, region outage, accidental deletion)
- Confirm runbook location and access (break-glass ready)
- Confirm target environment for restore (isolated)
- Define validation steps (smoke test, key queries, integrity checks)
- Assign roles (incident lead, executor, verifier, scribe)
Post-drill checklist
- Record timings: detect → declare → restore → validate → cutover
- Update the runbook with missing details and command outputs
- Fix the top 1–3 gaps (automation, access, monitoring)
- Re-run the drill for the fixed parts (prove the improvement)
- Publish a short summary so the knowledge spreads
Drill → measure → remove one manual step → drill again. Repeat until your RTA matches your target RTO. This is the “without guessing” part.
Wrap-up
Disaster recovery that works is not a document — it’s a practiced capability. The winning pattern is simple: define realistic RPO/RTO per service, pick a recovery mechanism that matches the tier, and test until you have measured results you trust.
Your next 3 actions
- Today: write RPO/RTO for your top services and list tier-0 dependencies.
- This week: do one restore drill into a safe environment and record the real time (RTA).
- This month: automate the slowest steps and schedule a recurring drill so the system doesn’t rot.
If you want to go deeper on adjacent skills that make DR easier — container builds, deployments you can roll back, and visibility during incidents — the related posts below are good follow-ups.