“We have backups” is not a recovery plan. Ransomware doesn’t just encrypt files—it often hunts for backup repositories, deletes restore points, steals data, and leaves you with untrusted systems. This guide is a ransomware reality check: what “good backups” actually mean, how to design them to survive an attacker, and how to test restores so you can recover with confidence (not hope).
Quickstart
If you only have an hour or two, do these high-leverage moves first. The goal is simple: make sure you have at least one backup an attacker can’t erase, and prove you can restore it.
1) Set your recovery targets (RPO/RTO)
Without targets, “backup frequency” is guesswork. Decide what you can afford to lose and how fast you must be back.
- RPO: max data loss you can tolerate (e.g., 4 hours)
- RTO: max downtime you can tolerate (e.g., 8 hours)
- Write targets per system: DB, file shares, SaaS, laptops
- Make tradeoffs explicit (cost vs speed)
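The targets above can be written down as data and sanity-checked automatically. A minimal sketch (system names and numbers are illustrative, not recommendations): if a system's backup interval is longer than its RPO, the target is impossible by construction.

```python
# Illustrative per-system targets (hours). Values are examples only.
TARGETS = {
    "orders-db":  {"rpo_h": 4,  "rto_h": 8,  "backup_interval_h": 2},
    "file-share": {"rpo_h": 24, "rto_h": 24, "backup_interval_h": 24},
    "saas-mail":  {"rpo_h": 24, "rto_h": 48, "backup_interval_h": 24},
}

def rpo_violations(targets: dict) -> list[str]:
    """A backup interval longer than the RPO guarantees missed targets."""
    return [name for name, t in targets.items()
            if t["backup_interval_h"] > t["rpo_h"]]

print(rpo_violations(TARGETS))  # → [] means every system is backed up often enough
```

Keeping targets in a reviewable file like this also makes the cost-vs-speed tradeoff explicit: tightening an RPO means shortening an interval someone can see and price.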
2) Enforce 3-2-1-1-0 (the survivable baseline)
“3-2-1” is good. “3-2-1-1-0” is better for ransomware: add an immutable/offline copy and verify.
- 3 copies of data (production + 2 backups)
- 2 different media / storage types
- 1 offsite copy (separate blast radius)
- 1 immutable/offline copy (can’t be deleted)
- 0 errors (verified restores, not assumptions)
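The first four rules can be checked mechanically against an inventory of your copies. A sketch with made-up copy names (the "0 errors" rule is deliberately absent: it can only be proven by restore tests, not by inventory):

```python
# Illustrative inventory of data copies; names and media types are examples.
COPIES = [
    {"name": "production",  "media": "ssd",    "offsite": False, "immutable": False},
    {"name": "local-repo",  "media": "nas",    "offsite": False, "immutable": False},
    {"name": "cloud-vault", "media": "object", "offsite": True,  "immutable": True},
]

def check_3_2_1_1_0(copies: list[dict]) -> dict[str, bool]:
    return {
        "3_copies":    len(copies) >= 3,
        "2_media":     len({c["media"] for c in copies}) >= 2,
        "1_offsite":   any(c["offsite"] for c in copies),
        "1_immutable": any(c["immutable"] for c in copies),
        # "0_errors" is intentionally missing: only restore tests can verify it.
    }

print(check_3_2_1_1_0(COPIES))
```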
3) Split identities: backups must not share admin keys
Ransomware commonly wins by stealing privileged credentials. Backup systems need separate accounts and least privilege.
- Dedicated backup account + MFA
- No domain admin for backup operators
- Backup storage credentials can’t delete (immutability)
- Break-glass account stored offline
4) Run one real restore test this week
The only backup that matters is the one you can restore under pressure. Do one “boring” restore test and document it as a runbook.
- Pick a representative system (DB or file share)
- Restore into an isolated environment
- Validate with checksums/app-level health checks
- Record time + steps + missing dependencies
If an attacker with your admin credentials can delete every restore point, you don’t have a ransomware-ready backup. You have a convenience copy.
Overview
Ransomware is a business problem disguised as malware: attackers aim to stop operations and pressure you into paying. Backups are your best “no” — but only if they’re designed for an adversary who can log in, not just a disk that fails.
What this post covers
- What “good backups” mean in a ransomware threat model (not just hardware failure)
- How to design survivable backup architecture (immutability, separation, offsite)
- How to protect backup identities and stop attackers from deleting restore points
- How to test restores and verify integrity (so recovery isn’t a surprise)
- A cheatsheet + a quiz to lock in the concepts
| Backup capability | What it protects against | Common failure mode |
|---|---|---|
| Versioned backups | Accidental deletes, corruption, some ransomware | Too-short retention; encrypted files replace clean versions |
| Offsite copy | Site disaster, local compromise blast radius | Offsite is still reachable with stolen credentials |
| Immutability / WORM | Credentialed attackers deleting backups | Misconfigured delete permissions; “immutable” not enforced |
| Restore testing | Unknown unknowns (missing keys, bad scripts) | Tests are skipped; only “backup jobs succeeded” is monitored |
As you read, imagine: “An attacker has admin access on Friday night.” For each step, ask: Can they erase my last good copy? If yes, adjust.
Core concepts
1) The ransomware backup threat model (why “it’s on the network” matters)
Traditional backups were designed for accidents: disk failure, bad deployments, human mistakes. Ransomware adds an adversary who tries to destroy recovery. That usually looks like:
- Steal privileged credentials (phishing, token theft, lateral movement)
- Disable backup agents or delete jobs/snapshots
- Delete or encrypt backup repositories and catalogs
- Exfiltrate data to add pressure (double extortion)
The key insight: if backups are writable with day-to-day admin credentials, they’re part of the same failure domain.
2) RPO and RTO: the two numbers that shape everything
Backups are about outcomes. RPO and RTO translate “security” into “operations”.
RPO (Recovery Point Objective)
How much data you can lose. If your RPO is 4 hours, you need restore points at least every 4 hours.
- Databases: often minutes–hours
- File shares: hours–days (depends)
- SaaS: check provider limitations
RTO (Recovery Time Objective)
How long you can be down. If your RTO is 8 hours, you must be able to restore and validate within 8 hours.
- Consider dependency chains (DNS, identity, network)
- Test on realistic hardware and bandwidth
- Document the slow steps (they define RTO)
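A quick feasibility check catches RTO fantasies early: raw transfer time over your real restore bandwidth is a hard floor, and cataloging, decryption, and validation add overhead on top. A back-of-envelope sketch (the 1.5x overhead factor is an assumption; measure your own):

```python
def restore_hours(data_gb: float, bandwidth_mbps: float,
                  overhead_factor: float = 1.5) -> float:
    """Estimate restore time: raw transfer plus an assumed overhead factor
    for cataloging, decryption, and validation."""
    seconds = (data_gb * 8 * 1000) / bandwidth_mbps  # GB -> megabits -> seconds
    return (seconds / 3600) * overhead_factor

# 2 TB over a 1 Gbps link: transfer alone dominates the RTO.
print(f"{restore_hours(data_gb=2000, bandwidth_mbps=1000):.1f} h")  # → 6.7 h
```

If that number already exceeds your RTO, no amount of runbook polish will save you; you need more bandwidth, less data per tier, or a different RTO.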
3) 3-2-1-1-0: the backup principle that survives attackers
3-2-1 is a solid baseline. Ransomware pushes you to add “1-0”: one immutable/offline copy, and zero errors verified by restore tests.
What “immutable” means (and what it doesn’t)
| Term | Practical meaning | Gotcha |
|---|---|---|
| Immutability (WORM) | Write once; cannot delete/overwrite until retention expires | Admins can still break it if configured wrong |
| Air gap | Not continuously reachable from production network | “Same cloud account” is not an air gap |
| Offline | Disconnected media or vault access only during backup window | Human process risk (someone forgets to disconnect) |
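The defining property of WORM retention is that the earliest possible deletion date is fixed at write time, regardless of who asks. A tiny sketch of that semantic (dates are examples):

```python
from datetime import date, timedelta

def deletable_on(created: date, retention_days: int) -> date:
    """With object-lock/WORM semantics, a restore point cannot be deleted
    before created + retention, no matter whose credentials are used."""
    return created + timedelta(days=retention_days)

print(deletable_on(date(2024, 1, 1), retention_days=30))  # → 2024-01-31
```

The gotcha column above is the real risk: if the lock mode allows a privileged identity to shorten retention, the math holds but the guarantee doesn't.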
4) Separation: identities, networks, and blast radius
Ransomware resilience is mostly about separation: making sure compromise of one area doesn’t grant control over everything.
Identity separation
- Dedicated backup operator accounts
- MFA everywhere (including backup console)
- Break-glass credentials stored offline
- Service accounts with minimal scope
Network & storage separation
- Backup network/VLAN isolated from user endpoints
- Backup repositories not domain-joined (when possible)
- Separate accounts/tenants for offsite storage
- Immutable bucket/vault policies enforced server-side
Storage snapshots are great for fast rollback, but many live in the same admin plane and can be deleted by a credentialed attacker. Treat snapshots as a speed layer, not the final safety net.
Step-by-step
This section is a practical build plan. You can apply it to a home lab, a startup stack, or a mid-size org. The names of tools vary, but the architecture and habits stay the same.
Step 1 — Inventory what you must restore (and in what order)
In a real incident, you don’t restore “everything at once”. You restore capabilities: identity, core services, apps, then endpoints. Start by listing:
- Tier 0: identity, DNS, certificates, secrets, core networking
- Tier 1: databases, storage, message queues, key internal apps
- Tier 2: file shares, collaboration tools, secondary services
- Tier 3: endpoints, dev boxes, “nice-to-have” systems
This tiering prevents the classic failure: restoring an application before the identity or database it depends on.
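Once systems have tier numbers, the restore order falls out of a simple sort, which makes the plan deterministic and easy to review. A sketch (system names and tier assignments are illustrative):

```python
# Illustrative tier assignments: 0 = identity/core, 3 = endpoints.
TIERS = {
    "identity-provider": 0, "dns": 0,
    "orders-db": 1, "object-storage": 1,
    "file-share": 2,
    "dev-laptops": 3,
}

def restore_order(tiers: dict[str, int]) -> list[str]:
    # Sort by tier first, then by name, so the plan is stable between runs.
    return sorted(tiers, key=lambda s: (tiers[s], s))

print(restore_order(TIERS))  # identity and DNS first, endpoints last
```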
Step 2 — Set RPO/RTO per tier and choose a backup method that can meet it
Different systems need different strategies. A database usually wants point-in-time recovery; a static file share may not. Use this mapping as a starting point:
| System type | Typical approach | Why it works |
|---|---|---|
| Databases | Full + incremental + log shipping / PITR | Fine-grained restore points, consistent recovery |
| VMs / servers | Image-based backups + config export | Fast “whole system” restore, easy rebuild |
| File shares | Versioned file backups + immutable copy | Recovers before encryption; handles deletes |
| SaaS (email, docs) | Provider exports or third-party backup | Protects against account takeover and retention limits |
| Kubernetes / IaC | Git as source of truth + periodic cluster state backup | Rebuild infra quickly; restore only stateful pieces |
Step 3 — Build a two-layer recovery architecture (fast layer + survivable layer)
A useful mental model is two layers:
Layer A — Fast recovery (operational convenience)
Snapshots and local backup repositories that get you back quickly for normal incidents.
- Frequent backups for tight RPO
- Short retention for fast restores
- Local bandwidth = fast recovery
Layer B — Survivable recovery (ransomware safety net)
An offsite, immutable/offline copy that remains intact even during credential compromise.
- Immutable retention enforced server-side
- Separate identity / tenant / account if possible
- Restore procedures tested and documented
Ransomware-ready backups can be slower or more expensive. You still want a fast layer for day-to-day restores — just don’t confuse “fast” with “safe”.
Step 4 — Harden the backup plane (make deletion hard)
Most backup failures during ransomware incidents aren’t “the backup job failed.” They’re “someone logged into the backup console and deleted the history.” Harden the backup plane like it’s production.
Identity hardening checklist
- Separate backup admin accounts (not daily admin)
- Strong MFA (phishing-resistant where possible)
- Limit backup console access (jump host / VPN)
- Service accounts can write backups but cannot delete
- Alert on privilege changes and failed MFA
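The "write but not delete" rule is worth stating as an explicit permission model, even conceptually. A sketch (role and action names are invented for illustration; real enforcement must live server-side in your storage platform's policy engine, not in application code):

```python
# Conceptual permission model: no backup identity holds a delete right.
ALLOWED = {
    "backup-writer": {"object.put", "object.get"},
    "backup-admin":  {"object.put", "object.get", "policy.read"},
}
# Note: "object.delete" appears nowhere; objects leave only via retention expiry.

def may(identity: str, action: str) -> bool:
    return action in ALLOWED.get(identity, set())

print(may("backup-writer", "object.put"),
      may("backup-writer", "object.delete"))  # → True False
```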
Storage hardening checklist
- Enable immutability/WORM with retention policy
- Separate encryption keys from production admins
- Disable “delete all” paths (policy + technical controls)
- Protect the backup catalog/metadata
- Monitor for mass deletions and unusual access
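"Monitor for mass deletions" can be as simple as counting delete events per actor over a log window. A sketch (the event shape and threshold are assumptions; map them onto your backup platform's audit log):

```python
from collections import Counter

def mass_delete_alerts(events: list[dict], threshold: int = 5) -> list[str]:
    """Return actors who deleted more than `threshold` restore points
    in the given log window (event schema is illustrative)."""
    deletes = Counter(e["actor"] for e in events
                      if e["action"] == "snapshot.delete")
    return [actor for actor, n in deletes.items() if n > threshold]

events = (
    [{"actor": "backup-svc", "action": "snapshot.create"}] * 20
    + [{"actor": "admin-jane", "action": "snapshot.delete"}] * 12  # suspicious burst
)
print(mass_delete_alerts(events))  # → ['admin-jane']
```

Normal operations delete restore points slowly (retention expiry); attackers delete them in bursts, which is exactly what a per-actor counter surfaces.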
Step 5 — Automate backups + retention + basic verification
Automation is how you keep backups boring (boring is good). Below is a practical example using a versioned backup tool and an object store as a repository. Adapt the concepts to your platform: schedule, encrypt, retain, and verify.
```bash
#!/usr/bin/env bash
set -euo pipefail

# Example: a simple encrypted, versioned backup routine (adapt to your tooling)
# - Initializes repository (once)
# - Runs backup
# - Applies retention ("forget/prune")
# - Performs a lightweight integrity check
#
# Tip: store credentials in a secret manager, not in this file.

export RESTIC_REPOSITORY="s3:https://s3.example.com/my-backups"
export RESTIC_PASSWORD_FILE="/etc/backup/restic.pass"
export AWS_ACCESS_KEY_ID="$(cat /etc/backup/s3_access_key)"
export AWS_SECRET_ACCESS_KEY="$(cat /etc/backup/s3_secret_key)"

HOST_TAG="$(hostname -s)"
DATE="$(date +%F)"

# 1) Initialize repo once (idempotent-ish check)
if ! restic snapshots >/dev/null 2>&1; then
  restic init
fi

# 2) Backup critical paths (tune per host)
restic backup \
  /etc \
  /var/lib \
  /srv \
  --tag "${HOST_TAG}" \
  --tag "daily" \
  --exclude-file="/etc/backup/excludes.txt"

# 3) Retention policy (example):
#    keep 7 daily, 4 weekly, 12 monthly snapshots per host
restic forget --tag "${HOST_TAG}" \
  --keep-daily 7 \
  --keep-weekly 4 \
  --keep-monthly 12 \
  --prune

# 4) Lightweight verification (spot-check metadata + a small data sample)
#    Full verification is heavier; schedule it weekly/monthly.
restic check --read-data-subset=1/50

echo "[OK] Backup finished for ${HOST_TAG} on ${DATE}"
```
Encryption protects confidentiality. It does not stop a credentialed attacker from deleting your repository. Immutability and separation are the protections against deletion.
Step 6 — Keep backup policy in a config file (so it’s reviewable)
Teams often “configure backups” in a UI and never write down what they did. Treat backup policy like code: it should be readable, reviewable, and versioned.
```yaml
# Example: a readable backup policy file you can version-control (conceptual)
# Use this as a template even if your tool uses a different format.
backup_policy:
  name: "core-servers"
  schedule:
    daily: "02:15"
    weekly: "Sun 03:10"
  retention:
    daily: 7
    weekly: 4
    monthly: 12
    yearly: 3
  data_scope:
    include:
      - "/etc"
      - "/srv"
      - "/var/lib"
    exclude:
      - "/var/lib/docker"
      - "/var/tmp"
      - "**/*.iso"
  security:
    encryption: true
    repository:
      type: "object-store"
      offsite: true
      immutable: true  # enforced by storage policy, not just client settings
    identities:
      backup_operator_mfa: true
      writer_cannot_delete: true
  verification:
    smoke_restore:
      frequency: "weekly"
      target: "isolated-restore-vm"
      checks:
        - "checksum_sample"
        - "app_health_check"
    full_integrity_check:
      frequency: "monthly"
```
Step 7 — Test restores like a fire drill (and measure your real RTO)
Restore testing has two layers: technical and operational. Technical tests prove bits can be restored. Operational tests prove people can do it under time pressure.
A weekly smoke-restore (30–60 minutes)
- Restore a small sample (a directory, a DB dump)
- Validate with checksums and/or application query
- Record duration + any manual steps
- Update the runbook immediately
A quarterly full scenario (2–6 hours)
- Assume compromised admin creds
- Restore tier order (identity → DB → app)
- Practice “clean room” rebuild of one system
- Verify monitoring/logging after recovery
You can automate parts of this. The script below demonstrates a simple concept: periodically restore a random sample and verify integrity with hashes. Even if you don’t use this exact script, the pattern is the win.
```python
"""
Conceptual restore test:
- Restore a snapshot (or file sample) into an isolated directory
- Compute hashes for a sample of files
- Compare with expected hashes if you have them (or store as "golden" over time)

Adapt to your backup tool by replacing the restore command.
Run in an isolated environment, not on production hosts.
"""
import hashlib
import os
import random
import subprocess
from pathlib import Path

RESTORE_DIR = Path("/tmp/restore_test")
SAMPLE_FILES = 25
MAX_FILE_BYTES = 10 * 1024 * 1024  # skip very large files in smoke tests


def sha256_file(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()


def restore_snapshot(snapshot_id: str) -> None:
    # Example placeholder restore command. Replace for your environment/tool.
    # e.g., restic restore SNAPSHOT --target /tmp/restore_test
    subprocess.check_call(
        ["restic", "restore", snapshot_id, "--target", str(RESTORE_DIR)]
    )


def list_restored_files(root: Path) -> list[Path]:
    files: list[Path] = []
    for p in root.rglob("*"):
        if p.is_file():
            try:
                if p.stat().st_size <= MAX_FILE_BYTES:
                    files.append(p)
            except FileNotFoundError:
                # In case files are moved during listing; ignore
                continue
    return files


def main() -> None:
    RESTORE_DIR.mkdir(parents=True, exist_ok=True)

    # Pick a snapshot id (you can also select "latest" with your tool)
    snapshot_id = os.environ.get("SNAPSHOT_ID")
    if not snapshot_id:
        raise SystemExit("Set SNAPSHOT_ID env var to a snapshot you want to smoke-restore.")

    # Restore into the isolated directory
    restore_snapshot(snapshot_id)

    # Sample files and hash them
    files = list_restored_files(RESTORE_DIR)
    if not files:
        raise SystemExit("No files restored; check restore command and snapshot contents.")

    sample = random.sample(files, k=min(SAMPLE_FILES, len(files)))
    print(f"Restored files found: {len(files)} | Hashing sample: {len(sample)}")
    for p in sample:
        digest = sha256_file(p)
        print(f"{digest}  {p}")

    print("OK: smoke-restore completed (hash sample printed).")


if __name__ == "__main__":
    main()
```
Capture what each test teaches you in a restore runbook. At minimum it should cover:
- Where immutable/offsite backups live and who can access them
- Exact restore order (Tier 0 → Tier 3)
- How to rebuild identity/secrets safely (clean room assumptions)
- Validation steps (hash checks, app tests, user sign-off)
- Decision points (when to isolate, when to reimage, when to rotate keys)
Common mistakes
These are the patterns behind “we had backups but still paid” or “recovery took weeks.” Each mistake includes a practical fix you can apply without rebuilding everything.
Mistake 1 — Backups share the same admin plane as production
If production admins can delete backup history, attackers can too (once they steal credentials).
- Fix: separate backup identities, restrict console access, use MFA.
- Fix: enforce immutability/WORM server-side (not just a client checkbox).
Mistake 2 — “Snapshots = backups” with no offsite/immutable layer
Snapshots are fast, but often deletable. Ransomware makes “deletable” a deal-breaker.
- Fix: keep snapshots for speed, add an immutable offsite copy for safety.
- Fix: test that old restore points remain accessible after an “admin compromise” scenario.
Mistake 3 — No restore testing (only “job success” monitoring)
Backup logs say the job ran. They don’t say you can restore, decrypt, boot, and validate.
- Fix: weekly smoke-restore + quarterly scenario test.
- Fix: validate with app-level checks, not only file presence.
Mistake 4 — Retention is too short for “slow ransomware”
Some incidents are discovered late. If you only keep 7 days, you may only have encrypted versions left.
- Fix: keep multiple horizons: daily/weekly/monthly.
- Fix: protect longer retention in the immutable/offsite layer.
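Retention sizing becomes concrete when you compare it against an assumed attacker dwell time (time from compromise to discovery). A sketch (horizons and the dwell assumption are examples; pick numbers from your own detection capability):

```python
# Illustrative retention horizons, in days.
RETENTION_DAYS = {"daily": 7, "weekly": 28, "monthly": 365}

def covers_dwell(retention: dict[str, int], assumed_dwell_days: int) -> bool:
    """At least one horizon must reach further back than the assumed
    time-to-detect, or every surviving restore point may be post-compromise."""
    return max(retention.values()) >= assumed_dwell_days

print(covers_dwell(RETENTION_DAYS, assumed_dwell_days=45))  # → True
print(covers_dwell({"daily": 7}, assumed_dwell_days=45))    # → False
```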
Mistake 5 — Keys, secrets, and identity aren’t backed up safely
You can restore servers and still be stuck if you can’t restore certificates, secrets, or identity.
- Fix: treat secrets and identity as Tier 0; back them up with strict access controls.
- Fix: keep offline break-glass recovery materials (documented, audited).
Mistake 6 — Backup tooling runs on untrusted endpoints
If the backup agent or credentials live on compromised machines, attackers can use them.
- Fix: isolate backup infrastructure and limit outbound credentials on endpoints.
- Fix: monitor for unusual backup activity (mass deletes, unusual restore attempts).
First, make sure you have an immutable/offline copy. Second, test one restore end-to-end. Those two actions beat almost any “more backups” project.
FAQ
Do backups stop ransomware?
No—backups don’t prevent infection. Backups prevent ransomware from becoming a business-ending event by letting you restore without paying. Pair backups with prevention (patching, MFA, EDR) and detection (alerts, logging) for real resilience.
What’s the difference between “offsite” and “air-gapped” backups?
Offsite means geographically/separately hosted; air-gapped means not continuously reachable. Many “offsite” backups are still reachable via the same cloud account and credentials. Air gap is about access path and blast radius, not distance.
Are cloud backups automatically safe from ransomware?
Not automatically. Cloud storage can be deleted or overwritten if an attacker has the right permissions. The safety comes from immutability (WORM/object lock), least-privilege identities, and separating backup access from daily admin access.
How often should we test restores?
Weekly for a small smoke test and quarterly for a full scenario is a practical baseline. If your systems are high-change or high-stakes, increase frequency. The goal is to keep restore steps fresh and catch silent failures early.
How long should we keep backup retention to handle late discovery?
Use multiple horizons: daily for quick rollback, weekly/monthly for late discovery, and a longer immutable archive for worst-case scenarios. Retention should reflect how quickly you can detect compromise and how long you need for compliance.
What should we restore first after a ransomware event?
Restore Tier 0 first: identity, DNS, secrets, certificates, and core networking—then databases and critical apps. Restoring apps before identity and data usually creates a mess (and can reintroduce compromise).
The “backup question” is really: Can we restore clean systems and trusted data quickly enough to survive? Everything else is details.
Cheatsheet
A scan-fast checklist for ransomware-ready backups. Print it, paste it into a ticket, or turn it into your internal standard.
Backups that actually save you
- RPO/RTO defined per system (not “weekly for everything”)
- 3-2-1-1-0 implemented (immutable/offline + verified)
- Offsite copy in a separate blast radius
- Backup repository deletion is blocked (server-side policy)
- Backup identities separated and MFA-protected
- Restore runbooks written and tested
Restore drill: minimum viable plan
- Weekly: smoke-restore a random sample
- Monthly: full integrity check (heavier verification)
- Quarterly: scenario restore (assume compromised admin creds)
- Measure actual time, not estimated time (real RTO)
- Validate with app checks + checksums
- Update docs immediately after each test
“Before you call it done” checklist
| Question | “Good” looks like |
|---|---|
| Can an attacker with admin creds delete our last 30 days of backups? | No (immutability/WORM + separation enforced) |
| Do we have at least one offsite copy? | Yes (separate account/tenant if possible) |
| Have we restored a representative system end-to-end? | Yes (documented steps + validation) |
| Do we know restore order and dependencies? | Yes (Tier 0 → Tier 3 runbook) |
| Do we have a break-glass path if identity is down? | Yes (offline, audited, tested) |
Having “lots of backups” but no immutable layer and no restore tests is how organizations end up paying. Fix survivability and testing first; then optimize speed.
Wrap-up
Ransomware-ready backups aren’t about buying a bigger storage box. They’re about designing for an attacker who can log in. If you remember one thing: survivable backups require immutability, separation, and restore testing.
Your next actions (in order)
- Pick targets: write RPO/RTO for your top 5 systems.
- Make one copy undeletable: implement an immutable/offline backup layer.
- Split identities: separate backup access from daily admin access.
- Prove recovery: run one end-to-end restore and document the runbook.
- Repeat: schedule weekly smoke-restores and quarterly scenario tests.
If you want to level up beyond backups, the next steps are about reducing initial compromise and limiting blast radius: threat modeling your environment, hardening authentication, and building a DevSecOps pipeline that prevents risky changes. UniLab has related guides to help:
- Threat Modeling in 45 Minutes: A Lightweight Template
- Passkeys, MFA, Sessions: Modern Authentication Done Right
- DevSecOps Basics: Add Security to CI/CD Without Chaos
The end state is backups that are “boring”: automated, monitored, immutable, and routinely restored. When ransomware hits, you execute a practiced playbook instead of improvising.
Quiz