Most incidents don’t go “bad” because the team is lazy—they go bad because nobody knows what “good” looks like in the first 30 minutes. An incident response plan turns chaos into a repeatable playbook: who owns the call, how you communicate, what you do first, and how you learn after. This post gives you a practical plan structure plus a copy/paste template you can adapt in an afternoon.
Quickstart
If you do nothing else, do these steps. They create clarity fast: ownership, communication, severity, and a “first 15 minutes” routine. You can run this with a 2-person startup team or a larger org.
1) Pick an Incident Commander (IC) for every incident
One person owns coordination. Everyone else focuses on technical work. This prevents “too many cooks” and missed handoffs.
- Choose an IC rotation (on-call or “whoever is primary”)
- IC runs the timeline, decisions, and updates
- IC can delegate tasks—but keeps ownership
2) Create a simple severity matrix (S0–S3)
Severity defines urgency, required roles, and update cadence. Without it, everything becomes “urgent” and nothing is.
- Define 4 levels with concrete examples
- Set update cadence (e.g., every 15/30/60 minutes)
- Write escalation rules (who must be paged)
3) Create a single incident channel + a status template
One thread for facts and decisions. No scattered DMs. Your future self will thank you during the postmortem.
- Create #incidents (or similar) and pin the template
- Assign a scribe to capture timeline + decisions
- Decide where external updates live (status page, email, etc.)
4) Write the “first 15 minutes” checklist
The first minutes should be boring and consistent: stabilize, preserve evidence, and reduce blast radius.
- Confirm what’s happening (impact, scope, time)
- Stop the bleeding (containment)
- Preserve logs/snapshots before changes
Your first incident response plan should fit on a few pages. If it feels like policy, people won’t use it. Start small, run one tabletop exercise, then refine.
Overview
An incident response plan is a lightweight system for making good decisions under pressure. It answers the questions teams always ask mid-incident: Who’s in charge? What’s the priority? What do we say to customers? What do we do first?
What you’ll build in this post
| Plan component | What it contains | Why it matters |
|---|---|---|
| Roles & ownership | Incident Commander, Tech Lead, Comms, Scribe | Prevents coordination failure and duplicated work |
| Severity & escalation | S0–S3 definitions, paging rules, cadence | Aligns urgency and keeps leaders informed |
| Communication | Channels, status template, external update path | Stops rumor-based decisions and “lost context” |
| Runbooks | First 15 minutes checklist + common scenarios | Makes response repeatable and faster |
| Evidence & learning | Logging/snapshots, postmortems, action items | Preserves forensic data and prevents repeats |
You don’t need a dedicated security team to get real value here. Even basic clarity (who owns the call, where updates go, and what “done” means) reduces downtime, reduces risk, and reduces the emotional load on the team.
Use the same process for outages, data loss, suspicious access, credential leaks, ransomware, degraded performance, and anything that threatens confidentiality, integrity, or availability.
Core concepts
Before the template, align on a few concepts. This is the “mental model” that keeps your plan simple and usable.
Incident vs event
An event is something observable (alert fired, latency spike, suspicious login). An incident is an event (or set of events) that creates real impact or credible risk. Your plan should define when you “declare an incident” and start the incident process.
Declare an incident when…
- Customers are impacted or data may be exposed
- Core services are down/degraded beyond SLA
- You have active exploitation or high-confidence compromise
- The team needs coordinated response (not just one person debugging)
Don’t wait for certainty
The purpose of declaring is to start coordination and evidence preservation early. You can always downgrade severity later. Waiting often costs you the timeline and the logs.
Lifecycle: contain → eradicate → recover (in that order)
In security incidents especially, the “fix” is not one step. You typically move through:
- Containment: stop ongoing damage (block, isolate, revoke, rate limit, disable)
- Eradication: remove the root cause (patch, rotate secrets, remove persistence)
- Recovery: restore safe service (re-deploy, validate, monitor)
- Learning: document, fix systemic gaps, prevent recurrence
Rapid changes can overwrite logs or remove artifacts you need later. Plan for preserve-then-change: snapshot critical systems/logs (when feasible) before you rotate keys, terminate instances, or wipe machines.
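As a sketch of preserve-then-change, the script below copies key logs into a timestamped, checksummed bundle before any containment change is made. The log paths and destination directory are examples, not a standard; adapt them to your systems.

```shell
#!/usr/bin/env bash
# preserve-evidence.sh -- snapshot logs BEFORE containment changes.
# The paths and destination below are examples; adjust for your environment.
set -euo pipefail

STAMP="$(date -u +%Y%m%dT%H%M%SZ)"
DEST="${EVIDENCE_DIR:-/tmp/evidence}/$STAMP"
mkdir -p "$DEST"

# Copy, never move: the originals stay in place for live debugging.
for path in /var/log/auth.log /var/log/syslog; do
  [ -f "$path" ] && cp -p "$path" "$DEST/" || true
done

# Record a checksum manifest so you can later demonstrate integrity.
( cd "$DEST" && sha256sum * > MANIFEST.sha256 ) 2>/dev/null || true
echo "Evidence preserved under $DEST"
```

Run it before you rotate keys or terminate instances, then attach the bundle path to the incident timeline.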
Roles: one owner, many helpers
Most incident response pain is coordination pain. Roles reduce that. A minimal set looks like:
| Role | Responsibilities | Common failure if missing |
|---|---|---|
| Incident Commander (IC) | Runs the response: priorities, decisions, handoffs, updates | Everyone debugs, nobody coordinates |
| Tech Lead | Owns technical investigation and containment plan | Changes happen without a coherent strategy |
| Comms Lead | Internal/external updates, stakeholder alignment | Rumors, inconsistent messaging, angry customers |
| Scribe | Timeline, actions taken, decisions and rationale | No postmortem signal, repeated mistakes |
Severity should drive behavior
Severity isn’t a moral judgment; it’s a routing mechanism. Your severity matrix should decide who is paged, how often you update, and what “success” means.
A practical severity matrix (example)
| Level | Impact / risk | Expected response |
|---|---|---|
| S0 | Major outage or confirmed sensitive data exposure | All hands, exec notify, updates every 15 min, preserve evidence |
| S1 | Significant customer impact or active exploitation suspected | Dedicated team, updates every 30 min, strong containment focus |
| S2 | Partial degradation or localized compromise risk | On-call + support, updates every 60 min, investigate/mitigate |
| S3 | Minor issue, no customer impact, low risk | Ticket + follow-up, capture learnings, no paging beyond on-call |
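If reminder bots or paging glue need the cadence programmatically, a tiny lookup keeps tooling and the matrix in agreement. This is a sketch whose values mirror the example table above; adjust them to your own matrix.

```shell
# cadence_minutes: map a severity label to its update cadence in minutes.
# Values mirror the example matrix; 0 means "no scheduled cadence".
cadence_minutes() {
  case "$1" in
    S0) echo 15 ;;
    S1) echo 30 ;;
    S2) echo 60 ;;
    S3) echo 0 ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

Usage: `cadence_minutes S1` prints `30`, which a reminder script can feed straight into its timer.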
Step-by-step
This section walks you through creating a first incident response plan that’s realistic, lightweight, and usable. Treat this as a working document: version it, run drills, and improve it after each real incident.
Step 1 — Define scope and “what counts as an incident”
Don’t start by writing a long PDF. Start by defining the situations where you want a predictable response. You can always expand later.
- Systems: production app, databases, auth, CI/CD, cloud accounts, laptops, third-party SaaS
- Incident types: outage, data exposure, malware, credential leak, suspicious admin activity, DDoS
- Trigger: when do you “declare” and open an incident channel?
Step 2 — Assign roles and escalation paths
You can run this with a small team; what matters is explicitly assigning each role for every incident, even if one person holds multiple roles in a pinch.
Minimum viable roles
- IC: coordination + decision owner
- Tech Lead: investigation + mitigation plan
- Scribe: timeline + actions
- Comms: stakeholder + customer messaging (can be the IC in small teams)
Escalation rules
- When to page security/infra leadership (S0/S1)
- When to engage legal/compliance (data exposure, regulated data)
- When to contact vendors/cloud support (platform outages, account compromise)
- Who can approve risky mitigations (traffic blocks, feature shutdowns)
If you want a clean “template included” artifact, keep a small, config-like document with your roles, channels, severities, and cadences. Here’s a copy/paste starter you can adapt.
```yaml
# incident-response-plan.yaml (starter template)
version: 1

owners:
  primary_oncall: "oncall@company.com"
  incident_commander_rotation: "PagerDuty: IC Rotation"

channels:
  incident_room: "#incidents"
  exec_updates: "#exec-updates"
  security_room: "#security"
  external_status: "Status Page + Email"

severity:
  S0:
    definition: "Major outage or confirmed sensitive data exposure"
    notify: ["CTO", "Security Lead", "Legal (if data)", "Support Lead"]
    update_cadence_minutes: 15
  S1:
    definition: "Significant impact or suspected active exploitation"
    notify: ["Engineering Lead", "Security Lead", "Support Lead"]
    update_cadence_minutes: 30
  S2:
    definition: "Partial degradation or localized compromise risk"
    notify: ["On-call", "Service Owner"]
    update_cadence_minutes: 60
  S3:
    definition: "Minor issue, no customer impact"
    notify: ["Service Owner"]
    update_cadence_minutes: 0  # 0 = no scheduled cadence; update as needed

roles:
  incident_commander:
    responsibilities:
      - "Declare incident + assign severity"
      - "Coordinate people, decisions, and updates"
      - "Keep a single source of truth"
  tech_lead:
    responsibilities:
      - "Investigate root cause and propose mitigations"
      - "Run containment/eradication/recovery plan"
  scribe:
    responsibilities:
      - "Capture timeline, actions, decisions, and links"
  comms_lead:
    responsibilities:
      - "Draft customer/stakeholder updates"
      - "Ensure messaging is consistent and approved"

first_15_minutes:
  - "Open incident channel; assign IC + scribe + tech lead"
  - "State impact, suspected scope, and start time"
  - "Preserve logs/snapshots before big changes"
  - "Contain: revoke/disable/isolate as needed"
  - "Set next update time"
```
Put this in a repo next to your infrastructure docs (or an internal wiki) and treat changes like code: review them, track versions, and announce updates to the team.
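Treating the plan like code also means checking it in CI. The sketch below is a deliberately cheap sanity check: it greps for the starter template’s top-level keys rather than parsing YAML, so it needs no dependencies. The file name and key list are assumptions taken from the template above.

```shell
# check_irp: verify the starter template's top-level keys exist in a plan file.
# Intentionally grep-based (no YAML parser) so it runs anywhere; the key list
# is an assumption matching the starter template above.
check_irp() {
  local plan="$1" key missing=0
  for key in owners channels severity roles first_15_minutes; do
    grep -q "^${key}:" "$plan" || { echo "missing top-level key: $key" >&2; missing=1; }
  done
  return "$missing"
}
```

Wire it into CI as `check_irp incident-response-plan.yaml` so a refactor can’t silently drop a section.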
Step 3 — Create your incident communications routine
During an incident, communication is a technical task: it prevents duplicate work, it aligns stakeholders, and it reduces panic. Your plan should define where updates go and what the update format is.
Internal update format (recommended)
- What happened: factual summary (no speculation)
- Impact: who/what is affected, how bad
- What we’re doing: current mitigation steps
- Next update: a specific timestamp
- Asks: who needs to help, what is needed
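The format above can be enforced gently with a formatter: a tiny helper that always prints the five fields in the same order, so no responder has to remember the shape mid-incident. This is a sketch; the labels are taken from the list above and the arguments are free text.

```shell
# post_update: print an internal status update in the fixed five-field shape.
# Arguments: what-happened, impact, actions, next-update time, asks.
post_update() {
  printf 'WHAT HAPPENED: %s\n'     "$1"
  printf 'IMPACT: %s\n'            "$2"
  printf "WHAT WE'RE DOING: %s\n"  "$3"
  printf 'NEXT UPDATE: %s\n'       "$4"
  printf 'ASKS: %s\n'              "$5"
}
```

Usage: `post_update "Elevated 5xx on API" "EU customers, partial" "Rolling back latest deploy" "14:45 UTC" "Need DB owner online"`, then paste the output into the incident channel.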
External update principles
- Be accurate before being fast (but don’t go silent)
- State what you know, what you’re investigating, and ETA for next update
- Avoid technical blame or detailed exploit info mid-incident
- Use a single publishing path (status page, email) with approvals
If you publish updates via an API (status system, incident tooling, internal dashboards), standardize the payload. That way the Comms Lead doesn’t invent a new format every time.
```json
{
  "incident_id": "INC-2026-00123",
  "severity": "S1",
  "status": "investigating",
  "summary": "We are investigating elevated 5xx errors affecting the API and dashboard.",
  "impact": {
    "customers_affected": "some",
    "regions": ["eu-central", "us-east"],
    "start_time_utc": "2026-01-09T13:58:00Z"
  },
  "current_actions": [
    "Mitigating by rolling back the latest deployment",
    "Increasing rate limits on a safe endpoint to reduce load"
  ],
  "next_update_utc": "2026-01-09T14:45:00Z",
  "links": {
    "incident_channel": "#incidents",
    "dashboard": "https://monitoring.example.internal/d/abc123"
  }
}
```
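A hedged sketch of the publishing side: wrap the POST in a function with a dry-run mode, which is useful in drills. The endpoint, token variable, and payload file name are assumptions; substitute your status tool’s real API and auth scheme.

```shell
# publish_update: POST a status payload file to the status API.
# With DRY_RUN=1 it only prints the command -- handy for tabletop drills.
# STATUS_API and STATUS_TOKEN are ASSUMPTIONS; use your tool's real values.
publish_update() {
  local api="${STATUS_API:-https://status.example.internal/api/incidents}"
  local cmd=(curl -fsS -X POST "$api"
             -H "Authorization: Bearer ${STATUS_TOKEN:-}"
             -H "Content-Type: application/json"
             --data @"$1")
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "would run: ${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

Usage: `DRY_RUN=1 publish_update update.json` in a drill; drop `DRY_RUN` for the real call.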
Step 4 — Write the “first 15 minutes” runbook (stabilize first)
Your first 15 minutes are about stopping the bleeding and preserving evidence. Deep root-cause work can come later, once the situation is stable and the team is organized.
First 15 minutes checklist (printable)
- Declare: open incident channel, assign IC + Tech Lead + Scribe
- Assess: impact, scope, start time, and “what changed” recently
- Preserve: save key logs/snapshots; avoid “wipe and hope”
- Contain: block/disable/isolate/revoke to stop damage
- Communicate: first update + next update time
- Decide: short-term mitigation strategy (rollback, feature flag, key rotation)
For security incidents, you often need quick triage commands to answer basic questions: what changed, who logged in, what processes are running, and what network connections exist. Here’s a minimal, safe starter you can adapt to your environment.
```bash
#!/usr/bin/env bash
# first-15-min-triage.sh
# Goal: collect quick context without destroying evidence.
# Run as appropriate for your environment and permissions.
set -euo pipefail

echo "[*] Timestamp"
date -u

echo "[*] Basic system info"
uname -a || true
uptime || true

echo "[*] Recent logins (Linux)"
who || true
last -n 20 || true

echo "[*] Running processes (top offenders first)"
ps aux --sort=-%cpu | head -n 15 || true

echo "[*] Network connections and listening ports"
ss -tulpn || netstat -tulpn || true

echo "[*] Recent auth/system logs (best effort)"
journalctl -n 200 --no-pager || true

echo "[*] Container context (if present)"
docker ps 2>/dev/null || true
kubectl get pods -A 2>/dev/null || true

echo "[*] NOTE: Preserve logs/snapshots before rebooting or terminating instances."
```
Blocking IPs, disabling accounts, rotating keys, or shutting down services can stop damage—but it can also impact legitimate users. Your plan should define who can approve high-impact mitigations and how you communicate the trade-off.
Step 5 — Build a small set of scenario runbooks
Don’t try to runbook everything. Start with 3–6 scenarios that match your risks and history. Each runbook should fit on one page and include: detection signals, containment options, verification steps, and rollback plans.
Good starter scenarios
- Suspicious admin login / cloud account compromise
- Credential leak (API key, database password, OAuth client)
- Ransomware / malware on a workstation
- Data exposure via misconfigured bucket or access policy
- Production outage after deployment
- DDoS or abusive traffic spike
What each runbook should answer
- How do we confirm it’s real (signals)?
- What’s the safest containment step?
- How do we know containment worked?
- What evidence should we preserve?
- When do we escalate to legal/compliance/vendors?
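As one worked example, here is a runbook-sized containment sketch for a leaked AWS access key. The `aws iam update-access-key` call is the real CLI command for deactivating a key (deactivating rather than deleting preserves evidence); the user and key values are placeholders, and the function defaults to a dry run so it is safe to rehearse.

```shell
# contain_leaked_aws_key: deactivate (not delete) a leaked AWS access key.
# Defaults to DRY_RUN=1 so the runbook can be rehearsed safely; set DRY_RUN=0
# to execute. User name and key id below are placeholders.
contain_leaked_aws_key() {
  local user="$1" key_id="$2"
  local cmd=(aws iam update-access-key
             --user-name "$user"
             --access-key-id "$key_id"
             --status Inactive)
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "plan: ${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

In a drill, `contain_leaked_aws_key deploy-bot AKIAEXAMPLE` prints the plan; in a real incident you follow it with key rotation and a check that nothing legitimate still depended on the old key.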
Step 6 — Define “done” and post-incident learning
“Incident resolved” should mean more than “alerts stopped.” Define completion criteria so you don’t miss the hard parts (like key rotation, customer communication, and long-term fixes).
Resolution criteria (example)
- Impact eliminated and stable for a defined window (e.g., 30–60 minutes)
- Containment actions verified (no ongoing suspicious activity)
- Temporary mitigations documented (feature flags, blocks) with owners to remove later
- Customer/stakeholder updates sent (when applicable)
- Postmortem scheduled within 48–72 hours
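The “stable for a defined window” criterion can be checked mechanically. Below is a minimal sketch: it succeeds only if a health probe passes a given number of consecutive times. The probe command and interval are placeholders for your real monitoring.

```shell
# stable_for: succeed only if "$check" passes N consecutive times,
# $SLEEP seconds apart (default 60). Example probe (an assumption):
#   stable_for 30 'curl -fsS https://api.example.com/health >/dev/null'
stable_for() {
  local n="$1" check="$2" i
  for ((i = 0; i < n; i++)); do
    eval "$check" || return 1
    sleep "${SLEEP:-60}"
  done
  return 0
}
```

With `SLEEP=60` and `n=30` this is roughly the 30-minute stability window from the criteria above; any single failure resets the clock when you re-run it.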
Focus on systemic improvements: gaps in logging, unclear ownership, missing alerts, risky defaults, weak reviews. Assign action items with owners and due dates. That’s how incident response improves over time.
Common mistakes
These are the patterns behind “we handled it, but it felt terrible” or “we fixed it, but we’re not sure what happened.” Each mistake includes a practical fix you can apply without adding bureaucracy.
Mistake 1 — No single owner (everyone is “helping”)
When nobody owns coordination, updates stop, tasks duplicate, and decisions drift.
- Fix: assign an Incident Commander every time.
- Fix: the IC runs the update cadence: “what we know / what we’re doing / next update.”
Mistake 2 — Severity is vibes, not a matrix
If severity isn’t defined, you’ll under-react to real risk and over-react to minor issues.
- Fix: define S0–S3 with examples and behaviors (paging + cadence).
- Fix: allow upgrading/downgrading with a short note in the timeline.
Mistake 3 — Scattered communication (DMs everywhere)
Important facts get lost, decisions are repeated, and new responders can’t catch up.
- Fix: one incident channel + pinned status template.
- Fix: assign a scribe and keep a running timeline.
Mistake 4 — “Fix first, preserve later”
Rapid changes can overwrite logs and destroy forensic artifacts you need to prove what happened.
- Fix: add “preserve evidence” to the first 15 minutes checklist.
- Fix: snapshot/collect logs before reboots, wipes, or instance termination when feasible.
Mistake 5 — No “done” definition (incidents linger)
You stop paging, but the risky temporary mitigations remain for weeks.
- Fix: define resolution criteria and capture temporary changes as action items.
- Fix: schedule the postmortem before you close the incident.
Mistake 6 — Postmortems that produce no change
A meeting without concrete actions is just storytelling.
- Fix: 3–7 action items max, each with owner + due date.
- Fix: prioritize fixes that reduce time-to-detect and time-to-contain.
If a new team member can’t join an incident and understand the state within 5 minutes by reading the channel, your process needs a stronger timeline and status format.
FAQ
What should an incident response plan include at minimum?
At minimum: roles (IC + Tech Lead + Scribe), a severity matrix, a single communication channel, a first 15 minutes checklist, and a postmortem process. If you have those, you can respond consistently and improve over time.
Who should be the Incident Commander?
The IC should be someone who can stay calm, coordinate people, and make decisions—not necessarily the most senior engineer. In many teams, the on-call engineer starts as IC and can hand off to another IC as the incident grows.
How do I choose severity levels without overcomplicating it?
Use 4 levels (S0–S3) and define them by impact and risk, then attach behaviors: who gets paged, how often you update, and whether exec/legal are notified. Keep examples in the doc so people can classify quickly.
How often should we send updates during an incident?
Tie cadence to severity. A common approach: S0 every 15 minutes, S1 every 30 minutes, S2 hourly, S3 as needed. Even if the update is “still investigating,” the cadence reduces anxiety and prevents stakeholders from interrupting the team.
Should we always rotate credentials during a suspected breach?
Not always immediately. Rotating secrets can stop an attacker, but it can also break systems and erase traces. A good plan balances speed and evidence: preserve critical logs/snapshots when feasible, then rotate the highest-risk secrets first (admin accounts, API keys, CI/CD credentials), and verify that the rotation actually took effect.
How do we practice incident response without real incidents?
Run a tabletop exercise every quarter: pick one scenario, simulate the first hour, and practice declaring severity, writing updates, making containment decisions, and capturing a timeline. The output should be plan improvements and a cleaner checklist.
What’s the best format: doc, wiki, or repo?
Use whatever your team will actually open during an incident. Many teams keep a short wiki page for accessibility and a versioned repo for templates/runbooks. The key is that the plan is easy to find, easy to update, and used in drills.
Cheatsheet
Use this as a fast reference. Pin it in your incident channel, print it, or put it in your on-call runbook.
Declare & organize (2 minutes)
- Open incident channel and name the incident
- Assign roles: IC, Tech Lead, Scribe, Comms
- Set severity (S0–S3) and next update time
- State impact and known symptoms (facts only)
First 15 minutes (stabilize)
- Confirm scope: what’s broken, who’s affected, since when
- Preserve evidence: logs/snapshots before big changes
- Contain: isolate, revoke, block, disable, roll back
- Communicate: update cadence and stakeholder routing
Internal update template
- Status: investigating / mitigating / monitoring / resolved
- Impact: what users see, what services affected
- Hypothesis: (optional) labeled as hypothesis
- Actions: what changed, what’s next
- Next update: exact time
Resolution & learning
- Verify stability for a defined window
- Document temporary mitigations and assign cleanup
- Send final update (internal/external as needed)
- Schedule postmortem (48–72h) and track action items
Severity quick map (copy into the plan)
| Severity | When to use | Cadence | Escalation |
|---|---|---|---|
| S0 | Major outage or confirmed sensitive data exposure | 15 min | Exec + Security + Legal (as needed) |
| S1 | Significant impact or suspected active exploitation | 30 min | Engineering lead + Security + Support lead |
| S2 | Partial degradation or localized compromise risk | 60 min | On-call + service owner |
| S3 | Minor issue, low risk | As needed | Service owner |
Put a single “Incident Response” link in your docs homepage, pin it in your on-call channel, and include it in onboarding. The best plan is the one people can find in 10 seconds.
Wrap-up
A good incident response plan doesn’t make incidents disappear—it makes your response predictable. That predictability is what reduces downtime, limits damage, and protects the humans doing the work.
Your next best steps:
- Copy the YAML starter template and fill in your real roles/channels.
- Write your severity matrix and pin it where on-call lives.
- Run a 30-minute tabletop exercise this week using one realistic scenario.
- After your next incident, update the plan with what you learned (especially edge cases and comms).
If your plan feels heavy, it’s too heavy. Your first version should be short enough to read during an incident. The goal is adoption—not perfection.