Most incidents don’t go “bad” because the team is lazy—they go bad because nobody knows what “good” looks like in the first 30 minutes. An incident response plan turns chaos into a repeatable playbook: who owns the call, how you communicate, what you do first, and how you learn after. This post gives you a practical plan structure plus a copy/paste template you can adapt in an afternoon.
Quickstart
If you do nothing else, do these steps. They create clarity fast: ownership, communication, severity, and a “first 15 minutes” routine. You can run this with a 2-person startup team or a larger org.
1) Pick an Incident Commander (IC) for every incident
One person owns coordination. Everyone else focuses on technical work. This prevents “too many cooks” and missed handoffs.
- Choose an IC rotation (on-call or “whoever is primary”)
- IC runs the timeline, decisions, and updates
- IC can delegate tasks—but keeps ownership
2) Create a simple severity matrix (S0–S3)
Severity defines urgency, required roles, and update cadence. Without it, everything becomes “urgent” and nothing is.
- Define 4 levels with concrete examples
- Set update cadence (e.g., every 15/30/60 minutes)
- Write escalation rules (who must be paged)
3) Create a single incident channel + a status template
One thread for facts and decisions. No scattered DMs. Your future self will thank you during the postmortem.
- Create #incidents (or similar) and pin the template
- Assign a scribe to capture timeline + decisions
- Decide where external updates live (status page, email, etc.)
4) Write the “first 15 minutes” checklist
The first minutes should be boring and consistent: stabilize, preserve evidence, and reduce blast radius.
- Confirm what’s happening (impact, scope, time)
- Stop the bleeding (containment)
- Preserve logs/snapshots before changes
Your first incident response plan should fit on a few pages. If it feels like policy, people won’t use it. Start small, run one tabletop exercise, then refine.
Overview
An incident response plan is a lightweight system for making good decisions under pressure. It answers the questions teams always ask mid-incident: Who’s in charge? What’s the priority? What do we say to customers? What do we do first?
What you’ll build in this post
| Plan component | What it contains | Why it matters |
|---|---|---|
| Roles & ownership | Incident Commander, Tech Lead, Comms, Scribe | Prevents coordination failure and duplicated work |
| Severity & escalation | S0–S3 definitions, paging rules, cadence | Aligns urgency and keeps leaders informed |
| Communication | Channels, status template, external update path | Stops rumor-based decisions and “lost context” |
| Runbooks | First 15 minutes checklist + common scenarios | Makes response repeatable and faster |
| Evidence & learning | Logging/snapshots, postmortems, action items | Preserves forensic data and prevents repeats |
You don’t need a dedicated security team to get real value here. Even basic clarity (who owns the call, where updates go, and what “done” means) reduces downtime, reduces risk, and reduces the emotional load on the team.
Use the same process for outages, data loss, suspicious access, credential leaks, ransomware, degraded performance, and anything that threatens confidentiality, integrity, or availability.
Core concepts
Before the template, align on a few concepts. This is the “mental model” that keeps your plan simple and usable.
Incident vs event
An event is something observable (alert fired, latency spike, suspicious login). An incident is an event (or set of events) that creates real impact or credible risk. Your plan should define when you “declare an incident” and start the incident process.
Declare an incident when…
- Customers are impacted or data may be exposed
- Core services are down/degraded beyond SLA
- You have active exploitation or high-confidence compromise
- The team needs coordinated response (not just one person debugging)
Don’t wait for certainty
The purpose of declaring is to start coordination and evidence preservation early. You can always downgrade severity later. Waiting often costs you the timeline and the logs.
Lifecycle: contain → eradicate → recover (in that order)
In security incidents especially, the “fix” is not one step. You typically move through:
- Containment: stop ongoing damage (block, isolate, revoke, rate limit, disable)
- Eradication: remove the root cause (patch, rotate secrets, remove persistence)
- Recovery: restore safe service (re-deploy, validate, monitor)
- Learning: document, fix systemic gaps, prevent recurrence
Rapid changes can overwrite logs or remove artifacts you need later. Plan for preserve-then-change: snapshot critical systems/logs (when feasible) before you rotate keys, terminate instances, or wipe machines.
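As a sketch of preserve-then-change, the script below copies key logs into a timestamped, checksummed bundle before any containment change is made. The log paths and destination directory are examples, not a standard; adapt them to your systems.

```shell
#!/usr/bin/env bash
# preserve-evidence.sh -- snapshot logs BEFORE containment changes.
# The paths and destination below are examples; adjust for your environment.
set -euo pipefail

STAMP="$(date -u +%Y%m%dT%H%M%SZ)"
DEST="${EVIDENCE_DIR:-/tmp/evidence}/$STAMP"
mkdir -p "$DEST"

# Copy, never move: the originals stay in place for live debugging.
for path in /var/log/auth.log /var/log/syslog; do
  [ -f "$path" ] && cp -p "$path" "$DEST/" || true
done

# Record a checksum manifest so you can later demonstrate integrity.
( cd "$DEST" && sha256sum * > MANIFEST.sha256 ) 2>/dev/null || true
echo "Evidence preserved under $DEST"
```

Run it before you rotate keys or terminate instances, then attach the bundle path to the incident timeline.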
Roles: one owner, many helpers
Most incident response pain is coordination pain. Roles reduce that. A minimal set looks like:
| Role | Responsibilities | Common failure if missing |
|---|---|---|
| Incident Commander (IC) | Runs the response: priorities, decisions, handoffs, updates | Everyone debugs, nobody coordinates |
| Tech Lead | Owns technical investigation and containment plan | Changes happen without a coherent strategy |
| Comms Lead | Internal/external updates, stakeholder alignment | Rumors, inconsistent messaging, angry customers |
| Scribe | Timeline, actions taken, decisions and rationale | No postmortem signal, repeated mistakes |
Severity should drive behavior
Severity isn’t a moral judgment; it’s a routing mechanism. Your severity matrix should decide who is paged, how often you update, and what “success” means.
A practical severity matrix (example)
| Level | Impact / risk | Expected response |
|---|---|---|
| S0 | Major outage or confirmed sensitive data exposure | All hands, exec notify, updates every 15 min, preserve evidence |
| S1 | Significant customer impact or active exploitation suspected | Dedicated team, updates every 30 min, strong containment focus |
| S2 | Partial degradation or localized compromise risk | On-call + support, updates every 60 min, investigate/mitigate |
| S3 | Minor issue, no customer impact, low risk | Ticket + follow-up, capture learnings, no paging beyond on-call |
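If reminder bots or paging glue need the cadence programmatically, a tiny lookup keeps tooling and the matrix in agreement. This is a sketch whose values mirror the example table above; adjust them to your own matrix.

```shell
# cadence_minutes: map a severity label to its update cadence in minutes.
# Values mirror the example matrix; 0 means "no scheduled cadence".
cadence_minutes() {
  case "$1" in
    S0) echo 15 ;;
    S1) echo 30 ;;
    S2) echo 60 ;;
    S3) echo 0 ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

Usage: `cadence_minutes S1` prints `30`, which a reminder script can feed straight into its timer.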
Step-by-step
This section walks you through creating a first incident response plan that’s realistic, lightweight, and usable. Treat this as a working document: version it, run drills, and improve it after each real incident.
Step 1 — Define scope and “what counts as an incident”
Don’t start by writing a long PDF. Start by defining the situations where you want a predictable response. You can always expand later.
- Systems: production app, databases, auth, CI/CD, cloud accounts, laptops, third-party SaaS
- Incident types: outage, data exposure, malware, credential leak, suspicious admin activity, DDoS
- Trigger: when do you “declare” and open an incident channel?
Step 2 — Assign roles and escalation paths
You can run this with a small team; what matters is explicitly assigning each role for every incident, even if one person holds multiple roles in a pinch.
Minimum viable roles
- IC: coordination + decision owner
- Tech Lead: investigation + mitigation plan
- Scribe: timeline + actions
- Comms: stakeholder + customer messaging (can be the IC in small teams)
Escalation rules
- When to page security/infra leadership (S0/S1)
- When to engage legal/compliance (data exposure, regulated data)
- When to contact vendors/cloud support (platform outages, account compromise)
- Who can approve risky mitigations (traffic blocks, feature shutdowns)
If you want a clean “template included” artifact, keep a small, config-like document with your roles, channels, severities, and cadences. Here’s a copy/paste starter you can adapt.
```yaml
# incident-response-plan.yaml (starter template)
version: 1

owners:
  primary_oncall: "oncall@company.com"
  incident_commander_rotation: "PagerDuty: IC Rotation"

channels:
  incident_room: "#incidents"
  exec_updates: "#exec-updates"
  security_room: "#security"
  external_status: "Status Page + Email"

severity:
  S0:
    definition: "Major outage or confirmed sensitive data exposure"
    notify: ["CTO", "Security Lead", "Legal (if data)", "Support Lead"]
    update_cadence_minutes: 15
  S1:
    definition: "Significant impact or suspected active exploitation"
    notify: ["Engineering Lead", "Security Lead", "Support Lead"]
    update_cadence_minutes: 30
  S2:
    definition: "Partial degradation or localized compromise risk"
    notify: ["On-call", "Service Owner"]
    update_cadence_minutes: 60
  S3:
    definition: "Minor issue, no customer impact"
    notify: ["Service Owner"]
    update_cadence_minutes: 0  # 0 = no scheduled cadence; update as needed

roles:
  incident_commander:
    responsibilities:
      - "Declare incident + assign severity"
      - "Coordinate people, decisions, and updates"
      - "Keep a single source of truth"
  tech_lead:
    responsibilities:
      - "Investigate root cause and propose mitigations"
      - "Run containment/eradication/recovery plan"
  scribe:
    responsibilities:
      - "Capture timeline, actions, decisions, and links"
  comms_lead:
    responsibilities:
      - "Draft customer/stakeholder updates"
      - "Ensure messaging is consistent and approved"

first_15_minutes:
  - "Open incident channel; assign IC + scribe + tech lead"
  - "State impact, suspected scope, and start time"
  - "Preserve logs/snapshots before big changes"
  - "Contain: revoke/disable/isolate as needed"
  - "Set next update time"
```
Put this in a repo next to your infrastructure docs (or an internal wiki) and treat changes like code: review them, track versions, and announce updates to the team.
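Treating the plan like code also means checking it in CI. The sketch below is a deliberately cheap sanity check: it greps for the starter template’s top-level keys rather than parsing YAML, so it needs no dependencies. The file name and key list are assumptions taken from the template above.

```shell
# check_irp: verify the starter template's top-level keys exist in a plan file.
# Intentionally grep-based (no YAML parser) so it runs anywhere; the key list
# is an assumption matching the starter template above.
check_irp() {
  local plan="$1" key missing=0
  for key in owners channels severity roles first_15_minutes; do
    grep -q "^${key}:" "$plan" || { echo "missing top-level key: $key" >&2; missing=1; }
  done
  return "$missing"
}
```

Wire it into CI as `check_irp incident-response-plan.yaml` so a refactor can’t silently drop a section.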
Step 3 — Create your incident communications routine
During an incident, communication is a technical task: it prevents duplicate work, it aligns stakeholders, and it reduces panic. Your plan should define where updates go and what the update format is.
Internal update format (recommended)
- What happened: factual summary (no speculation)
- Impact: who/what is affected, how bad
- What we’re doing: current mitigation steps
- Next update: a specific timestamp
- Asks: who needs to help, what is needed
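The format above can be enforced gently with a formatter: a tiny helper that always prints the five fields in the same order, so no responder has to remember the shape mid-incident. This is a sketch; the labels are taken from the list above and the arguments are free text.

```shell
# post_update: print an internal status update in the fixed five-field shape.
# Arguments: what-happened, impact, actions, next-update time, asks.
post_update() {
  printf 'WHAT HAPPENED: %s\n'     "$1"
  printf 'IMPACT: %s\n'            "$2"
  printf "WHAT WE'RE DOING: %s\n"  "$3"
  printf 'NEXT UPDATE: %s\n'       "$4"
  printf 'ASKS: %s\n'              "$5"
}
```

Usage: `post_update "Elevated 5xx on API" "EU customers, partial" "Rolling back latest deploy" "14:45 UTC" "Need DB owner online"`, then paste the output into the incident channel.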
External update principles
- Be accurate before being fast (but don’t go silent)
- State what you know, what you’re investigating, and ETA for next update
- Avoid technical blame or detailed exploit info mid-incident
- Use a single publishing path (status page, email) with approvals
If you publish updates via an API (status system, incident tooling, internal dashboards), standardize the payload. That way the Comms Lead doesn’t invent a new format every time.
```json
{
  "incident_id": "INC-2026-00123",
  "severity": "S1",
  "status": "investigating",
  "summary": "We are investigating elevated 5xx errors affecting the API and dashboard.",
  "impact": {
    "customers_affected": "some",
    "regions": ["eu-central", "us-east"],
    "start_time_utc": "2026-01-09T13:58:00Z"
  },
  "current_actions": [
    "Mitigating by rolling back the latest deployment",
    "Increasing rate limits on a safe endpoint to reduce load"
  ],
  "next_update_utc": "2026-01-09T14:45:00Z",
  "links": {
    "incident_channel": "#incidents",
    "dashboard": "https://monitoring.example.internal/d/abc123"
  }
}
```
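A hedged sketch of the publishing side: wrap the POST in a function with a dry-run mode, which is useful in drills. The endpoint, token variable, and payload file name are assumptions; substitute your status tool’s real API and auth scheme.

```shell
# publish_update: POST a status payload file to the status API.
# With DRY_RUN=1 it only prints the command -- handy for tabletop drills.
# STATUS_API and STATUS_TOKEN are ASSUMPTIONS; use your tool's real values.
publish_update() {
  local api="${STATUS_API:-https://status.example.internal/api/incidents}"
  local cmd=(curl -fsS -X POST "$api"
             -H "Authorization: Bearer ${STATUS_TOKEN:-}"
             -H "Content-Type: application/json"
             --data @"$1")
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "would run: ${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

Usage: `DRY_RUN=1 publish_update update.json` in a drill; drop `DRY_RUN` for the real call.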
Step 4 — Write the “first 15 minutes” runbook (stabilize first)
Your first 15 minutes are about stopping the bleeding and preserving evidence. Deep root-cause work can come later, once the situation is stable and the team is organized.
First 15 minutes checklist (printable)
- Declare: open incident channel, assign IC + Tech Lead + Scribe
- Assess: impact, scope, start time, and “what changed” recently
- Preserve: save key logs/snapshots; avoid “wipe and hope”
- Contain: block/disable/isolate/revoke to stop damage
- Communicate: first update + next update time
- Decide: short-term mitigation strategy (rollback, feature flag, key rotation)
For security incidents, you often need quick triage commands to answer basic questions: what changed, who logged in, what processes are running, and what network connections exist. Here’s a minimal, safe starter you can adapt to your environment.
```bash
#!/usr/bin/env bash
# first-15-min-triage.sh
# Goal: collect quick context without destroying evidence.
# Run as appropriate for your environment and permissions.
set -euo pipefail

echo "[*] Timestamp"
date -u

echo "[*] Basic system info"
uname -a || true
uptime || true

echo "[*] Recent logins (Linux)"
who || true
last -n 20 || true

echo "[*] Running processes (top offenders first)"
ps aux --sort=-%cpu | head -n 15 || true

echo "[*] Network connections and listening ports"
ss -tulpn || netstat -tulpn || true

echo "[*] Recent auth/system logs (best effort)"
journalctl -n 200 --no-pager || true

echo "[*] Container context (if present)"
docker ps 2>/dev/null || true
kubectl get pods -A 2>/dev/null || true

echo "[*] NOTE: Preserve logs/snapshots before rebooting or terminating instances."
```
Blocking IPs, disabling accounts, rotating keys, or shutting down services can stop damage—but it can also impact legitimate users. Your plan should define who can approve high-impact mitigations and how you communicate the trade-off.
Step 5 — Build a small set of scenario runbooks
Don’t try to runbook everything. Start with 3–6 scenarios that match your risks and history. Each runbook should fit on one page and include: detection signals, containment options, verification steps, and rollback plans.
Good starter scenarios
- Suspicious admin login / cloud account compromise
- Credential leak (API key, database password, OAuth client)
- Ransomware / malware on a workstation
- Data exposure via misconfigured bucket or access policy
- Production outage after deployment
- DDoS or abusive traffic spike
What each runbook should answer
- How do we confirm it’s real (signals)?
- What’s the safest containment step?
- How do we know containment worked?
- What evidence should we preserve?
- When do we escalate to legal/compliance/vendors?
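As one worked example, here is a runbook-sized containment sketch for a leaked AWS access key. The `aws iam update-access-key` call is the real CLI command for deactivating a key (deactivating rather than deleting preserves evidence); the user and key values are placeholders, and the function defaults to a dry run so it is safe to rehearse.

```shell
# contain_leaked_aws_key: deactivate (not delete) a leaked AWS access key.
# Defaults to DRY_RUN=1 so the runbook can be rehearsed safely; set DRY_RUN=0
# to execute. User name and key id below are placeholders.
contain_leaked_aws_key() {
  local user="$1" key_id="$2"
  local cmd=(aws iam update-access-key
             --user-name "$user"
             --access-key-id "$key_id"
             --status Inactive)
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "plan: ${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

In a drill, `contain_leaked_aws_key deploy-bot AKIAEXAMPLE` prints the plan; in a real incident you follow it with key rotation and a check that nothing legitimate still depended on the old key.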
Step 6 — Define “done” and post-incident learning
“Incident resolved” should mean more than “alerts stopped.” Define completion criteria so you don’t miss the hard parts (like key rotation, customer communication, and long-term fixes).
Resolution criteria (example)
- Impact eliminated and stable for a defined window (e.g., 30–60 minutes)
- Containment actions verified (no ongoing suspicious activity)
- Temporary mitigations documented (feature flags, blocks) with owners to remove later
- Customer/stakeholder updates sent (when applicable)
- Postmortem scheduled within 48–72 hours
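The “stable for a defined window” criterion can be checked mechanically. Below is a minimal sketch: it succeeds only if a health probe passes a given number of consecutive times. The probe command and interval are placeholders for your real monitoring.

```shell
# stable_for: succeed only if "$check" passes N consecutive times,
# $SLEEP seconds apart (default 60). Example probe (an assumption):
#   stable_for 30 'curl -fsS https://api.example.com/health >/dev/null'
stable_for() {
  local n="$1" check="$2" i
  for ((i = 0; i < n; i++)); do
    eval "$check" || return 1
    sleep "${SLEEP:-60}"
  done
  return 0
}
```

With `SLEEP=60` and `n=30` this is roughly the 30-minute stability window from the criteria above; any single failure resets the clock when you re-run it.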
Focus on systemic improvements: gaps in logging, unclear ownership, missing alerts, risky defaults, weak reviews. Assign action items with owners and due dates. That’s how incident response improves over time.
Common mistakes
These are the patterns behind “we handled it, but it felt terrible” or “we fixed it, but we’re not sure what happened.” Each mistake includes a practical fix you can apply without adding bureaucracy.
Mistake 1 — No single owner (everyone is “helping”)
When nobody owns coordination, updates stop, tasks duplicate, and decisions drift.
- Fix: assign an Incident Commander every time.
- Fix: the IC runs the update cadence: “what we know / what we’re doing / next update.”
Mistake 2 — Severity is vibes, not a matrix
If severity isn’t defined, you’ll under-react to real risk and over-react to minor issues.
- Fix: define S0–S3 with examples and behaviors (paging + cadence).
- Fix: allow upgrading/downgrading with a short note in the timeline.
Mistake 3 — Scattered communication (DMs everywhere)
Important facts get lost, decisions are repeated, and new responders can’t catch up.
- Fix: one incident channel + pinned status template.
- Fix: assign a scribe and keep a running timeline.
Mistake 4 — “Fix first, preserve later”
Rapid changes can overwrite logs and destroy forensic artifacts you need to prove what happened.
- Fix: add “preserve evidence” to the first 15 minutes checklist.
- Fix: snapshot/collect logs before reboots, wipes, or instance termination when feasible.
Mistake 5 — No “done” definition (incidents linger)
You stop paging, but the risky temporary mitigations remain for weeks.
- Fix: define resolution criteria and capture temporary changes as action items.
- Fix: schedule the postmortem before you close the incident.
Mistake 6 — Postmortems that produce no change
A meeting without concrete actions is just storytelling.
- Fix: 3–7 action items max, each with owner + due date.
- Fix: prioritize fixes that reduce time-to-detect and time-to-contain.
If a new team member can’t join an incident and understand the state within 5 minutes by reading the channel, your process needs a stronger timeline and status format.
FAQ
What should an incident response plan include at minimum?
At minimum: roles (IC + Tech Lead + Scribe), a severity matrix, a single communication channel, a first 15 minutes checklist, and a postmortem process. If you have those, you can respond consistently and improve over time.
Who should be the Incident Commander?
The IC should be someone who can stay calm, coordinate people, and make decisions—not necessarily the most senior engineer. In many teams, the on-call engineer starts as IC and can hand off to another IC as the incident grows.
How do I choose severity levels without overcomplicating it?
Use 4 levels (S0–S3) and define them by impact and risk, then attach behaviors: who gets paged, how often you update, and whether exec/legal are notified. Keep examples in the doc so people can classify quickly.
How often should we send updates during an incident?
Tie cadence to severity. A common approach: S0 every 15 minutes, S1 every 30 minutes, S2 hourly, S3 as needed. Even if the update is “still investigating,” the cadence reduces anxiety and prevents stakeholders from interrupting the team.
Should we always rotate credentials during a suspected breach?
Not always immediately. Rotating secrets can stop an attacker, but it can also break systems and erase traces. A good plan balances speed and evidence: preserve critical logs/snapshots when feasible, then rotate the highest-risk secrets first (admin accounts, API keys, CI/CD credentials), and verify that the rotation actually took effect.
How do we practice incident response without real incidents?
Run a tabletop exercise every quarter: pick one scenario, simulate the first hour, and practice declaring severity, writing updates, making containment decisions, and capturing a timeline. The output should be plan improvements and a cleaner checklist.
What’s the best format: doc, wiki, or repo?
Use whatever your team will actually open during an incident. Many teams keep a short wiki page for accessibility and a versioned repo for templates/runbooks. The key is that the plan is easy to find, easy to update, and used in drills.
Cheatsheet
Use this as a fast reference. Pin it in your incident channel, print it, or put it in your on-call runbook.
Declare & organize (2 minutes)
- Open incident channel and name the incident
- Assign roles: IC, Tech Lead, Scribe, Comms
- Set severity (S0–S3) and next update time
- State impact and known symptoms (facts only)
First 15 minutes (stabilize)
- Confirm scope: what’s broken, who’s affected, since when
- Preserve evidence: logs/snapshots before big changes
- Contain: isolate, revoke, block, disable, roll back
- Communicate: update cadence and stakeholder routing
Internal update template
- Status: investigating / mitigating / monitoring / resolved
- Impact: what users see, what services affected
- Hypothesis: (optional) labeled as hypothesis
- Actions: what changed, what’s next
- Next update: exact time
Resolution & learning
- Verify stability for a defined window
- Document temporary mitigations and assign cleanup
- Send final update (internal/external as needed)
- Schedule postmortem (48–72h) and track action items
Severity quick map (copy into the plan)
| Severity | When to use | Cadence | Escalation |
|---|---|---|---|
| S0 | Major outage or confirmed sensitive data exposure | 15 min | Exec + Security + Legal (as needed) |
| S1 | Significant impact or suspected active exploitation | 30 min | Engineering lead + Security + Support lead |
| S2 | Partial degradation or localized compromise risk | 60 min | On-call + service owner |
| S3 | Minor issue, low risk | As needed | Service owner |
Put a single “Incident Response” link in your docs homepage, pin it in your on-call channel, and include it in onboarding. The best plan is the one people can find in 10 seconds.
Wrap-up
A good incident response plan doesn’t make incidents disappear—it makes your response predictable. That predictability is what reduces downtime, limits damage, and protects the humans doing the work.
Your next best steps:
- Copy the YAML starter template and fill in your real roles/channels.
- Write your severity matrix and pin it where on-call lives.
- Run a 30-minute tabletop exercise this week using one realistic scenario.
- After your next incident, update the plan with what you learned (especially edge cases and comms).
If your plan feels heavy, it’s too heavy. Your first version should be short enough to read during an incident. The goal is adoption—not perfection.