SRE basics sound abstract until you use them once: SLIs tell you how users experience your service, SLOs tell you what “good” means, and error budgets turn reliability into a shared decision tool instead of an endless argument. This guide explains SLIs, SLOs, and error budgets in plain English, with examples you can apply to any API, website, or internal platform.
Quickstart
If you only do one thing after reading this post, do this: pick one user-critical journey, define one SLI, set one SLO, and start tracking the error budget. The steps below are designed to get you there fast.
1) Pick a single user journey
Choose something real users care about (not “CPU usage”). Examples: “checkout succeeds”, “search returns results”, “login works”, “webhook accepted”.
- Write the journey in one sentence
- Define what “success” means in the response (status code, business outcome, correctness)
- Pick a scope (one service or the full end-to-end path)
2) Define one SLI that matches user experience
A good SLI is measurable and user-facing. Start with one: availability (success rate) or latency.
- Availability: % of successful requests
- Latency: % of requests under a threshold (or a percentile)
- Keep it simple: “good events / total events”
3) Set an SLO with a time window
An SLO is your reliability target for a window (usually 7, 28, or 30 days). Example: 99.9% successful requests over 30 days.
- Choose a window (rolling is usually easier than calendar-based)
- Pick a target you can meet most of the time
- Document exclusions explicitly (planned maintenance, known non-user paths)
4) Convert the SLO into an error budget
Your error budget is the allowed “badness” in the window: 1 − SLO. This is what makes reliability actionable.
- For request-based SLIs: allowed bad requests = (1 − SLO) × total requests
- For time-based “uptime”: allowed downtime = (1 − SLO) × window duration
- Track how fast you’re spending it (burn rate)
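The budget math above fits in a few lines. This is an illustrative sketch (function names and traffic numbers are made up, not from any library):

```python
def request_budget(slo: float, total_requests: int) -> float:
    """Allowed bad requests in the window for a request-based SLI."""
    return (1.0 - slo) * total_requests

def downtime_budget_seconds(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in the window for a time-based 'uptime' SLI."""
    return (1.0 - slo) * window_days * 24 * 60 * 60

# 99.9% SLO over 30 days:
print(round(request_budget(0.999, 1_000_000)))     # 1000 bad requests allowed
print(round(downtime_budget_seconds(0.999) / 60))  # 43 minutes (~43m 12s)
```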
Start with one SLO for one journey. Add more only when you can explain how each SLO changes decisions: alerts, deploy pace, capacity work, or incident response.
| SLO (30 days) | Error budget | Max downtime (approx.) | When to use |
|---|---|---|---|
| 99.0% | 1.0% | ~7h 12m | Early/internal tools, non-critical systems |
| 99.9% | 0.1% | ~43m | Common default for customer-facing APIs |
| 99.95% | 0.05% | ~21m 36s | Higher expectations, mature operations |
| 99.99% | 0.01% | ~4m 19s | Only if you can fund the complexity |
Overview
The core promise of SRE isn’t “never have incidents.” It’s this: make reliability a product feature with clear goals, and use data to balance reliability work with shipping new features. That balance is what prevents “heroics” — the late-night firefighting, guessy alerting, and endless debates about what “good uptime” means.
What this post covers
- SLIs, SLOs, and error budgets (definitions + mental model)
- How to pick SLIs that match user experience
- How to set SLOs that teams can actually use
- How error budgets change operational decisions
- Common mistakes (and how to avoid them)
What this post is not
- A vendor-specific monitoring tutorial
- A promise that one number will solve reliability
- A recommendation to set 99.99% on day one
- A replacement for incident response and good runbooks
Teams often start with system metrics (CPU, memory, node health). SRE starts with user-perceived outcomes. That shift is the whole point: reliability measured the way customers actually experience it.
| Term | Plain-English meaning | Example |
|---|---|---|
| SLI | A measurement of reliability from the user’s perspective | % of successful checkout requests |
| SLO | A target for an SLI over a defined time window | 99.9% success over 30 days |
| SLA | A contract or promise to customers (often with penalties) | “We guarantee 99.5% monthly uptime” |
| Error budget | The allowed unreliability in the SLO window | 0.1% bad requests per 30 days |
Core concepts
SRE basics become easy once you separate three things: what you measure (SLI), what you target (SLO), and how you make tradeoffs (error budget). The rest is implementation detail.
SLIs, SLOs, and error budgets: the plain-English trio
SLI (Service Level Indicator)
An SLI is a metric that reflects user experience. It should be stable, measurable, and hard to “game.”
- Good SLIs: success rate, latency, freshness, correctness, durability
- Weak SLIs: CPU usage, pod restarts, “number of errors” without context
- Common shape: good events / total events
SLO (Service Level Objective)
An SLO is the target you aim for, tied to a time window. It’s a tool for planning and prioritization.
- Includes: SLI definition + threshold + window
- Example: “99.9% successful requests over 30 days”
- Not the same as “always on” or “never fails”
Error budget
The amount of unreliability you can “spend” and still meet the SLO. It turns reliability into a budgeted resource.
- Budget = 1 − SLO
- Spending budget fast means you should slow risky changes
- Saving budget means you can ship faster (or accept experiments)
SLA (Service Level Agreement)
Customer-facing promise. SLAs should typically be looser than SLOs because they carry legal/financial expectations.
- SLO: internal goal (how you run)
- SLA: external contract (what you promise)
- It’s common to have SLOs without SLAs
The “good events / total events” mental model
Most practical SLIs can be expressed as a ratio. The hard part is not the math — it’s deciding what counts as “good.” Be explicit about the rules so your graphs and your incident reviews tell the same story.
| SLI type | “Good” means… | Common trap |
|---|---|---|
| Availability / success rate | Request returns a valid successful outcome | Counting only HTTP 200, ignoring partial failures/timeouts |
| Latency | Request completes within a threshold (e.g., < 300ms) | Using averages (they hide tail latency) |
| Freshness | Data is up-to-date within an age limit | Measuring “job ran” instead of “data is fresh” |
| Correctness | Result is accurate and validated | No automated validation; “looks fine” checks |
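The ratio model is easy to compute once "good" is written down. A minimal sketch, assuming hypothetical event records with made-up field names (`status`, `duration_ms`, `order_id`):

```python
# Hypothetical request records; field names are illustrative.
events = [
    {"status": 200, "duration_ms": 120, "order_id": "a1"},
    {"status": 200, "duration_ms": 450, "order_id": "b2"},
    {"status": 500, "duration_ms": 90,  "order_id": None},
    {"status": 200, "duration_ms": 80,  "order_id": None},  # 200 but no order: bad
]

def availability_sli(events) -> float:
    # "Good" = 2xx AND the business outcome is present (per the table above).
    good = sum(1 for e in events
               if 200 <= e["status"] < 300 and e["order_id"] is not None)
    return good / len(events)

def latency_sli(events, threshold_ms: int = 300) -> float:
    good = sum(1 for e in events if e["duration_ms"] <= threshold_ms)
    return good / len(events)

print(availability_sli(events))  # 0.5
print(latency_sli(events))       # 0.75
```

Note how the fourth event is "bad" for availability despite its HTTP 200 — exactly the trap the table warns about.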
Burn rate: how fast you’re spending the budget
Error budgets aren’t just a monthly scorecard. The operational value comes from burn rate: are we spending the budget too fast to recover before the window ends? That’s what makes alerts meaningful and reduces noise.
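Burn rate is a one-line calculation. A sketch, using the same 99.9% SLO as elsewhere in this post:

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Burn rate = observed error ratio / error budget fraction.
    1.0 means you'd spend exactly your budget over the full window;
    14 means a 30-day budget would be gone in roughly 2 days."""
    return error_ratio / (1.0 - slo)

# With a 99.9% SLO, a 1.4% error ratio burns budget 14x too fast.
print(burn_rate(0.014, 0.999))  # ~14.0
```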
You can have perfect dashboards and still have unclear reliability goals. SLOs are about decisions: what triggers paging, what blocks releases, and what work gets prioritized.
Step-by-step
Here’s a practical way to implement SRE basics for a real service. You can use this on a single API endpoint or an end-to-end journey. The key is to start narrow, make it measurable, and then let the error budget guide decisions.
Step 1 — Choose the scope: one journey, one boundary
Start with a user journey that’s frequent and business-critical. Pick where you measure: edge (API gateway/load balancer) for user experience, or service (app metrics) for faster diagnosis. Many teams do both later; start with one.
Good first journeys
- Public API: “create order”, “search”, “login”
- Internal platform: “build artifact available”, “deploy succeeded”
- Data pipeline: “event ingested”, “report generated”
Scope checklist
- What’s the user-visible success condition?
- What errors matter (timeouts, 5xx, invalid results)?
- What’s excluded (health checks, bots, internal-only paths)?
Step 2 — Define “good” and “bad” events
This is the most important step. “Good” should match reality, not convenience. Example: if your API returns 200 but delivers an empty/incorrect result, it shouldn’t be “good.” When in doubt, make “good” stricter and document why.
Count an event as “good” only if the user would agree it was successful. If you can’t validate correctness automatically, start with success rate + latency and add correctness SLIs later.
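One way to keep the rules explicit is to encode them in a single predicate. A sketch with hypothetical inputs (the `order_id` check stands in for whatever business outcome your journey has):

```python
def is_good(status: int, timed_out: bool, body: dict) -> bool:
    """Explicit 'good event' rule for a checkout-style journey."""
    if timed_out:                   # timeouts count as bad events
        return False
    if not (200 <= status < 300):   # non-2xx is bad
        return False
    # A 200 with an empty/invalid result is still bad for the user.
    return bool(body.get("order_id"))

print(is_good(200, False, {"order_id": "ord_1"}))  # True
print(is_good(200, False, {}))                     # False: empty result
print(is_good(200, True,  {"order_id": "ord_1"}))  # False: timeout
```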
Step 3 — Write the SLO in a small “contract” document
Your team needs a canonical SLO definition that’s easy to review and hard to misinterpret. Some teams store this as YAML in the service repo so changes are code-reviewed like everything else.
```yaml
service: checkout-api
owner: platform-oncall
window: 30d  # rolling 30-day window

slis:
  - name: request_success_rate
    description: "Share of user checkout requests that succeed"
    good_event: "HTTP status is 2xx AND order_id is present"
    total_event: "All non-healthcheck /checkout requests"
  - name: request_latency
    description: "Share of requests under 300ms"
    threshold_ms: 300
    good_event: "duration_ms <= threshold_ms AND request_success_rate is good"

objectives:
  - sli: request_success_rate
    target: 0.999  # 99.9%
  - sli: request_latency
    target: 0.95   # 95% under 300ms

error_budget_policy:
  freeze_deploys_when_budget_remaining_below: 0.25
  page_when_burn_rate_over:
    short_window: { window: 5m, threshold: 14 }  # fast burn
    long_window: { window: 1h, threshold: 6 }    # sustained burn

notes:
  - "Exclude synthetic probes that do not represent real users."
  - "Include timeouts as bad events."
```
Step 4 — Do the error budget math (make it visible)
Error budgets are easiest to use when you translate them into “how much pain is allowed” and track remaining budget over the window. Avoid vanity targets: pick a number you can sustain and improve over time.
| SLO example | Budget | Meaning | What to watch |
|---|---|---|---|
| 99.9% success (30d) | 0.1% | Up to 1 in 1000 requests can fail (by your definition) | Retries & duplicate side effects can hide failures |
| 95% < 300ms | 5% | Up to 5 in 100 requests can be slow (tail matters) | Look at P95/P99 in addition to the SLO threshold |
| Freshness < 10m | Depends | How often data can be “too old” in the window | Backlogs and stuck jobs spend budget quickly |
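"Make it visible" can start as a single number on a dashboard: the fraction of the window's budget still unspent. A minimal sketch with illustrative traffic figures:

```python
def remaining_budget(slo: float, total: int, bad: int) -> float:
    """Fraction of the window's error budget still unspent (negative = SLO missed)."""
    allowed = (1.0 - slo) * total
    if allowed == 0:
        return 0.0
    return 1.0 - bad / allowed

# 99.9% SLO; 2,000,000 requests so far in the window, 500 of them bad.
print(remaining_budget(0.999, 2_000_000, 500))  # ~0.75 -> 75% of budget left
```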
Step 5 — Alert on budget burn, not on every error
Paging on raw error rate tends to be noisy: small spikes wake people up without changing outcomes. Burn-rate alerting asks a better question: will we run out of error budget soon if this continues?
```
# Example PromQL-style expressions (pseudo recording rules; adapt to your metrics backend).

# 1) Error ratio over a window (bad / total)
error_ratio_5m =
  sum(rate(http_requests_total{service="checkout-api",status=~"5..|408|499"}[5m]))
  /
  sum(rate(http_requests_total{service="checkout-api"}[5m]))

# 2) Burn rate = error_ratio / error_budget
#    For a 99.9% SLO, error_budget = 0.001
burn_rate_5m = error_ratio_5m / 0.001
# burn_rate_1h is defined the same way over a [1h] window.

# 3) Multi-window alert: fast burn AND sustained burn
alert: SLO_BurnRateHigh
expr: (burn_rate_5m > 14) and (burn_rate_1h > 6)
for: 2m
labels:
  severity: page
```
Short windows catch sudden breakage fast. Long windows confirm it’s not a brief blip. Combining them reduces flapping and pager fatigue while still catching real incidents quickly.
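The multi-window logic itself is simple enough to express in a few lines. A sketch using the same example thresholds as the alert above:

```python
def should_page(burn_5m: float, burn_1h: float,
                fast_threshold: float = 14, slow_threshold: float = 6) -> bool:
    """Page only when a fast burn is confirmed by a sustained burn."""
    return burn_5m > fast_threshold and burn_1h > slow_threshold

print(should_page(burn_5m=20, burn_1h=8))  # True: fast AND sustained
print(should_page(burn_5m=20, burn_1h=2))  # False: brief blip, no page
```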
Step 6 — Use the error budget to guide change velocity
Error budgets are a coordination mechanism between feature work and reliability work. A simple policy is enough to start: if budget is healthy, ship; if budget is burning, stabilize.
When budget is healthy
- Ship features and experiments (with safe rollbacks)
- Do planned maintenance and refactors
- Pay down reliability debt proactively
When budget is low or burning fast
- Freeze risky releases for the affected service
- Prioritize incident fixes and reliability work
- Reduce blast radius (rate limits, feature flags, fallbacks)
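The policy can be as blunt as a two-branch function; the thresholds below are examples, not recommendations:

```python
def release_decision(budget_remaining: float, burn_rate_1h: float) -> str:
    """Toy release policy: freeze on low budget or fast sustained burn."""
    if budget_remaining < 0.25 or burn_rate_1h > 6:
        return "freeze"  # stabilize: incident fixes and reliability work only
    return "ship"        # budget is healthy: normal release pace

print(release_decision(budget_remaining=0.6, burn_rate_1h=1.0))  # ship
print(release_decision(budget_remaining=0.1, burn_rate_1h=0.5))  # freeze
```

The point is not the exact numbers but that the decision is written down before the incident, so nobody has to negotiate it at 2 a.m.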
Step 7 — Review monthly and refine (don’t set-and-forget)
Your first SLO won’t be perfect. That’s normal. Do a lightweight review: Were incidents captured by the SLI? Did alerts fire at the right time? Did the policy change decisions? Then iterate on definitions and thresholds.
Monthly SLO review checklist
- Did we meet the SLO? If not, what was the primary cause?
- Did our SLI match user pain (or did we miss important failures)?
- Were alerts actionable (right people, right urgency, right noise level)?
- Did error budget policy actually influence releases/priorities?
- Do we need separate SLOs for different tiers (free vs paid, internal vs external)?
A small script makes the time-based budget math concrete:

```python
from dataclasses import dataclass

@dataclass
class SLO:
    target: float         # e.g. 0.999 for 99.9%
    window_seconds: int   # e.g. 30 days

def error_budget(slo: SLO) -> tuple[float, int]:
    budget_fraction = 1.0 - slo.target
    allowed_downtime_seconds = int(budget_fraction * slo.window_seconds)
    return budget_fraction, allowed_downtime_seconds

def fmt_time(seconds: int) -> str:
    m, s = divmod(seconds, 60)
    h, m = divmod(m, 60)
    d, h = divmod(h, 24)
    parts = []
    if d: parts.append(f"{d}d")
    if h: parts.append(f"{h}h")
    if m: parts.append(f"{m}m")
    if s or not parts: parts.append(f"{s}s")
    return " ".join(parts)

if __name__ == "__main__":
    days_30 = 30 * 24 * 60 * 60
    slo = SLO(target=0.999, window_seconds=days_30)
    budget, downtime = error_budget(slo)
    print(f"SLO target: {slo.target*100:.2f}%")
    print(f"Error budget: {budget*100:.3f}% of the window")
    print(f"Max downtime in 30d (time-based approximation): {fmt_time(downtime)}")
```
Many services use request-based SLIs (good/total requests), not pure uptime minutes. The “downtime minutes” table is useful for intuition, but you should compute your budget using the same SLI definition you use for reporting and alerting.
Common mistakes
Most SLO programs fail for predictable reasons: wrong metrics, too many goals, or goals that don’t change decisions. Use this list as a “pre-mortem” before you roll out SRE basics to more services.
Mistake 1 — Measuring what’s easy, not what users feel
CPU and memory are important, but they’re not SLIs.
- Fix: start with success rate and latency on a user journey.
- Fix: measure at the boundary closest to user experience (edge or API gateway) when possible.
Mistake 2 — Setting an SLO that’s basically an SLA
If you set the internal target too tight, you’ll constantly be “failing,” and the metric will be ignored.
- Fix: pick an achievable target and improve over time.
- Fix: keep SLA looser than SLO if you have both.
Mistake 3 — Using averages for latency
Averages hide tail pain. Users feel the slowest requests.
- Fix: use thresholds (“% under 300ms”) and percentiles (P95/P99) as supporting metrics.
- Fix: separate latency SLOs for different endpoints if one dominates.
Mistake 4 — Alerting on symptoms without a budget context
Paging on “any spike” trains teams to ignore alerts.
- Fix: use burn-rate alerting with multi-window confirmation.
- Fix: page only when there’s real user impact or budget risk.
Mistake 5 — Too many SLOs at once
If nobody can remember them, they won’t be used.
- Fix: start with 1–2 SLOs per service, focused on the top journeys.
- Fix: add more only when they change operational decisions.
Mistake 6 — Vague “good event” definitions
If “good” isn’t defined, reliability debates never end.
- Fix: write down what counts as success (including timeouts and partial failures).
- Fix: document exclusions and revisit them during incident reviews.
Ask your team: “If this SLO goes red, what do we do differently tomorrow?” If the answer is “nothing,” it’s not an SLO yet — it’s a chart.
FAQ
What is the difference between an SLI and an SLO?
An SLI is the measurement (for example, “% of successful requests”). An SLO is the target for that measurement over a window (for example, “99.9% success over 30 days”). SLIs are about facts; SLOs are about goals.
How do you calculate an error budget?
Error budget is 1 − SLO. For a 99.9% SLO, the budget is 0.1% of the window. If you measured 10,000 requests in the window, you can “spend” up to 10 bad requests (if your SLI is request-based).
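In code, that calculation is a one-liner (assuming a request-based SLI):

```python
slo = 0.999
requests_in_window = 10_000
allowed_bad = (1 - slo) * requests_in_window
print(round(allowed_bad))  # 10 bad requests
```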
Should we start with availability or latency SLOs?
Start with availability (success rate) for most services because it maps cleanly to “the thing works.” Add latency next if users complain about slowness or if you have tight performance expectations. Avoid starting with complex multi-dimensional SLOs until your team is comfortable with the basics.
What’s a good first SLO target?
There’s no universal number. A pragmatic starting point for many customer-facing APIs is around 99.9% success over 30 days, but the right target depends on user expectations, business impact, and how much complexity you can fund. Start achievable, then raise the bar as your system and practices mature.
Are SLOs only for big companies with SRE teams?
No. Small teams benefit even more because SLOs reduce wasted effort: fewer noisy alerts, clearer priorities, and less debate. You don’t need a dedicated SRE team — you need one service, one SLI, and one SLO to start.
What is burn rate and why does it matter?
Burn rate is how quickly you’re spending your error budget. It matters because it predicts whether you will miss the SLO before the window ends. It’s a better paging signal than raw error counts, because it ties alerts to real reliability risk.
How do we handle planned maintenance in SLOs?
Be explicit. Either include maintenance as “bad” (if users experience it) or exclude it with a documented rule (for example, maintenance windows announced in advance). The important part is consistency: your SLI, reporting, and incident reviews should all follow the same rules.
Cheatsheet
A scan-fast checklist for applying SRE basics: SLIs, SLOs, and error budgets.
SLO starter pack
- Pick 1 user journey (checkout/login/search)
- Define 1 availability SLI (good/total)
- Set 1 SLO target + window (e.g., 99.9% / 30d)
- Compute error budget (1 − SLO)
- Track remaining budget and burn rate
Alerting rules of thumb
- Page on user impact or budget risk (burn rate), not every spike
- Use at least two windows (fast + sustained)
- Keep alerts actionable (clear owner, runbook, next step)
- Separate paging from ticketing (urgent vs important)
What to measure (SLIs)
- Success rate: % of requests that succeed
- Latency: % under threshold (and track P95/P99)
- Freshness: age of data < threshold
- Correctness: validated results are correct
What not to call an SLI
- CPU/memory/disk usage (useful signals, but not user outcomes)
- “Number of errors” without a denominator
- Health checks only (they can be green while users are broken)
- Anything you can “improve” by changing logging
You’ve implemented SRE basics when your team uses the SLO to decide: “Do we ship?” “Do we page?” “What do we fix next?” If it’s just a dashboard, it’s not yet doing the job.
Wrap-up
SRE basics in plain English come down to three moves: measure user experience (SLIs), set a clear target (SLOs), and use the error budget to make tradeoffs without drama. You don’t need a big program to start — you need one journey, one SLI, one SLO, and one policy that changes decisions.
Next actions (pick one)
- Today: write one “good events / total events” SLI for your most important endpoint
- This week: set a 30-day SLO and compute the error budget; add a dashboard for remaining budget
- This month: switch paging to burn-rate alerts and adopt a simple “freeze releases when budget is low” policy
Want to go deeper? Check the related posts below for runbooks, CI/CD patterns, GitOps rollbacks, and infrastructure practices that support reliable systems.