SRE basics sound abstract until you use them once: SLIs tell you how users experience your service, SLOs tell you what “good” means, and error budgets turn reliability into a shared decision tool instead of an endless argument. This guide explains SLIs, SLOs, and error budgets in plain English, with examples you can apply to any API, website, or internal platform.
Quickstart
If you only do one thing after reading this post, do this: pick one user-critical journey, define one SLI, set one SLO, and start tracking the error budget. The steps below are designed to get you there fast.
1) Pick a single user journey
Choose something real users care about (not “CPU usage”). Examples: “checkout succeeds”, “search returns results”, “login works”, “webhook accepted”.
- Write the journey in one sentence
- Define what “success” means in the response (status code, business outcome, correctness)
- Pick a scope (one service or the full end-to-end path)
2) Define one SLI that matches user experience
A good SLI is measurable and user-facing. Start with one: availability (success rate) or latency.
- Availability: % of successful requests
- Latency: % of requests under a threshold (or a percentile)
- Keep it simple: “good events / total events”
3) Set an SLO with a time window
An SLO is your reliability target for a window (usually 7, 28, or 30 days). Example: 99.9% successful requests over 30 days.
- Choose a window (rolling is usually easier than calendar-based)
- Pick a target you can meet most of the time
- Document exclusions explicitly (planned maintenance, known non-user paths)
4) Convert the SLO into an error budget
Your error budget is the allowed “badness” in the window: 1 − SLO. This is what makes reliability actionable.
- For request-based SLIs: allowed bad requests = (1 − SLO) × total requests
- For time-based “uptime”: allowed downtime = (1 − SLO) × window duration
- Track how fast you’re spending it (burn rate)
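The budget math above fits in a few lines. This is an illustrative sketch (function names and traffic numbers are made up, not from any library):

```python
def request_budget(slo: float, total_requests: int) -> float:
    """Allowed bad requests in the window for a request-based SLI."""
    return (1.0 - slo) * total_requests

def downtime_budget_seconds(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in the window for a time-based 'uptime' SLI."""
    return (1.0 - slo) * window_days * 24 * 60 * 60

# 99.9% SLO over 30 days:
print(round(request_budget(0.999, 1_000_000)))     # 1000 bad requests allowed
print(round(downtime_budget_seconds(0.999) / 60))  # 43 minutes (~43m 12s)
```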
Start with one SLO for one journey. Add more only when you can explain how each SLO changes decisions: alerts, deploy pace, capacity work, or incident response.
| SLO (30 days) | Error budget | Max downtime (approx.) | When to use |
|---|---|---|---|
| 99.0% | 1.0% | ~7h 12m | Early/internal tools, non-critical systems |
| 99.9% | 0.1% | ~43m | Common default for customer-facing APIs |
| 99.95% | 0.05% | ~21m 36s | Higher expectations, mature operations |
| 99.99% | 0.01% | ~4m 19s | Only if you can fund the complexity |
Overview
The core promise of SRE isn’t “never have incidents.” It’s this: make reliability a product feature with clear goals, and use data to balance reliability work with shipping new features. That balance is what prevents “heroics” — the late-night firefighting, guessy alerting, and endless debates about what “good uptime” means.
What this post covers
- SLIs, SLOs, and error budgets (definitions + mental model)
- How to pick SLIs that match user experience
- How to set SLOs that teams can actually use
- How error budgets change operational decisions
- Common mistakes (and how to avoid them)
What this post is not
- A vendor-specific monitoring tutorial
- A promise that one number will solve reliability
- A recommendation to set 99.99% on day one
- A replacement for incident response and good runbooks
Teams often start with system metrics (CPU, memory, node health). SRE starts with user-perceived outcomes. That shift is the whole point: reliability measured the way customers actually experience it.
| Term | Plain-English meaning | Example |
|---|---|---|
| SLI | A measurement of reliability from the user’s perspective | % of successful checkout requests |
| SLO | A target for an SLI over a defined time window | 99.9% success over 30 days |
| SLA | A contract or promise to customers (often with penalties) | “We guarantee 99.5% monthly uptime” |
| Error budget | The allowed unreliability in the SLO window | 0.1% bad requests per 30 days |
Core concepts
SRE basics become easy once you separate three things: what you measure (SLI), what you target (SLO), and how you make tradeoffs (error budget). The rest is implementation detail.
SLIs, SLOs, and error budgets: the plain-English trio
SLI (Service Level Indicator)
An SLI is a metric that reflects user experience. It should be stable, measurable, and hard to “game.”
- Good SLIs: success rate, latency, freshness, correctness, durability
- Weak SLIs: CPU usage, pod restarts, “number of errors” without context
- Common shape: good events / total events
SLO (Service Level Objective)
An SLO is the target you aim for, tied to a time window. It’s a tool for planning and prioritization.
- Includes: SLI definition + threshold + window
- Example: “99.9% successful requests over 30 days”
- Not the same as “always on” or “never fails”
Error budget
The amount of unreliability you can “spend” and still meet the SLO. It turns reliability into a budgeted resource.
- Budget = 1 − SLO
- Spending budget fast means you should slow risky changes
- Saving budget means you can ship faster (or accept experiments)
SLA (Service Level Agreement)
Customer-facing promise. SLAs should typically be looser than SLOs because they carry legal/financial expectations.
- SLO: internal goal (how you run)
- SLA: external contract (what you promise)
- It’s common to have SLOs without SLAs
The “good events / total events” mental model
Most practical SLIs can be expressed as a ratio. The hard part is not the math — it’s deciding what counts as “good.” Be explicit about the rules so your graphs and your incident reviews tell the same story.
| SLI type | “Good” means… | Common trap |
|---|---|---|
| Availability / success rate | Request returns a valid successful outcome | Counting only HTTP 200, ignoring partial failures/timeouts |
| Latency | Request completes within a threshold (e.g., < 300ms) | Using averages (they hide tail latency) |
| Freshness | Data is up-to-date within an age limit | Measuring “job ran” instead of “data is fresh” |
| Correctness | Result is accurate and validated | No automated validation; “looks fine” checks |
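The ratio model is easy to compute once "good" is written down. A minimal sketch, assuming hypothetical event records with made-up field names (`status`, `duration_ms`, `order_id`):

```python
# Hypothetical request records; field names are illustrative.
events = [
    {"status": 200, "duration_ms": 120, "order_id": "a1"},
    {"status": 200, "duration_ms": 450, "order_id": "b2"},
    {"status": 500, "duration_ms": 90,  "order_id": None},
    {"status": 200, "duration_ms": 80,  "order_id": None},  # 200 but no order: bad
]

def availability_sli(events) -> float:
    # "Good" = 2xx AND the business outcome is present (per the table above).
    good = sum(1 for e in events
               if 200 <= e["status"] < 300 and e["order_id"] is not None)
    return good / len(events)

def latency_sli(events, threshold_ms: int = 300) -> float:
    good = sum(1 for e in events if e["duration_ms"] <= threshold_ms)
    return good / len(events)

print(availability_sli(events))  # 0.5
print(latency_sli(events))       # 0.75
```

Note how the fourth event is "bad" for availability despite its HTTP 200 — exactly the trap the table warns about.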
Burn rate: how fast you’re spending the budget
Error budgets aren’t just a monthly scorecard. The operational value comes from burn rate: are we spending the budget too fast to recover before the window ends? That’s what makes alerts meaningful and reduces noise.
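Burn rate is a one-line calculation. A sketch, using the same 99.9% SLO as elsewhere in this post:

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Burn rate = observed error ratio / error budget fraction.
    1.0 means you'd spend exactly your budget over the full window;
    14 means a 30-day budget would be gone in roughly 2 days."""
    return error_ratio / (1.0 - slo)

# With a 99.9% SLO, a 1.4% error ratio burns budget 14x too fast.
print(burn_rate(0.014, 0.999))  # ~14.0
```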
You can have perfect dashboards and still have unclear reliability goals. SLOs are about decisions: what triggers paging, what blocks releases, and what work gets prioritized.
Step-by-step
Here’s a practical way to implement SRE basics for a real service. You can use this on a single API endpoint or an end-to-end journey. The key is to start narrow, make it measurable, and then let the error budget guide decisions.
Step 1 — Choose the scope: one journey, one boundary
Start with a user journey that’s frequent and business-critical. Pick where you measure: edge (API gateway/load balancer) for user experience, or service (app metrics) for faster diagnosis. Many teams do both later; start with one.
Good first journeys
- Public API: “create order”, “search”, “login”
- Internal platform: “build artifact available”, “deploy succeeded”
- Data pipeline: “event ingested”, “report generated”
Scope checklist
- What’s the user-visible success condition?
- What errors matter (timeouts, 5xx, invalid results)?
- What’s excluded (health checks, bots, internal-only paths)?
Step 2 — Define “good” and “bad” events
This is the most important step. “Good” should match reality, not convenience. Example: if your API returns 200 but delivers an empty/incorrect result, it shouldn’t be “good.” When in doubt, make “good” stricter and document why.
Count an event as “good” only if the user would agree it was successful. If you can’t validate correctness automatically, start with success rate + latency and add correctness SLIs later.
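One way to keep the rules explicit is to encode them in a single predicate. A sketch with hypothetical inputs (the `order_id` check stands in for whatever business outcome your journey has):

```python
def is_good(status: int, timed_out: bool, body: dict) -> bool:
    """Explicit 'good event' rule for a checkout-style journey."""
    if timed_out:                   # timeouts count as bad events
        return False
    if not (200 <= status < 300):   # non-2xx is bad
        return False
    # A 200 with an empty/invalid result is still bad for the user.
    return bool(body.get("order_id"))

print(is_good(200, False, {"order_id": "ord_1"}))  # True
print(is_good(200, False, {}))                     # False: empty result
print(is_good(200, True,  {"order_id": "ord_1"}))  # False: timeout
```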
Step 3 — Write the SLO in a small “contract” document
Your team needs a canonical SLO definition that’s easy to review and hard to misinterpret. Some teams store this as YAML in the service repo so changes are code-reviewed like everything else.
```yaml
service: checkout-api
owner: platform-oncall
window: 30d  # rolling 30-day window

slis:
  - name: request_success_rate
    description: "Share of user checkout requests that succeed"
    good_event: "HTTP status is 2xx AND order_id is present"
    total_event: "All non-healthcheck /checkout requests"
  - name: request_latency
    description: "Share of requests under 300ms"
    threshold_ms: 300
    good_event: "duration_ms <= threshold_ms AND request_success_rate is good"

objectives:
  - sli: request_success_rate
    target: 0.999  # 99.9%
  - sli: request_latency
    target: 0.95   # 95% under 300ms

error_budget_policy:
  freeze_deploys_when_budget_remaining_below: 0.25
  page_when_burn_rate_over:
    short_window: { window: 5m, threshold: 14 }  # fast burn
    long_window: { window: 1h, threshold: 6 }    # sustained burn

notes:
  - "Exclude synthetic probes that do not represent real users."
  - "Include timeouts as bad events."
```
Step 4 — Do the error budget math (make it visible)
Error budgets are easiest to use when you translate them into “how much pain is allowed” and track remaining budget over the window. Avoid vanity targets: pick a number you can sustain and improve over time.
| SLO example | Budget | Meaning | What to watch |
|---|---|---|---|
| 99.9% success (30d) | 0.1% | Up to 1 in 1000 requests can fail (by your definition) | Retries & duplicate side effects can hide failures |
| 95% < 300ms | 5% | Up to 5 in 100 requests can be slow (tail matters) | Look at P95/P99 in addition to the SLO threshold |
| Freshness < 10m | Depends | How often data can be “too old” in the window | Backlogs and stuck jobs spend budget quickly |
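"Make it visible" can start as a single number on a dashboard: the fraction of the window's budget still unspent. A minimal sketch with illustrative traffic figures:

```python
def remaining_budget(slo: float, total: int, bad: int) -> float:
    """Fraction of the window's error budget still unspent (negative = SLO missed)."""
    allowed = (1.0 - slo) * total
    if allowed == 0:
        return 0.0
    return 1.0 - bad / allowed

# 99.9% SLO; 2,000,000 requests so far in the window, 500 of them bad.
print(remaining_budget(0.999, 2_000_000, 500))  # ~0.75 -> 75% of budget left
```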
Step 5 — Alert on budget burn, not on every error
Paging on raw error rate tends to be noisy: small spikes wake people up without changing outcomes. Burn-rate alerting asks a better question: will we run out of error budget soon if this continues?
```
# Example PromQL-style expressions (pseudo recording rules; adapt to your metrics backend).

# 1) Error ratio over a window (bad / total)
error_ratio_5m =
  sum(rate(http_requests_total{service="checkout-api",status=~"5..|408|499"}[5m]))
  /
  sum(rate(http_requests_total{service="checkout-api"}[5m]))

# 2) Burn rate = error_ratio / error_budget
#    For a 99.9% SLO, error_budget = 0.001
burn_rate_5m = error_ratio_5m / 0.001
# burn_rate_1h is defined the same way over a [1h] window.

# 3) Multi-window alert: fast burn AND sustained burn
alert: SLO_BurnRateHigh
expr: (burn_rate_5m > 14) and (burn_rate_1h > 6)
for: 2m
labels:
  severity: page
```
Short windows catch sudden breakage fast. Long windows confirm it’s not a brief blip. Combining them reduces flapping and pager fatigue while still catching real incidents quickly.
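The multi-window logic itself is simple enough to express in a few lines. A sketch using the same example thresholds as the alert above:

```python
def should_page(burn_5m: float, burn_1h: float,
                fast_threshold: float = 14, slow_threshold: float = 6) -> bool:
    """Page only when a fast burn is confirmed by a sustained burn."""
    return burn_5m > fast_threshold and burn_1h > slow_threshold

print(should_page(burn_5m=20, burn_1h=8))  # True: fast AND sustained
print(should_page(burn_5m=20, burn_1h=2))  # False: brief blip, no page
```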
Step 6 — Use the error budget to guide change velocity
Error budgets are a coordination mechanism between feature work and reliability work. A simple policy is enough to start: if budget is healthy, ship; if budget is burning, stabilize.
When budget is healthy
- Ship features and experiments (with safe rollbacks)
- Do planned maintenance and refactors
- Pay down reliability debt proactively
When budget is low or burning fast
- Freeze risky releases for the affected service
- Prioritize incident fixes and reliability work
- Reduce blast radius (rate limits, feature flags, fallbacks)
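The policy can be as blunt as a two-branch function; the thresholds below are examples, not recommendations:

```python
def release_decision(budget_remaining: float, burn_rate_1h: float) -> str:
    """Toy release policy: freeze on low budget or fast sustained burn."""
    if budget_remaining < 0.25 or burn_rate_1h > 6:
        return "freeze"  # stabilize: incident fixes and reliability work only
    return "ship"        # budget is healthy: normal release pace

print(release_decision(budget_remaining=0.6, burn_rate_1h=1.0))  # ship
print(release_decision(budget_remaining=0.1, burn_rate_1h=0.5))  # freeze
```

The point is not the exact numbers but that the decision is written down before the incident, so nobody has to negotiate it at 2 a.m.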
Step 7 — Review monthly and refine (don’t set-and-forget)
Your first SLO won’t be perfect. That’s normal. Do a lightweight review: Were incidents captured by the SLI? Did alerts fire at the right time? Did the policy change decisions? Then iterate on definitions and thresholds.
Monthly SLO review checklist
- Did we meet the SLO? If not, what was the primary cause?
- Did our SLI match user pain (or did we miss important failures)?
- Were alerts actionable (right people, right urgency, right noise level)?
- Did error budget policy actually influence releases/priorities?
- Do we need separate SLOs for different tiers (free vs paid, internal vs external)?
A small script makes the time-based budget math concrete:

```python
from dataclasses import dataclass

@dataclass
class SLO:
    target: float         # e.g. 0.999 for 99.9%
    window_seconds: int   # e.g. 30 days

def error_budget(slo: SLO) -> tuple[float, int]:
    budget_fraction = 1.0 - slo.target
    allowed_downtime_seconds = int(budget_fraction * slo.window_seconds)
    return budget_fraction, allowed_downtime_seconds

def fmt_time(seconds: int) -> str:
    m, s = divmod(seconds, 60)
    h, m = divmod(m, 60)
    d, h = divmod(h, 24)
    parts = []
    if d: parts.append(f"{d}d")
    if h: parts.append(f"{h}h")
    if m: parts.append(f"{m}m")
    if s or not parts: parts.append(f"{s}s")
    return " ".join(parts)

if __name__ == "__main__":
    days_30 = 30 * 24 * 60 * 60
    slo = SLO(target=0.999, window_seconds=days_30)
    budget, downtime = error_budget(slo)
    print(f"SLO target: {slo.target*100:.2f}%")
    print(f"Error budget: {budget*100:.3f}% of the window")
    print(f"Max downtime in 30d (time-based approximation): {fmt_time(downtime)}")
```
Many services use request-based SLIs (good/total requests), not pure uptime minutes. The “downtime minutes” table is useful for intuition, but you should compute your budget using the same SLI definition you use for reporting and alerting.
Common mistakes
Most SLO programs fail for predictable reasons: wrong metrics, too many goals, or goals that don’t change decisions. Use this list as a “pre-mortem” before you roll out SRE basics to more services.
Mistake 1 — Measuring what’s easy, not what users feel
CPU and memory are important, but they’re not SLIs.
- Fix: start with success rate and latency on a user journey.
- Fix: measure at the boundary closest to user experience (edge or API gateway) when possible.
Mistake 2 — Setting an SLO that’s basically an SLA
If you set the internal target too tight, you’ll constantly be “failing,” and the metric will be ignored.
- Fix: pick an achievable target and improve over time.
- Fix: keep SLA looser than SLO if you have both.
Mistake 3 — Using averages for latency
Averages hide tail pain. Users feel the slowest requests.
- Fix: use thresholds (“% under 300ms”) and percentiles (P95/P99) as supporting metrics.
- Fix: separate latency SLOs for different endpoints if one dominates.
Mistake 4 — Alerting on symptoms without a budget context
Paging on “any spike” trains teams to ignore alerts.
- Fix: use burn-rate alerting with multi-window confirmation.
- Fix: page only when there’s real user impact or budget risk.
Mistake 5 — Too many SLOs at once
If nobody can remember them, they won’t be used.
- Fix: start with 1–2 SLOs per service, focused on the top journeys.
- Fix: add more only when they change operational decisions.
Mistake 6 — Vague “good event” definitions
If “good” isn’t defined, reliability debates never end.
- Fix: write down what counts as success (including timeouts and partial failures).
- Fix: document exclusions and revisit them during incident reviews.
Ask your team: “If this SLO goes red, what do we do differently tomorrow?” If the answer is “nothing,” it’s not an SLO yet — it’s a chart.
FAQ
What is the difference between an SLI and an SLO?
An SLI is the measurement (for example, “% of successful requests”). An SLO is the target for that measurement over a window (for example, “99.9% success over 30 days”). SLIs are about facts; SLOs are about goals.
How do you calculate an error budget?
Error budget is 1 − SLO. For a 99.9% SLO, the budget is 0.1% of the window. If you measured 10,000 requests in the window, you can “spend” up to 10 bad requests (if your SLI is request-based).
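In code, that calculation is a one-liner (assuming a request-based SLI):

```python
slo = 0.999
requests_in_window = 10_000
allowed_bad = (1 - slo) * requests_in_window
print(round(allowed_bad))  # 10 bad requests
```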
Should we start with availability or latency SLOs?
Start with availability (success rate) for most services because it maps cleanly to “the thing works.” Add latency next if users complain about slowness or if you have tight performance expectations. Avoid starting with complex multi-dimensional SLOs until your team is comfortable with the basics.
What’s a good first SLO target?
There’s no universal number. A pragmatic starting point for many customer-facing APIs is around 99.9% success over 30 days, but the right target depends on user expectations, business impact, and how much complexity you can fund. Start achievable, then raise the bar as your system and practices mature.
Are SLOs only for big companies with SRE teams?
No. Small teams benefit even more because SLOs reduce wasted effort: fewer noisy alerts, clearer priorities, and less debate. You don’t need a dedicated SRE team — you need one service, one SLI, and one SLO to start.
What is burn rate and why does it matter?
Burn rate is how quickly you’re spending your error budget. It matters because it predicts whether you will miss the SLO before the window ends. It’s a better paging signal than raw error counts, because it ties alerts to real reliability risk.
How do we handle planned maintenance in SLOs?
Be explicit. Either include maintenance as “bad” (if users experience it) or exclude it with a documented rule (for example, maintenance windows announced in advance). The important part is consistency: your SLI, reporting, and incident reviews should all follow the same rules.
Cheatsheet
A scan-fast checklist for applying SRE basics: SLIs, SLOs, and error budgets.
SLO starter pack
- Pick 1 user journey (checkout/login/search)
- Define 1 availability SLI (good/total)
- Set 1 SLO target + window (e.g., 99.9% / 30d)
- Compute error budget (1 − SLO)
- Track remaining budget and burn rate
Alerting rules of thumb
- Page on user impact or budget risk (burn rate), not every spike
- Use at least two windows (fast + sustained)
- Keep alerts actionable (clear owner, runbook, next step)
- Separate paging from ticketing (urgent vs important)
What to measure (SLIs)
- Success rate: % of requests that succeed
- Latency: % under threshold (and track P95/P99)
- Freshness: age of data < threshold
- Correctness: validated results are correct
What not to call an SLI
- CPU/memory/disk usage (useful signals, but not user outcomes)
- “Number of errors” without a denominator
- Health checks only (they can be green while users are broken)
- Anything you can “improve” by changing logging
You’ve implemented SRE basics when your team uses the SLO to decide: “Do we ship?” “Do we page?” “What do we fix next?” If it’s just a dashboard, it’s not yet doing the job.
Wrap-up
SRE basics in plain English come down to three moves: measure user experience (SLIs), set a clear target (SLOs), and use the error budget to make tradeoffs without drama. You don’t need a big program to start — you need one journey, one SLI, one SLO, and one policy that changes decisions.
Next actions (pick one)
- Today: write one “good events / total events” SLI for your most important endpoint
- This week: set a 30-day SLO and compute the error budget; add a dashboard for remaining budget
- This month: switch paging to burn-rate alerts and adopt a simple “freeze releases when budget is low” policy
Want to go deeper? Check the related posts below for runbooks, CI/CD patterns, GitOps rollbacks, and infrastructure practices that support reliable systems.