Software Architecture & Best Practices · Architecture Reviews

Architecture Review Checklist: Catch Problems Early

A lightweight checklist for new services and big changes.

Reading time: ~8–12 min
Level: All levels

Architecture reviews are not about “perfect diagrams” or gatekeeping. They’re about catching the handful of problems that are expensive to fix later: unclear boundaries, missing failure handling, shaky data contracts, and “we’ll add observability later” (spoiler: later becomes never).

This post gives you a practical architecture review checklist you can use for new services and big changes: what to ask, what to write down, and how to end a review with decisions and owners—not just vibes.


Quickstart

If you have 30–60 minutes before you start building (or before merging a big change), do this. It’s intentionally lightweight: enough structure to prevent surprises, without turning into a weeks-long ceremony.

1) Write a one-page “review packet”

Your reviewer shouldn’t need to read code to understand the change.

  • Goal: what user/business outcome changes?
  • Scope: what’s in / out for this iteration?
  • Interfaces: APIs/events/data tables that change
  • Dependencies: upstream + downstream systems
  • Risks: top 3 failure modes and mitigations

2) Draw two diagrams (max)

More diagrams often mean less clarity. Keep it strict.

  • System context: who calls what, and why?
  • Critical path: the main request flow + data flow
  • Label trust boundaries and external dependencies
  • Annotate latency-sensitive hops and single points of failure

3) Pick your “quality bar” up front

Most review pain is hidden requirements. Surface them early.

  • Target SLO (availability + latency) for the critical path
  • Data correctness: what must never be wrong?
  • Security posture: authN/authZ + sensitive data handling
  • Cost & capacity: how usage scales and where it breaks
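To make the quality bar concrete, it helps to translate an availability target into an error budget before debating it. A tiny sketch (the targets here are illustrative, not recommendations):

```python
# Rough error-budget math for a candidate availability SLO.

def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed unavailability per window for a given SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo)

# 99.9% over 30 days allows ~43 minutes of downtime;
# 99.99% allows ~4.3 — a very different operational investment.
print(round(error_budget_minutes(0.999), 1))   # 43.2
print(round(error_budget_minutes(0.9999), 2))  # 4.32
```

Writing this number into the packet turns "we want high availability" into a budget the review can actually argue about.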

4) End with decisions + owners

A review that doesn’t change anything is just theater.

  • List 3–7 decisions (or open questions)
  • Assign an owner and a due date for each action
  • Document one “not doing” decision (scope control)
  • Save the packet in the repo (so it stays findable)

Best timing

Run the architecture review checklist when the design is cheap to change: right after you’ve clarified requirements, but before the hard dependency choices are locked in.

Overview

An architecture review is a structured conversation that asks: “Will this system behave acceptably under real conditions?” Real conditions include traffic spikes, partial outages, slow dependencies, bad inputs, permission errors, and the messy reality of teams changing over time.

What this post covers

  • A practical definition of an architecture review (and what it’s not)
  • The core concepts reviewers should anchor on (boundaries, contracts, failure modes, quality attributes)
  • A step-by-step review process you can run in under 90 minutes
  • Common mistakes that create production incidents months later
  • A compact cheatsheet you can paste into a doc or PR template

Review type | When to use it | What “good” looks like
New service | First deployment, new data store, new API surface | Clear boundaries, versioned contracts, deploy plan, observability baseline
Big change | Major refactor, new dependency, scaling shift, multi-region | Migration plan, backward compatibility, rollback path, capacity story
Risk-driven review | High-stakes domain (money, security, safety), known fragile area | Explicit threat model, strict guardrails, measurable risk reduction

What an architecture review is not

  • Not a design-by-committee meeting
  • Not a debate about frameworks or preferences
  • Not a substitute for testing, load testing, or security review
  • Not a one-time document that never updates

Core concepts

Good reviews use the same mental models every time. That consistency is the point: it prevents teams from reviewing only what they happen to remember. These are the concepts that show up in almost every “we didn’t think of that” incident.

1) Quality attributes (the invisible requirements)

Most architecture decisions are tradeoffs between quality attributes: reliability vs cost, consistency vs latency, speed of delivery vs risk. Make these explicit and your review gets dramatically easier.

Attribute | What to ask in a review | Red flags
Reliability | What happens when a dependency is slow or down? | Retries everywhere, no timeouts, no degradation path
Scalability | What scales with traffic (CPU, DB writes, queue lag)? | Unbounded fan-out, shared bottleneck DB/table, hot keys
Security | Where are trust boundaries and sensitive data flows? | Implicit auth, over-broad permissions, secrets in logs
Operability | How do we debug, alert, roll back, and measure health? | “We’ll add logs later”, no SLOs, no runbook owners
Maintainability | What boundaries prevent tight coupling over time? | Leaky abstractions, shared internal schemas, no API versioning

2) Boundaries & contracts (where systems break)

Most outages are not “code is wrong” problems. They’re boundary problems: a contract changed silently, a timeout assumption was false, a retry storm amplified a small incident.

Boundaries to name explicitly

  • Service-to-service APIs (sync)
  • Events/queues (async)
  • Databases and shared stores
  • Third-party APIs
  • Identity, auth, and secrets systems

Contracts to write down

  • Request/response schema (and versioning strategy)
  • Idempotency rules
  • Ordering guarantees (if any)
  • Error model (what codes/messages mean)
  • Latency budget per hop
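One way to keep these contracts honest is to write them down next to the schema rather than scattering them through code. A hypothetical sketch (`CreateOrderRequest`, `ErrorCode`, and the budget value are made-up names for illustration):

```python
from dataclasses import dataclass
from enum import Enum

# A v1 contract where version, idempotency rule, error model, and
# latency budget live beside the schema they govern.

SCHEMA_VERSION = "v1"
LATENCY_BUDGET_MS = 250  # per-hop budget agreed in the review packet

class ErrorCode(Enum):
    INVALID_INPUT = "invalid_input"  # caller bug: do not retry
    CONFLICT = "conflict"            # same idempotency key, different payload
    UNAVAILABLE = "unavailable"      # transient: safe to retry with backoff

@dataclass(frozen=True)
class CreateOrderRequest:
    schema_version: str
    idempotency_key: str  # same key + same payload => same result
    customer_id: str
    amount_cents: int
```

The exact shape matters less than the habit: a reviewer can now point at the error model and ask which codes are retryable.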

3) Critical paths & failure modes

A review should focus on critical user journeys: the flows that matter most and that generate the majority of risk. For each journey, identify the top failure modes and the system behavior you want.

Failure mode thinking (quick pattern)

  • What can fail? (dependency down, slow DB, partial data, invalid input)
  • What do we do? (retry/backoff, fallback, degrade, fail fast)
  • How do we detect it? (alerts, dashboards, SLO burn rate)
  • How do we recover? (rollback, replay, operator action)
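The “what do we do?” answers can be sketched directly. A minimal example of failing fast on a timeout and degrading to a static result (`recommend_for`, `POPULAR_ITEMS`, and `slow_backend` are hypothetical names):

```python
POPULAR_ITEMS = ["a", "b", "c"]  # precomputed fallback, always available

def recommend_for(user_id: str, fetch_personalized) -> list:
    """Return personalized recommendations, degrading to a static list."""
    try:
        # Explicit timeout: fail fast instead of hanging the request.
        return fetch_personalized(user_id, timeout_s=0.2)
    except TimeoutError:
        # Degraded mode: a worse answer, but the page still renders.
        return POPULAR_ITEMS

def slow_backend(user_id, timeout_s):
    raise TimeoutError("recommendation store did not answer in time")

print(recommend_for("u1", slow_backend))  # ['a', 'b', 'c']
```

The review question is then concrete: for each critical hop, does a degraded path like this exist, and is it deliberate?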

4) Decisions, not opinions: the ADR mindset

If you don’t record decisions, the same debates return every quarter—except now they’re happening during an incident. A tiny architecture decision record (ADR) makes a review “stick” and gives future engineers the missing context.

A subtle failure

“We agreed on it in the meeting” is not a decision. A decision is something you can point to later: what we chose, what we rejected, and what would make us revisit it.

Step-by-step

This is a repeatable process for running an architecture review checklist without turning it into a bureaucratic event. It scales down to two people and up to larger teams because it’s built around artifacts (the packet + decisions), not around meetings.

Step 0 — Decide if you need a review (scope gate)

Run a review if any of these are true

  • New service, new API surface, or new public endpoint
  • New data store, new schema, or major migration
  • New dependency (especially third-party or cross-team)
  • Scaling shift (10x traffic), multi-region, or new latency target
  • High-stakes domain (security, money movement, compliance)

Step 1 — Create a review packet (15–30 minutes)

The packet is a small doc that lets reviewers understand the change quickly and ask high-quality questions. Keep it short, but complete. Aim for one page, with links if needed.

Packet sections (recommended)

  • Context: what problem are we solving?
  • Proposed design: the “happy path” flow
  • Alternatives: 1–2 serious options and why not
  • Risks: top 3, with mitigations
  • Rollout: migration + backwards compatibility
  • Ops: metrics/logs/traces + on-call ownership

Two diagrams that pay off

  • Context diagram: actors + systems + trust boundaries
  • Sequence / data flow: one critical path request end-to-end
  • Mark: queues, retries, timeouts, caches
  • Mark: data stores and data classification (PII/sensitive?)

If your team struggles with “review packets” staying consistent, standardize the decision part with a tiny ADR. Copy this template and keep it in your repo.

# ADR: <short decision title>

- **Status:** Proposed | Accepted | Deprecated
- **Date:** YYYY-MM-DD
- **Owner:** <team/person>

## Context
What problem are we solving? What constraints matter (latency, cost, compliance, team boundaries)?

## Decision
What did we decide? Be specific (technology choice, pattern, contract shape, ownership).

## Options considered
1) Option A — why it helps, why it hurts
2) Option B — why it helps, why it hurts
3) Do nothing — what happens if we don’t change anything

## Consequences
- Positive outcomes we expect
- New risks introduced (and how we’ll mitigate them)
- Operational impact (on-call, dashboards, runbooks)

## Revisit triggers
What evidence would make us change our mind (scale thresholds, incident types, cost limits)?

Step 2 — Validate contracts and data flows (the “breakage” pass)

This pass is about integration reality. Ask: what breaks if someone changes something? That includes your own team six months from now.

Area | Checklist questions | Practical evidence
API changes | Is it backward compatible? Is versioning explicit? What’s the deprecation plan? | OpenAPI/contract doc, compatibility tests, migration timeline
Events | Schema evolution rules? At-least-once semantics? Idempotency strategy? | Event schema registry rules, consumer contract tests
Data | Who owns the schema? What are retention/PII rules? How do we backfill? | Data dictionary, ownership doc, backfill playbook
Dependencies | Timeouts/retries? Circuit breakers? Rate limits? What if dependency is degraded? | Client configs, SLOs, known failure handling in code
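For the events row, “at-least-once plus an idempotency strategy” can be as small as deduplicating on an event ID. A toy sketch (a real consumer would persist and expire seen IDs instead of holding a set in memory):

```python
class OrderConsumer:
    """Idempotent consumer for at-least-once delivery."""

    def __init__(self):
        self.seen = set()   # stand-in for a persistent dedupe store
        self.orders = []

    def handle(self, event: dict) -> bool:
        """Process an event once; return False on duplicate delivery."""
        event_id = event["event_id"]
        if event_id in self.seen:
            return False  # redelivery: effect already applied
        self.orders.append(event["payload"])
        self.seen.add(event_id)
        return True

consumer = OrderConsumer()
consumer.handle({"event_id": "e1", "payload": "order-1"})
consumer.handle({"event_id": "e1", "payload": "order-1"})  # redelivered
print(consumer.orders)  # ['order-1']
```

Evidence for the review is the dedupe mechanism itself: where the seen-IDs live, and what happens when that store is unavailable.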

Step 3 — Do a failure-mode walkthrough (the “incident” pass)

Pick one critical path and walk it like an incident commander. You’re looking for missing defaults: timeouts, backpressure, degraded modes, and unclear ownership.

Walkthrough prompts

  • What if the DB is slow for 10 minutes?
  • What if the queue backs up for 1 hour?
  • What if we deploy a buggy version?
  • What if one tenant/customer is noisy?
  • What if a downstream service starts returning 500s?
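For the noisy-tenant prompt, a per-tenant token bucket is one common answer: one customer exhausts their own budget without starving everyone else. A minimal sketch (rates and burst sizes are placeholders):

```python
class TenantLimiter:
    """Token bucket per tenant so one noisy tenant can't starve the rest."""

    def __init__(self, rate_per_s: float, burst: int):
        self.rate, self.burst = rate_per_s, burst
        self.state = {}  # tenant -> (tokens, last_refill_time)

    def allow(self, tenant: str, now: float) -> bool:
        tokens, last = self.state.get(tenant, (self.burst, now))
        # Refill proportionally to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.state[tenant] = (tokens, now)
            return False  # shed load for this tenant only
        self.state[tenant] = (tokens - 1, now)
        return True

limiter = TenantLimiter(rate_per_s=1.0, burst=2)
print([limiter.allow("noisy", t) for t in (0.0, 0.0, 0.0)])  # [True, True, False]
print(limiter.allow("quiet", 0.0))  # True: unaffected by the noisy tenant
```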

Healthy defaults you want to see

  • Explicit timeouts (client and server)
  • Bounded retries with backoff + jitter
  • Idempotency for retries and replays
  • Graceful degradation and feature flags
  • Backpressure / queue limits / load shedding
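The “bounded retries with backoff + jitter” default fits in a few lines. This sketch uses full jitter under a hard cap and a fixed attempt budget (all parameters are illustrative):

```python
import random

def backoff_delays(attempts: int, base_s: float = 0.1, cap_s: float = 2.0,
                   rng=random.random) -> list:
    """Delay before each retry: random in [0, min(cap, base * 2**n)]."""
    return [rng() * min(cap_s, base_s * (2 ** n)) for n in range(attempts)]

def call_with_retries(op, attempts: int = 3):
    delays = backoff_delays(attempts)
    for n in range(attempts):
        try:
            return op()
        except TimeoutError:
            if n == attempts - 1:
                raise  # budget exhausted: fail fast, don't loop forever
            # A real client would: time.sleep(delays[n])
    raise RuntimeError("unreachable with attempts >= 1")

# The cap bounds worst-case added latency, which is the whole point.
print(all(d <= 2.0 for d in backoff_delays(10)))  # True
```

The bounded attempt count plus the cap is what separates this from a retry storm: total extra load and latency per request are both fixed.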

Keep it concrete

Ask reviewers to point to where timeouts, limits, and safety mechanisms live: config keys, manifests, libraries, or runbooks. “We’ll add it” is not evidence.

Step 4 — Check operability (debugging is a feature)

Operability is where architecture meets real life. If you can’t explain how you’ll know it’s broken and how you’ll fix it, you’re shipping a system that will be debugged in production.

Minimum operability checklist

  • Service-level dashboards: latency, error rate, saturation
  • Tracing across the critical path (at least between services you own)
  • Structured logs with request IDs (and sensitive data redaction)
  • Alerts tied to user impact (SLO burn rate beats noisy thresholds)
  • Runbook: what to do for the top 3 incidents
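Structured logs with request IDs and redaction are easy to standardize in a small helper. A sketch (the sensitive field names are examples; real redaction lists come from your data classification):

```python
import json

SENSITIVE_FIELDS = {"email", "card_number", "password"}

def log_event(request_id: str, event: str, **fields) -> str:
    """Render one structured log line, redacting sensitive values."""
    safe = {k: ("[REDACTED]" if k in SENSITIVE_FIELDS else v)
            for k, v in fields.items()}
    return json.dumps({"request_id": request_id, "event": event, **safe},
                      sort_keys=True)

line = log_event("req-42", "checkout.failed",
                 email="a@example.com", amount_cents=1999)
print(line)
```

The request ID is what makes the tracing bullet above cheap: every log line from one request shares a key you can grep or join on.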

If you deploy to Kubernetes, reviewers will often ask for “production defaults” (resources, probes, disruption handling). Here’s a small, review-friendly snippet you can adapt.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: app
          image: example/service:1.2.3
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "1000m"
              memory: "512Mi"
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
          securityContext:
            runAsNonRoot: true
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true

---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: example-service

Step 5 — Make review outcomes actionable (decisions + follow-ups)

A review ends with a small list of actions that reduce risk measurably. If you get 25 action items, you didn’t prioritize. If you get zero action items, you probably didn’t look at the risky parts.

Good action items are

  • Specific: “Add timeout=2s to outbound call X”
  • Owned: named person/team
  • Timed: before launch, or by a specific date
  • Risk-linked: tied to a failure mode

Bad action items sound like

  • “Improve reliability”
  • “Add better monitoring”
  • “Consider caching”
  • “Make it scalable”

Want to make reviews faster? Automate the “gathering” step so the packet always includes the same basics: version, endpoints, dependencies, and deployment config.

#!/usr/bin/env bash
set -euo pipefail

# Create a simple "review bundle" folder you can attach to a doc or PR.
# Assumes you're running from a service repo.

OUT="review-bundle"
rm -rf "$OUT"
mkdir -p "$OUT"

echo "Commit:" > "$OUT/meta.txt"
git rev-parse HEAD >> "$OUT/meta.txt"
echo "Date:" >> "$OUT/meta.txt"
date -Iseconds >> "$OUT/meta.txt"

# Capture key files (adjust paths for your repo)
cp -f README.md "$OUT/" 2>/dev/null || true
cp -f docs/architecture.md "$OUT/" 2>/dev/null || true
cp -f openapi.yaml "$OUT/" 2>/dev/null || true
cp -f docker-compose.yml "$OUT/" 2>/dev/null || true

# K8s manifests (if present)
if [ -d k8s ]; then
  tar -czf "$OUT/k8s-manifests.tgz" k8s
fi

# Quick dependency snapshot (language-agnostic heuristic)
( ls -1 package.json requirements.txt go.mod Cargo.toml pom.xml 2>/dev/null || true ) > "$OUT/deps-files.txt"

# Basic "what changed" context vs main (works if main exists locally)
git fetch origin main --quiet 2>/dev/null || true
git log --oneline --decorate origin/main..HEAD > "$OUT/changes.txt" 2>/dev/null || git log --oneline --decorate -20 > "$OUT/changes.txt"

echo "Bundle created at: $OUT/"
echo "Tip: attach it to the review packet and link the folder from the PR."

If your review keeps stalling

Move debates from “Is X better than Y?” to “What risk are we buying or reducing?” Most choices are fine if their failure modes and ops plan are explicit.

Common mistakes

These are the patterns behind “it seemed fine in dev” incidents. The fixes are usually simple—but only if you notice the pattern early.

Mistake 1 — Reviewing too late

If key choices are already implemented, reviewers can only nitpick.

  • Fix: run the review when the packet exists but the code is still cheap to change.
  • Fix: time-box the review to 60–90 minutes plus follow-ups.

Mistake 2 — Missing non-functional requirements

Teams talk features, then reliability/security shows up as a surprise.

  • Fix: define SLO/latency target and error costs up front.
  • Fix: tie each major decision to a quality attribute.

Mistake 3 — “We’ll add observability later”

Later becomes production debugging with no visibility.

  • Fix: require a minimal dashboard + runbook before launch.
  • Fix: treat logs/metrics/traces as acceptance criteria.

Mistake 4 — Unbounded retries and fan-out

This turns small outages into system-wide incidents (retry storms).

  • Fix: timeouts, bounded retries, and backoff with jitter.
  • Fix: add bulkheads: queue limits, rate limits, circuit breaking.

Mistake 5 — Contracts without versioning

Silent breaking changes are the fastest way to create “random” failures.

  • Fix: version APIs/events, publish deprecation windows, test compatibility.
  • Fix: write “what breaks if we change this field?” in the packet.
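The “what breaks if we change this field?” question can even be answered crudely in code: diff two schema versions and flag removals and type changes. A toy sketch (real setups would use contract tests or a schema registry):

```python
def breaking_changes(old: dict, new: dict) -> list:
    """Fields removed or retyped between two schema versions."""
    problems = []
    for field, ftype in old.items():
        if field not in new:
            problems.append(f"removed: {field}")
        elif new[field] != ftype:
            problems.append(f"retyped: {field} ({ftype} -> {new[field]})")
    return problems

v1 = {"id": "string", "amount": "int", "currency": "string"}
v2 = {"id": "string", "amount": "float"}  # dropped currency, retyped amount

print(breaking_changes(v1, v2))
# ['retyped: amount (int -> float)', 'removed: currency']
```

Note that purely additive changes produce no findings here, which matches the usual compatibility rule: adding optional fields is safe, removing or retyping is not.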

Mistake 6 — No rollback or migration plan

Deploys become irreversible, and you lose your safety net.

  • Fix: plan backwards-compatible migrations and feature-flagged rollout.
  • Fix: define a rollback trigger (metrics threshold) before launch.

Review smell test

If the review spends most of its time on naming, frameworks, or personal preferences, you probably haven’t clarified failure modes, contracts, or quality bars.

FAQ

How long should an architecture review take?

For a new service or major change, target 60–90 minutes for the meeting plus a short async packet review. If it takes longer, the scope is probably too large (split it), or the packet is missing key details (tighten the template).

What should be in an architecture review packet?

At minimum: the goal, the proposed flow, key interfaces/contracts, top dependencies, top risks, and the rollout/ops plan. If a reviewer asks “where does data go?” or “what happens when X fails?” the packet should answer it without digging through code.

Do small teams really need architecture reviews?

Yes—but keep them lightweight. Small teams are even more vulnerable to “tribal knowledge” and implicit decisions. A short checklist and a couple of recorded decisions prevent rework and reduce on-call pain.

What’s the difference between an architecture review and a design doc?

A design doc describes what you plan to build. An architecture review is the evaluation of that plan against reliability, scalability, security, operability, and maintainability concerns—especially at system boundaries.

What if reviewers disagree?

Anchor disagreement to outcomes: cost of failure, latency targets, data correctness requirements, team ownership, and time-to-deliver. If you can’t resolve it quickly, capture an ADR with the options and revisit triggers so the decision is reversible and evidence-driven.

How do we keep reviews from becoming bureaucracy?

Time-box them, standardize the packet, and insist on outcomes: decisions, owners, and dates. The goal is fewer surprises in production, not “more documents.”

Cheatsheet

Copy/paste this into a doc, ticket, or PR description. It’s the architecture review checklist in scan-fast form.

Context & scope

  • What user outcome changes?
  • What’s explicitly out of scope?
  • Who owns the service long-term?
  • What does “done” mean (quality bar)?

Boundaries & contracts

  • APIs/events/data schemas versioned?
  • Backward compatibility strategy documented?
  • Idempotency rules defined?
  • Error model and timeouts explicit?

Reliability & scaling

  • Critical path identified (sequence diagram)?
  • Failure modes + mitigations listed?
  • Retries bounded; backoff + jitter used?
  • Backpressure/load shedding strategy?
  • Capacity story (what scales with traffic)?

Security & data

  • AuthN/authZ model clear (least privilege)?
  • Sensitive data flows identified and minimized?
  • Secrets handled correctly (no logs, no repo)?
  • Auditability needs (who did what)?

Operability & rollout

  • Dashboards: latency, errors, saturation?
  • Traces/logs with request IDs and redaction?
  • Alerts tied to user impact (SLO-based)?
  • Migration plan + rollback plan?
  • Feature flags / gradual rollout strategy?

Review outcomes

  • 3–7 decisions recorded (ADR or packet)?
  • Action items have owners + dates?
  • One explicit “not doing” decision captured?
  • Packet saved in repo for future readers?

Make it habitual

The easiest way to keep architecture reviews lightweight is to run them often in small scopes. Big-bang reviews feel heavy because the change is too large to reason about.

Wrap-up

A good architecture review checklist is a force multiplier: it turns scattered experience into a repeatable process. You don’t need perfect foresight—just consistent habits that catch the big risks early.

If you do nothing else

  • Write the one-page packet (goal, flow, contracts, risks, rollout)
  • Walk the critical path and name failure modes
  • Make operability a requirement (dashboards + runbook)
  • Record decisions (ADRs) and assign owners

Want to go deeper on the pieces that reviews frequently uncover? These posts are a natural next step: Clean Architecture, Event-Driven Architecture, and Design for Observability.

Quiz

Quick self-check. One correct answer per question.

1) When is the best time to run an architecture review checklist for a new service?
2) Which item is most important to include in a review packet?
3) What is an ADR primarily used for?
4) Which combination best reduces the risk of retry storms?