Software Architecture & Best Practices · Architecture Reviews

Architecture Review Checklist: Catch Problems Early

A lightweight checklist for new services and big changes.

Reading time: ~8–12 min
Level: All levels

Architecture reviews are not about “perfect diagrams” or gatekeeping. They’re about catching the handful of problems that are expensive to fix later: unclear boundaries, missing failure handling, shaky data contracts, and “we’ll add observability later” (spoiler: later becomes never).

This post gives you a practical architecture review checklist you can use for new services and big changes: what to ask, what to write down, and how to end a review with decisions and owners—not just vibes.


Quickstart

If you have 30–60 minutes before you start building (or before merging a big change), do this. It’s intentionally lightweight: enough structure to prevent surprises, without turning into a weeks-long ceremony.

1) Write a one-page “review packet”

Your reviewer shouldn’t need to read code to understand the change.

  • Goal: what user/business outcome changes?
  • Scope: what’s in / out for this iteration?
  • Interfaces: APIs/events/data tables that change
  • Dependencies: upstream + downstream systems
  • Risks: top 3 failure modes and mitigations

2) Draw two diagrams (max)

More diagrams often mean less clarity. Keep it strict.

  • System context: who calls what, and why?
  • Critical path: the main request flow + data flow
  • Label trust boundaries and external dependencies
  • Annotate latency-sensitive hops and single points of failure

3) Pick your “quality bar” up front

Most review pain is hidden requirements. Surface them early.

  • Target SLO (availability + latency) for the critical path
  • Data correctness: what must never be wrong?
  • Security posture: authN/authZ + sensitive data handling
  • Cost & capacity: how usage scales and where it breaks
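To make the quality bar concrete, it helps to translate an availability target into an error budget before debating it. A tiny sketch (the targets here are illustrative, not recommendations):

```python
# Rough error-budget math for a candidate availability SLO.

def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed unavailability per window for a given SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo)

# 99.9% over 30 days allows ~43 minutes of downtime;
# 99.99% allows ~4.3 — a very different operational investment.
print(round(error_budget_minutes(0.999), 1))   # 43.2
print(round(error_budget_minutes(0.9999), 2))  # 4.32
```

Writing this number into the packet turns "we want high availability" into a budget the review can actually argue about.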

4) End with decisions + owners

A review that doesn’t change anything is just theater.

  • List 3–7 decisions (or open questions)
  • Assign an owner and a due date for each action
  • Document one “not doing” decision (scope control)
  • Save the packet in the repo (so it stays findable)

Best timing

Run the architecture review checklist when the design is cheap to change: right after you’ve clarified requirements, but before the hard dependency choices are locked in.

Overview

An architecture review is a structured conversation that asks: “Will this system behave acceptably under real conditions?” Real conditions include traffic spikes, partial outages, slow dependencies, bad inputs, permission errors, and the messy reality of teams changing over time.

What this post covers

  • A practical definition of an architecture review (and what it’s not)
  • The core concepts reviewers should anchor on (boundaries, contracts, failure modes, quality attributes)
  • A step-by-step review process you can run in under 90 minutes
  • Common mistakes that create production incidents months later
  • A compact cheatsheet you can paste into a doc or PR template

Review type | When to use it | What “good” looks like
New service | First deployment, new data store, new API surface | Clear boundaries, versioned contracts, deploy plan, observability baseline
Big change | Major refactor, new dependency, scaling shift, multi-region | Migration plan, backward compatibility, rollback path, capacity story
Risk-driven review | High-stakes domain (money, security, safety), known fragile area | Explicit threat model, strict guardrails, measurable risk reduction

What an architecture review is not

  • Not a design-by-committee meeting
  • Not a debate about frameworks or preferences
  • Not a substitute for testing, load testing, or security review
  • Not a one-time document that never updates

Core concepts

Good reviews use the same mental models every time. That consistency is the point: it prevents teams from reviewing only what they happen to remember. These are the concepts that show up in almost every “we didn’t think of that” incident.

1) Quality attributes (the invisible requirements)

Most architecture decisions are tradeoffs between quality attributes: reliability vs cost, consistency vs latency, speed of delivery vs risk. Make these explicit and your review gets dramatically easier.

Attribute | What to ask in a review | Red flags
Reliability | What happens when a dependency is slow or down? | Retries everywhere, no timeouts, no degradation path
Scalability | What scales with traffic (CPU, DB writes, queue lag)? | Unbounded fan-out, shared bottleneck DB/table, hot keys
Security | Where are trust boundaries and sensitive data flows? | Implicit auth, over-broad permissions, secrets in logs
Operability | How do we debug, alert, roll back, and measure health? | “We’ll add logs later”, no SLOs, no runbook owners
Maintainability | What boundaries prevent tight coupling over time? | Leaky abstractions, shared internal schemas, no API versioning

2) Boundaries & contracts (where systems break)

Most outages are not “code is wrong” problems. They’re boundary problems: a contract changed silently, a timeout assumption was false, a retry storm amplified a small incident.

Boundaries to name explicitly

  • Service-to-service APIs (sync)
  • Events/queues (async)
  • Databases and shared stores
  • Third-party APIs
  • Identity, auth, and secrets systems

Contracts to write down

  • Request/response schema (and versioning strategy)
  • Idempotency rules
  • Ordering guarantees (if any)
  • Error model (what codes/messages mean)
  • Latency budget per hop
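One way to keep these contracts honest is to write them down next to the schema rather than scattering them through code. A hypothetical sketch (`CreateOrderRequest`, `ErrorCode`, and the budget value are made-up names for illustration):

```python
from dataclasses import dataclass
from enum import Enum

# A v1 contract where version, idempotency rule, error model, and
# latency budget live beside the schema they govern.

SCHEMA_VERSION = "v1"
LATENCY_BUDGET_MS = 250  # per-hop budget agreed in the review packet

class ErrorCode(Enum):
    INVALID_INPUT = "invalid_input"  # caller bug: do not retry
    CONFLICT = "conflict"            # same idempotency key, different payload
    UNAVAILABLE = "unavailable"      # transient: safe to retry with backoff

@dataclass(frozen=True)
class CreateOrderRequest:
    schema_version: str
    idempotency_key: str  # same key + same payload => same result
    customer_id: str
    amount_cents: int
```

The exact shape matters less than the habit: a reviewer can now point at the error model and ask which codes are retryable.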

3) Critical paths & failure modes

A review should focus on critical user journeys: the flows that matter most and that generate the majority of risk. For each journey, identify the top failure modes and the system behavior you want.

Failure mode thinking (quick pattern)

  • What can fail? (dependency down, slow DB, partial data, invalid input)
  • What do we do? (retry/backoff, fallback, degrade, fail fast)
  • How do we detect it? (alerts, dashboards, SLO burn rate)
  • How do we recover? (rollback, replay, operator action)
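The “what do we do?” answers can be sketched directly. A minimal example of failing fast on a timeout and degrading to a static result (`recommend_for`, `POPULAR_ITEMS`, and `slow_backend` are hypothetical names):

```python
POPULAR_ITEMS = ["a", "b", "c"]  # precomputed fallback, always available

def recommend_for(user_id: str, fetch_personalized) -> list:
    """Return personalized recommendations, degrading to a static list."""
    try:
        # Explicit timeout: fail fast instead of hanging the request.
        return fetch_personalized(user_id, timeout_s=0.2)
    except TimeoutError:
        # Degraded mode: a worse answer, but the page still renders.
        return POPULAR_ITEMS

def slow_backend(user_id, timeout_s):
    raise TimeoutError("recommendation store did not answer in time")

print(recommend_for("u1", slow_backend))  # ['a', 'b', 'c']
```

The review question is then concrete: for each critical hop, does a degraded path like this exist, and is it deliberate?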

4) Decisions, not opinions: the ADR mindset

If you don’t record decisions, the same debates return every quarter—except now they’re happening during an incident. A tiny architecture decision record (ADR) makes a review “stick” and gives future engineers the missing context.

A subtle failure

“We agreed on it in the meeting” is not a decision. A decision is something you can point to later: what we chose, what we rejected, and what would make us revisit it.

Step-by-step

This is a repeatable process for running an architecture review checklist without turning it into a bureaucratic event. It scales down to two people and up to larger teams because it’s built around artifacts (the packet + decisions), not around meetings.

Step 0 — Decide if you need a review (scope gate)

Run a review if any of these are true

  • New service, new API surface, or new public endpoint
  • New data store, new schema, or major migration
  • New dependency (especially third-party or cross-team)
  • Scaling shift (10x traffic), multi-region, or new latency target
  • High-stakes domain (security, money movement, compliance)

Step 1 — Create a review packet (15–30 minutes)

The packet is a small doc that lets reviewers understand the change quickly and ask high-quality questions. Keep it short, but complete. Aim for one page, with links if needed.

Packet sections (recommended)

  • Context: what problem are we solving?
  • Proposed design: the “happy path” flow
  • Alternatives: 1–2 serious options and why not
  • Risks: top 3, with mitigations
  • Rollout: migration + backwards compatibility
  • Ops: metrics/logs/traces + on-call ownership

Two diagrams that pay off

  • Context diagram: actors + systems + trust boundaries
  • Sequence / data flow: one critical path request end-to-end
  • Mark: queues, retries, timeouts, caches
  • Mark: data stores and data classification (PII/sensitive?)

If your team struggles with “review packets” staying consistent, standardize the decision part with a tiny ADR. Copy this template and keep it in your repo.

# ADR: <short decision title>

- **Status:** Proposed | Accepted | Deprecated
- **Date:** YYYY-MM-DD
- **Owner:** <team/person>

## Context
What problem are we solving? What constraints matter (latency, cost, compliance, team boundaries)?

## Decision
What did we decide? Be specific (technology choice, pattern, contract shape, ownership).

## Options considered
1) Option A — why it helps, why it hurts
2) Option B — why it helps, why it hurts
3) Do nothing — what happens if we don’t change anything

## Consequences
- Positive outcomes we expect
- New risks introduced (and how we’ll mitigate them)
- Operational impact (on-call, dashboards, runbooks)

## Revisit triggers
What evidence would make us change our mind (scale thresholds, incident types, cost limits)?

Step 2 — Validate contracts and data flows (the “breakage” pass)

This pass is about integration reality. Ask: what breaks if someone changes something? That includes your own team six months from now.

Area | Checklist questions | Practical evidence
API changes | Is it backward compatible? Is versioning explicit? What’s the deprecation plan? | OpenAPI/contract doc, compatibility tests, migration timeline
Events | Schema evolution rules? At-least-once semantics? Idempotency strategy? | Event schema registry rules, consumer contract tests
Data | Who owns the schema? What are retention/PII rules? How do we backfill? | Data dictionary, ownership doc, backfill playbook
Dependencies | Timeouts/retries? Circuit breakers? Rate limits? What if dependency is degraded? | Client configs, SLOs, known failure handling in code
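For the events row, “at-least-once plus an idempotency strategy” can be as small as deduplicating on an event ID. A toy sketch (a real consumer would persist and expire seen IDs instead of holding a set in memory):

```python
class OrderConsumer:
    """Idempotent consumer for at-least-once delivery."""

    def __init__(self):
        self.seen = set()   # stand-in for a persistent dedupe store
        self.orders = []

    def handle(self, event: dict) -> bool:
        """Process an event once; return False on duplicate delivery."""
        event_id = event["event_id"]
        if event_id in self.seen:
            return False  # redelivery: effect already applied
        self.orders.append(event["payload"])
        self.seen.add(event_id)
        return True

consumer = OrderConsumer()
consumer.handle({"event_id": "e1", "payload": "order-1"})
consumer.handle({"event_id": "e1", "payload": "order-1"})  # redelivered
print(consumer.orders)  # ['order-1']
```

Evidence for the review is the dedupe mechanism itself: where the seen-IDs live, and what happens when that store is unavailable.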

Step 3 — Do a failure-mode walkthrough (the “incident” pass)

Pick one critical path and walk it like an incident commander. You’re looking for missing defaults: timeouts, backpressure, degraded modes, and unclear ownership.

Walkthrough prompts

  • What if the DB is slow for 10 minutes?
  • What if the queue backs up for 1 hour?
  • What if we deploy a buggy version?
  • What if one tenant/customer is noisy?
  • What if a downstream service starts returning 500s?
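For the noisy-tenant prompt, a per-tenant token bucket is one common answer: one customer exhausts their own budget without starving everyone else. A minimal sketch (rates and burst sizes are placeholders):

```python
class TenantLimiter:
    """Token bucket per tenant so one noisy tenant can't starve the rest."""

    def __init__(self, rate_per_s: float, burst: int):
        self.rate, self.burst = rate_per_s, burst
        self.state = {}  # tenant -> (tokens, last_refill_time)

    def allow(self, tenant: str, now: float) -> bool:
        tokens, last = self.state.get(tenant, (self.burst, now))
        # Refill proportionally to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.state[tenant] = (tokens, now)
            return False  # shed load for this tenant only
        self.state[tenant] = (tokens - 1, now)
        return True

limiter = TenantLimiter(rate_per_s=1.0, burst=2)
print([limiter.allow("noisy", t) for t in (0.0, 0.0, 0.0)])  # [True, True, False]
print(limiter.allow("quiet", 0.0))  # True: unaffected by the noisy tenant
```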

Healthy defaults you want to see

  • Explicit timeouts (client and server)
  • Bounded retries with backoff + jitter
  • Idempotency for retries and replays
  • Graceful degradation and feature flags
  • Backpressure / queue limits / load shedding
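The “bounded retries with backoff + jitter” default fits in a few lines. This sketch uses full jitter under a hard cap and a fixed attempt budget (all parameters are illustrative):

```python
import random

def backoff_delays(attempts: int, base_s: float = 0.1, cap_s: float = 2.0,
                   rng=random.random) -> list:
    """Delay before each retry: random in [0, min(cap, base * 2**n)]."""
    return [rng() * min(cap_s, base_s * (2 ** n)) for n in range(attempts)]

def call_with_retries(op, attempts: int = 3):
    delays = backoff_delays(attempts)
    for n in range(attempts):
        try:
            return op()
        except TimeoutError:
            if n == attempts - 1:
                raise  # budget exhausted: fail fast, don't loop forever
            # A real client would: time.sleep(delays[n])
    raise RuntimeError("unreachable with attempts >= 1")

# The cap bounds worst-case added latency, which is the whole point.
print(all(d <= 2.0 for d in backoff_delays(10)))  # True
```

The bounded attempt count plus the cap is what separates this from a retry storm: total extra load and latency per request are both fixed.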

Keep it concrete

Ask reviewers to point to where timeouts, limits, and safety mechanisms live: config keys, manifests, libraries, or runbooks. “We’ll add it” is not evidence.

Step 4 — Check operability (debugging is a feature)

Operability is where architecture meets real life. If you can’t explain how you’ll know it’s broken and how you’ll fix it, you’re shipping a system that will be debugged in production.

Minimum operability checklist

  • Service-level dashboards: latency, error rate, saturation
  • Tracing across the critical path (at least between services you own)
  • Structured logs with request IDs (and sensitive data redaction)
  • Alerts tied to user impact (SLO burn rate beats noisy thresholds)
  • Runbook: what to do for the top 3 incidents
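Structured logs with request IDs and redaction are easy to standardize in a small helper. A sketch (the sensitive field names are examples; real redaction lists come from your data classification):

```python
import json

SENSITIVE_FIELDS = {"email", "card_number", "password"}

def log_event(request_id: str, event: str, **fields) -> str:
    """Render one structured log line, redacting sensitive values."""
    safe = {k: ("[REDACTED]" if k in SENSITIVE_FIELDS else v)
            for k, v in fields.items()}
    return json.dumps({"request_id": request_id, "event": event, **safe},
                      sort_keys=True)

line = log_event("req-42", "checkout.failed",
                 email="a@example.com", amount_cents=1999)
print(line)
```

The request ID is what makes the tracing bullet above cheap: every log line from one request shares a key you can grep or join on.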

If you deploy to Kubernetes, reviewers will often ask for “production defaults” (resources, probes, disruption handling). Here’s a small, review-friendly snippet you can adapt.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: app
          image: example/service:1.2.3
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "1000m"
              memory: "512Mi"
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
          securityContext:
            runAsNonRoot: true
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true

---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: example-service

Step 5 — Make review outcomes actionable (decisions + follow-ups)

A review ends with a small list of actions that reduce risk measurably. If you get 25 action items, you didn’t prioritize. If you get zero action items, you probably didn’t look at the risky parts.

Good action items are

  • Specific: “Add timeout=2s to outbound call X”
  • Owned: named person/team
  • Timed: before launch, or by a specific date
  • Risk-linked: tied to a failure mode

Bad action items sound like

  • “Improve reliability”
  • “Add better monitoring”
  • “Consider caching”
  • “Make it scalable”

Want to make reviews faster? Automate the “gathering” step so the packet always includes the same basics: version, endpoints, dependencies, and deployment config.

#!/usr/bin/env bash
set -euo pipefail

# Create a simple "review bundle" folder you can attach to a doc or PR.
# Assumes you're running from a service repo.

OUT="review-bundle"
rm -rf "$OUT"
mkdir -p "$OUT"

echo "Commit:" > "$OUT/meta.txt"
git rev-parse HEAD >> "$OUT/meta.txt"
echo "Date:" >> "$OUT/meta.txt"
date -Iseconds >> "$OUT/meta.txt"

# Capture key files (adjust paths for your repo)
cp -f README.md "$OUT/" 2>/dev/null || true
cp -f docs/architecture.md "$OUT/" 2>/dev/null || true
cp -f openapi.yaml "$OUT/" 2>/dev/null || true
cp -f docker-compose.yml "$OUT/" 2>/dev/null || true

# K8s manifests (if present)
if [ -d k8s ]; then
  tar -czf "$OUT/k8s-manifests.tgz" k8s
fi

# Quick dependency snapshot (language-agnostic heuristic)
( ls -1 package.json requirements.txt go.mod Cargo.toml pom.xml 2>/dev/null || true ) > "$OUT/deps-files.txt"

# Basic "what changed" context vs main (works if main exists locally)
git fetch origin main --quiet 2>/dev/null || true
git log --oneline --decorate origin/main..HEAD > "$OUT/changes.txt" 2>/dev/null || git log --oneline --decorate -20 > "$OUT/changes.txt"

echo "Bundle created at: $OUT/"
echo "Tip: attach it to the review packet and link the folder from the PR."

If your review keeps stalling

Move debates from “Is X better than Y?” to “What risk are we buying or reducing?” Most choices are fine if their failure modes and ops plan are explicit.

Common mistakes

These are the patterns behind “it seemed fine in dev” incidents. The fixes are usually simple—but only if you notice the pattern early.

Mistake 1 — Reviewing too late

If key choices are already implemented, reviewers can only nitpick.

  • Fix: run the review when the packet exists but the code is still cheap to change.
  • Fix: time-box the review to 60–90 minutes plus follow-ups.

Mistake 2 — Missing non-functional requirements

Teams talk features, then reliability/security shows up as a surprise.

  • Fix: define SLO/latency target and error costs up front.
  • Fix: tie each major decision to a quality attribute.

Mistake 3 — “We’ll add observability later”

Later becomes production debugging with no visibility.

  • Fix: require a minimal dashboard + runbook before launch.
  • Fix: treat logs/metrics/traces as acceptance criteria.

Mistake 4 — Unbounded retries and fan-out

This turns small outages into system-wide incidents (retry storms).

  • Fix: timeouts, bounded retries, and backoff with jitter.
  • Fix: add bulkheads: queue limits, rate limits, circuit breaking.

Mistake 5 — Contracts without versioning

Silent breaking changes are the fastest way to create “random” failures.

  • Fix: version APIs/events, publish deprecation windows, test compatibility.
  • Fix: write “what breaks if we change this field?” in the packet.
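The “what breaks if we change this field?” question can even be answered crudely in code: diff two schema versions and flag removals and type changes. A toy sketch (real setups would use contract tests or a schema registry):

```python
def breaking_changes(old: dict, new: dict) -> list:
    """Fields removed or retyped between two schema versions."""
    problems = []
    for field, ftype in old.items():
        if field not in new:
            problems.append(f"removed: {field}")
        elif new[field] != ftype:
            problems.append(f"retyped: {field} ({ftype} -> {new[field]})")
    return problems

v1 = {"id": "string", "amount": "int", "currency": "string"}
v2 = {"id": "string", "amount": "float"}  # dropped currency, retyped amount

print(breaking_changes(v1, v2))
# ['retyped: amount (int -> float)', 'removed: currency']
```

Note that purely additive changes produce no findings here, which matches the usual compatibility rule: adding optional fields is safe, removing or retyping is not.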

Mistake 6 — No rollback or migration plan

Deploys become irreversible, and you lose your safety net.

  • Fix: plan backwards-compatible migrations and feature-flagged rollout.
  • Fix: define a rollback trigger (metrics threshold) before launch.

Review smell test

If the review spends most of its time on naming, frameworks, or personal preferences, you probably haven’t clarified failure modes, contracts, or quality bars.

FAQ

How long should an architecture review take?

For a new service or major change, target 60–90 minutes for the meeting plus a short async packet review. If it takes longer, the scope is probably too large (split it), or the packet is missing key details (tighten the template).

What should be in an architecture review packet?

At minimum: the goal, the proposed flow, key interfaces/contracts, top dependencies, top risks, and the rollout/ops plan. If a reviewer asks “where does data go?” or “what happens when X fails?” the packet should answer it without digging through code.

Do small teams really need architecture reviews?

Yes—but keep them lightweight. Small teams are even more vulnerable to “tribal knowledge” and implicit decisions. A short checklist and a couple of recorded decisions prevent rework and reduce on-call pain.

What’s the difference between an architecture review and a design doc?

A design doc describes what you plan to build. An architecture review is the evaluation of that plan against reliability, scalability, security, operability, and maintainability concerns—especially at system boundaries.

What if reviewers disagree?

Anchor disagreement to outcomes: cost of failure, latency targets, data correctness requirements, team ownership, and time-to-deliver. If you can’t resolve it quickly, capture an ADR with the options and revisit triggers so the decision is reversible and evidence-driven.

How do we keep reviews from becoming bureaucracy?

Time-box them, standardize the packet, and insist on outcomes: decisions, owners, and dates. The goal is fewer surprises in production, not “more documents.”

Cheatsheet

Copy/paste this into a doc, ticket, or PR description. It’s the architecture review checklist in scan-fast form.

Context & scope

  • What user outcome changes?
  • What’s explicitly out of scope?
  • Who owns the service long-term?
  • What does “done” mean (quality bar)?

Boundaries & contracts

  • APIs/events/data schemas versioned?
  • Backward compatibility strategy documented?
  • Idempotency rules defined?
  • Error model and timeouts explicit?

Reliability & scaling

  • Critical path identified (sequence diagram)?
  • Failure modes + mitigations listed?
  • Retries bounded; backoff + jitter used?
  • Backpressure/load shedding strategy?
  • Capacity story (what scales with traffic)?

Security & data

  • AuthN/authZ model clear (least privilege)?
  • Sensitive data flows identified and minimized?
  • Secrets handled correctly (no logs, no repo)?
  • Auditability needs (who did what)?

Operability & rollout

  • Dashboards: latency, errors, saturation?
  • Traces/logs with request IDs and redaction?
  • Alerts tied to user impact (SLO-based)?
  • Migration plan + rollback plan?
  • Feature flags / gradual rollout strategy?

Review outcomes

  • 3–7 decisions recorded (ADR or packet)?
  • Action items have owners + dates?
  • One explicit “not doing” decision captured?
  • Packet saved in repo for future readers?

Make it habitual

The easiest way to keep architecture reviews lightweight is to run them often in small scopes. Big-bang reviews feel heavy because the change is too large to reason about.

Wrap-up

A good architecture review checklist is a force multiplier: it turns scattered experience into a repeatable process. You don’t need perfect foresight—just consistent habits that catch the big risks early.

If you do nothing else

  • Write the one-page packet (goal, flow, contracts, risks, rollout)
  • Walk the critical path and name failure modes
  • Make operability a requirement (dashboards + runbook)
  • Record decisions (ADRs) and assign owners

Want to go deeper on the pieces that reviews frequently uncover? These posts are a natural next step: Clean Architecture, Event-Driven Architecture, and Design for Observability.

Quiz

Quick self-check. One correct answer per question.

1) When is the best time to run an architecture review checklist for a new service?
2) Which item is most important to include in a review packet?
3) What is an ADR primarily used for?
4) Which combination best reduces the risk of retry storms?