Cloud & DevOps · Observability

Observability 101: Logs vs Metrics vs Traces (And When You Need Each)

Stop guessing and start seeing what your systems do.

Reading time: ~8–12 min
Level: All levels

Production issues are rarely “one bug.” They’re usually a chain: a slow DB query, a retry storm, a cache miss pattern, a bad deploy, or an external dependency wobble. Observability is how you see the chain quickly. This post breaks down the three core signals—logs vs metrics vs traces—and gives you a practical way to use each without blowing up costs or drowning in dashboards.


Quickstart

If you want results fast, don’t “implement observability.” Pick one real user journey, instrument it, and make it easy to answer: Is it broken? Where? Why? These are the highest-leverage steps for most teams.

1) Start with one service + one critical endpoint

Observability works best when it’s anchored in a real “golden path” (checkout, login, search, upload, etc.). Don’t instrument everything at once.

  • Pick a single user-facing request (e.g., POST /checkout)
  • Write a simple SLO target (latency + error rate)
  • Decide what “bad” looks like (p95 latency, 5xx spikes)
  • Keep the first scope small so you actually finish

2) Add RED metrics first (fastest signal)

Metrics are your “smoke detector.” They tell you something is wrong before users start filing tickets.

  • Rate: requests per second
  • Errors: error ratio (5xx, timeouts)
  • Duration: p50/p95/p99 latency
  • Break down by service, route, status, env
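To make RED concrete, here is a dependency-free sketch of what Rate, Errors, and Duration boil down to in-process. In a real service you’d use a metrics client (e.g., prom-client or the OpenTelemetry metrics API) rather than hand-rolling this — the class and its nearest-rank percentile are purely illustrative.

```javascript
// Minimal in-process RED recorder (illustrative sketch, not a metrics library).
class RedRecorder {
  constructor() {
    this.total = 0;       // Rate: request count (divide by window for req/s)
    this.errors = 0;      // Errors: 5xx count
    this.durationsMs = []; // Duration: raw latencies for percentiles
  }

  // Call once per finished request.
  record(statusCode, durationMs) {
    this.total += 1;
    if (statusCode >= 500) this.errors += 1; // count 5xx as errors
    this.durationsMs.push(durationMs);
  }

  errorRatio() {
    return this.total === 0 ? 0 : this.errors / this.total;
  }

  // Nearest-rank percentile, e.g. percentile(95) for p95.
  percentile(p) {
    const sorted = [...this.durationsMs].sort((a, b) => a - b);
    if (sorted.length === 0) return 0;
    const rank = Math.ceil((p / 100) * sorted.length);
    return sorted[rank - 1];
  }
}

// Example: 100 requests, 5 of them 5xx, the failures slow.
const red = new RedRecorder();
for (let i = 0; i < 95; i++) red.record(200, 20 + (i % 10));
for (let i = 0; i < 5; i++) red.record(503, 900);
```

Note how the error ratio (5%) and p95 fall straight out of the three counters — that is all an alert rule needs.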

3) Switch logs to structured JSON (and keep them boring)

Logs are your “black box recorder.” They’re most useful when they’re consistent and searchable. Unstructured logs become expensive storytelling.

  • Log events as JSON with stable keys (not free-text)
  • Include service, env, version, request_id
  • Log errors with stack traces once (avoid duplicate spam)
  • Redact PII; don’t log secrets

4) Enable tracing for the same request path

Traces connect the dots across services, queues, caches, and databases. They’re the fastest way to answer “what part of the request was slow?”

  • Instrument only the golden path first
  • Propagate context across HTTP and async jobs
  • Sample intelligently (100% errors, small % of success)
  • Capture span attributes like db.system, http.route

A simple mental model

Metrics tell you something is wrong (zoom out). Traces tell you where it’s wrong (follow the request). Logs tell you why it’s wrong (the specific event and context).

The quickest way to lose observability

Don’t start by collecting “everything.” Start by collecting useful signals. Cost and noise kill observability faster than missing features.

Overview

Observability is the ability to understand a system’s internal state by looking at the data it emits. In practice, it means you can answer new questions during an incident—without shipping new code or guessing. Classic monitoring covers known unknowns (“CPU > 90%”), while observability helps with unknown unknowns (“why did checkout stall only for EU users?”).

Logs vs metrics vs traces: the practical differences

  • Metrics — aggregated numbers over time (time series). Best for alerting, trends, SLOs, and capacity planning. Weak spot: hard to explain “why” without context.
  • Logs — discrete events (usually text/JSON). Best for forensics, debugging edge cases, and audit trails. Weak spot: noisy and expensive at scale; easy to make unsearchable.
  • Traces — a map of one request across services (spans). Best for root-cause localization, latency breakdown, and dependency issues. Weak spot: requires instrumentation and a sampling strategy.

This post is not a vendor tutorial. It’s a decision guide you can apply to any stack (Kubernetes or not, microservices or monoliths).

What you’ll be able to do after reading

  • Choose the right signal for the question you’re answering
  • Design a minimal, useful baseline (RED/USE + structured logs)
  • Instrument tracing without turning it into a cost monster
  • Correlate logs ↔ traces so you can pivot instantly
  • Avoid the most common traps (cardinality, spammy alerts, noisy logs)

What this post intentionally does not do

  • Pick “the best” tool (tools change; concepts last)
  • Recommend collecting every possible metric/log/span
  • Replace good incident response habits (runbooks, ownership)
  • Hide the trade-offs (cost, privacy, complexity)

Why this matters

Faster debugging isn’t just “nice.” It reduces downtime, limits blast radius, and makes deploys safer. The goal is simple: shorten time-to-detect and time-to-explain.

Core concepts

Before you pick tools, you need the “shape” of each signal in your head. Most observability pain comes from mismatched expectations: people try to use logs for alerting, metrics for root-cause narratives, or traces as a full audit log. Each signal shines in a different stage.

The zoom lens model

Zoom out: metrics

Metrics summarize behavior. They are compact, cheap to store, and perfect for “is something drifting?” over time.

  • Great for SLIs/SLOs (error rate, latency percentiles)
  • Great for alerting (low noise when designed well)
  • Best when labels are controlled and low-cardinality

Zoom in: logs

Logs capture specific events. They are high volume but high detail. Logs answer “what happened to this particular request/user?”

  • Best for debugging odd edge cases
  • Best for auditing (when done responsibly)
  • Most useful when structured + correlated

Follow the request: traces

A trace is a collection of spans (timed operations) linked together. It shows the path through services and dependencies. It’s the “map” that explains where time and failures accumulate.

  • Best for latency breakdown (DB vs cache vs external API)
  • Best for dependency graphs and fan-out issues
  • Most powerful when you can pivot from a trace to relevant logs

Logs: what “good” looks like

Logs become useful when they are predictable. “Predictable” means two things: (1) the same event type produces the same fields, and (2) you can filter by stable dimensions like service, environment, version, route, and request/trace ID.

A minimal structured log shape

  • ts — sorting and correlation across systems (e.g., 2026-01-09T14:22:10.123Z)
  • level — noise control, e.g., info/warn/error
  • service, env, version — slice investigations by deploy and environment (e.g., checkout-api, prod, 1.12.3)
  • trace_id / request_id — instant pivot from logs ↔ traces (e.g., 4bf92f3577b34da6a3ce929d0e0e4736)
  • msg + structured context — human explanation plus machine filters (e.g., "payment failed" + {"provider":"..."})
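A logger that emits this shape can be very small. The sketch below is illustrative: the service metadata values and the redaction key list are assumptions for the example, and a real codebase would use an established structured logger (e.g., pino or winston) configured the same way.

```javascript
// Sketch of a structured JSON logger emitting the fields described above.
const SERVICE_META = {
  service: "checkout-api", // assumed names for illustration
  env: "prod",
  version: "1.12.3",
};

// Keys that must never reach log storage in clear text.
const REDACTED_KEYS = new Set(["password", "token", "card_number"]);

function logEvent(level, msg, context = {}) {
  const safeContext = {};
  for (const [key, value] of Object.entries(context)) {
    safeContext[key] = REDACTED_KEYS.has(key) ? "[REDACTED]" : value;
  }
  const line = {
    ts: new Date().toISOString(), // sortable, correlatable timestamp
    level,
    ...SERVICE_META,
    msg,
    ...safeContext,
  };
  const serialized = JSON.stringify(line);
  console.log(serialized);
  return serialized; // returned so callers/tests can inspect the line
}

// Example: an error log carrying the trace ID for correlation.
const out = logEvent("error", "payment failed", {
  trace_id: "4bf92f3577b34da6a3ce929d0e0e4736",
  provider: "acme-pay",   // hypothetical provider name
  token: "sk_live_abc",   // redacted before serialization
});
```

The stable keys are what make the difference: every "payment failed" event is filterable the same way, and the trace_id gives you the pivot to traces.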

Metrics: the cardinality rule (the silent killer)

Metrics are aggregates. That means you should be very careful about what you put in metric labels/dimensions. If a label has too many unique values (high cardinality), you can explode storage and query costs. Logs handle high-cardinality context much better.

Good metric labels

  • service, env, region
  • route (templated, not raw URLs)
  • status or coarse error category
  • method (GET/POST)

Usually bad metric labels

  • user_id, email, session_id
  • Raw URLs with IDs (e.g., /users/123)
  • Full error messages as labels
  • Anything that grows without a hard bound
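The cost of a bad label is simple multiplication: the worst-case number of time series is roughly the product of the unique values per label. A quick back-of-envelope (label counts below are made up for illustration):

```javascript
// Worst-case time series count = product of unique values per label.
function seriesCount(labelCardinalities) {
  return Object.values(labelCardinalities).reduce((acc, n) => acc * n, 1);
}

// Bounded labels: cheap. 20 services x 3 envs x 50 routes x 5 status classes.
const bounded = seriesCount({ service: 20, env: 3, route: 50, status: 5 });
// -> 15,000 series

// Add one unbounded label (user_id with 100k users) and costs explode.
const unbounded = seriesCount({
  service: 20, env: 3, route: 50, status: 5, user_id: 100000,
});
// -> 1.5 billion series
```

One label turns a trivial workload into one no metrics backend will happily store — which is why high-cardinality context belongs in logs and traces.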

SLIs, SLOs, and why alerts should be boring

Observability gets powerful when your metrics reflect what users feel. That’s where SLIs and SLOs come in: an SLI is a measurement (e.g., “p95 latency”), an SLO is a target (e.g., “p95 < 300ms, 99.9% success”). Alerts should be tied to SLO risk or immediate user harm—not “interesting graphs.”
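The arithmetic behind an SLO target is worth internalizing: the error budget is simply the allowed failure fraction times the window. For a 99.9% target over 30 days:

```javascript
// Error-budget arithmetic for an availability SLO.
function errorBudgetMinutes(sloTarget, windowDays) {
  const totalMinutes = windowDays * 24 * 60; // 30 days = 43,200 minutes
  return (1 - sloTarget) * totalMinutes;     // allowed "bad" minutes
}

const budget = errorBudgetMinutes(0.999, 30); // ~43.2 minutes per 30 days
```

That 43 minutes is what “burn rate” alerts protect: an alert should fire when you’re consuming it fast, not when a graph merely looks interesting.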

Alert design rule

If an alert doesn’t tell someone what to do next (or where to look), it’s a notification, not an alert. Use metrics for paging; use traces/logs for investigation.

Correlation: the real superpower

The best observability setups let you move in one click: metric spike → example trace → relevant logs. You get that by propagating context (trace/request IDs) across services and adding those IDs to logs.

Without correlation, you’re doing archaeology

If you can’t link a failing request to its trace and logs, debugging becomes “search and guess.” Correlation is often worth more than adding 100 extra metrics.

Step-by-step

Below is a practical path that works for a monolith, microservices, or Kubernetes. The key is sequencing: start with metrics (detection), then add traces (localization), then make logs searchable (explanation). You can stop after any step and still have something useful.

Step 1 — Define your “golden path” and success criteria

  • Pick 1–2 endpoints that represent user value (login, checkout, search)
  • Define acceptable latency (p95 or p99) and acceptable error rate
  • Decide who owns the on-call response (don’t skip ownership)
  • Write a one-paragraph “when this breaks, users feel X” description

Step 2 — Add baseline metrics (RED + dependency health)

Start with a small set that answers: “Is traffic normal?”, “Are errors rising?”, “Is latency climbing?” Then add a small dependency set (DB, cache, external API) so you can rule things out quickly.

Minimum request metrics (per service)

  • Request rate by route + method
  • Error ratio by status class (2xx/4xx/5xx)
  • Latency percentiles (p50, p95, p99)
  • Saturation proxy (queue depth, thread pool usage, CPU)

Label hygiene (keep metrics cheap)

  • Use templated routes (/users/:id, not /users/123)
  • Keep labels bounded (region, env, service)
  • Put high-cardinality context in logs, not metrics
  • Document any “allowed labels” per metric
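When your framework exposes the matched route (e.g., req.route.path in Express), use that directly. If you only have the raw URL, normalize it before labeling — the patterns below (numeric IDs, UUIDs) are an illustrative sketch, not an exhaustive set:

```javascript
// Sketch: collapse high-cardinality URL segments into templated routes
// so they are safe to use as metric labels.
function templateRoute(path) {
  return path
    .split("/")
    .map((segment) => {
      if (/^\d+$/.test(segment)) return ":id"; // numeric IDs
      if (/^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(segment))
        return ":uuid"; // UUIDs
      return segment;
    })
    .join("/");
}

const a = templateRoute("/users/123/orders/456");
// -> "/users/:id/orders/:id"
const b = templateRoute("/files/3f2504e0-4f89-11d3-9a0c-0305e82c3301");
// -> "/files/:uuid"
```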

Step 3 — Centralize ingestion (so services don’t talk to 5 backends)

Even if you’re small, it’s worth having a single ingestion endpoint so your apps export telemetry one way. A common pattern is to run a collector/agent that receives telemetry (e.g., OTLP) and exports to your chosen backend(s).

Quick local test: run an OTLP collector endpoint

This is a minimal smoke test: it gives you an OTLP endpoint on localhost so apps can export. Replace image tags/endpoints later with your real deployment choices.

# 1) Save an OpenTelemetry Collector config as ./otel-collector.yaml (example below)
# 2) Run a local collector that accepts OTLP over gRPC (4317) and HTTP (4318)

docker run --rm -it \
  -p 4317:4317 -p 4318:4318 \
  -v "$(pwd)/otel-collector.yaml:/etc/otelcol/config.yaml" \
  otel/opentelemetry-collector:latest \
  --config /etc/otelcol/config.yaml

# 3) Point your app to the collector
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_SERVICE_NAME="checkout-api"

Step 4 — Use one pipeline for logs, metrics, and traces

A good default is: receive OTLP → add resource attributes (service/env/version) → batch → export. That gives you consistent metadata across all signals, which is what makes correlation and filtering easy.

Example collector config (single OTLP exporter)

This exports all three signals to one OTLP endpoint. Many backends accept OTLP directly. If yours uses separate endpoints, split exporters by pipeline—keep the structure the same.

receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  batch:
    timeout: 2s
    send_batch_size: 8192
  resource:
    attributes:
      - key: deployment.environment
        value: prod
        action: upsert
      - key: service.namespace
        value: unilab
        action: upsert

exporters:
  otlphttp:
    endpoint: https://otlp.example.com
    headers:
      x-api-key: ${OTEL_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp]

Step 5 — Instrument tracing on the golden path (and sample wisely)

Tracing is most valuable when it helps you answer: “Where did the time go?” and “Which dependency is failing?” Start with automatic HTTP instrumentation, then add a few custom spans around your most important internal steps.

Sampling defaults that work

  • Sample a small % of successful requests (e.g., 1–10%)
  • Sample 100% of errors (5xx/timeouts) if possible
  • Sample more during an incident (temporary override)
  • Keep enough context in spans (route, status, dependency)
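These defaults can be sketched as a single decision function: keep every error, keep a small deterministic share of successes (so all services in a trace make the same decision for the same trace ID), and let an incident flag raise the rate temporarily. The hash and the 5%/50% rates below are illustrative assumptions; real setups use their tracing SDK’s sampler (e.g., OpenTelemetry’s trace-ID-ratio sampler) instead.

```javascript
// Sketch of head-sampling rules: 100% of errors, a deterministic
// fraction of successes, with a temporary incident-mode override.

function hashToUnit(traceId) {
  // Cheap deterministic hash -> [0, 1); fine for a sketch, not crypto.
  let h = 0;
  for (const ch of traceId) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h / 2 ** 32;
}

function shouldSample({ traceId, isError, incidentMode = false }) {
  if (isError) return true;               // always keep errors/timeouts
  const rate = incidentMode ? 0.5 : 0.05; // 5% normally, 50% in incidents
  return hashToUnit(traceId) < rate;
}

const errKept = shouldSample({ traceId: "abc123", isError: true });
```

Hashing the trace ID (rather than rolling a random number) is what keeps a trace whole: every service that sees the same ID reaches the same keep/drop decision.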

What not to do

  • Don’t turn on 100% tracing everywhere “just in case”
  • Don’t store secrets/PII in span attributes
  • Don’t create spans for every loop iteration
  • Don’t forget context propagation (otherwise traces break)

Minimal Node.js tracing setup (OpenTelemetry SDK)

This example starts OpenTelemetry in a Node app and exports traces to an OTLP endpoint (collector or backend). In real projects you’ll also export metrics/logs and add resource attributes like version, region, and environment.

// tracing.js
// Run this before your app code (e.g., with: node -r ./tracing.js server.js)

const { NodeSDK } = require("@opentelemetry/sdk-node");
const { getNodeAutoInstrumentations } = require("@opentelemetry/auto-instrumentations-node");
const { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-http");
const { Resource } = require("@opentelemetry/resources");
const { SemanticResourceAttributes } = require("@opentelemetry/semantic-conventions");

const exporter = new OTLPTraceExporter({
  // Collector default: http://localhost:4318/v1/traces
  url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT || "http://localhost:4318/v1/traces",
});

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: process.env.OTEL_SERVICE_NAME || "checkout-api",
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.OTEL_ENVIRONMENT || "dev",
  }),
  traceExporter: exporter,
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

process.on("SIGTERM", async () => {
  try { await sdk.shutdown(); } finally { process.exit(0); }
});

Step 6 — Make logs searchable (and attach the trace/request ID)

Logs are where the narrative lives: errors, retries, unusual branches, validation failures, and context that doesn’t belong in metrics labels. The trick is to keep logs consistent and correlatable.

Structured logging checklist

  • Use JSON logs with stable keys (service, env, version, route)
  • Include trace_id or request_id in every request-scoped log line
  • Prefer one error log with a stack trace over 20 repeated lines
  • Redact PII; keep payload logging behind a feature flag with sampling
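The “one error log instead of 20 repeated lines” rule usually means a small throttle in front of the logger: log the first occurrence (with its stack trace), suppress repeats of the same error key for a cooldown window, and report how many were suppressed. A dependency-free sketch of that idea:

```javascript
// Sketch of error-log deduplication with a cooldown window.
class ErrorThrottle {
  constructor(cooldownMs = 60_000) {
    this.cooldownMs = cooldownMs;
    this.lastLogged = new Map(); // errorKey -> { at, suppressed }
  }

  // Returns log metadata if this occurrence should be logged, else null.
  shouldLog(errorKey, now = Date.now()) {
    const entry = this.lastLogged.get(errorKey);
    if (!entry || now - entry.at >= this.cooldownMs) {
      const suppressed = entry ? entry.suppressed : 0;
      this.lastLogged.set(errorKey, { at: now, suppressed: 0 });
      return { log: true, suppressedSinceLast: suppressed };
    }
    entry.suppressed += 1; // swallow the repeat, but count it
    return null;
  }
}

// Example: a storm of identical payment timeouts.
const throttle = new ErrorThrottle(60_000);
const first = throttle.shouldLog("payment_timeout", 0);      // logged
const repeat = throttle.shouldLog("payment_timeout", 1_000); // suppressed
const later = throttle.shouldLog("payment_timeout", 61_000); // logged, reports 1 suppressed
```

The suppressed count matters: “this error fired 500 times since the last line” is signal, while 500 identical stack traces are just cost.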

Step 7 — Alert on symptoms, investigate with traces/logs

Your pager should wake you up only when users are likely impacted or you’re burning through your error budget quickly. Metrics are the right signal for this because they’re stable and cheap to query continuously.

A clean incident workflow

Alert (metrics) → open dashboard to confirm → pick a failing slice → open an example trace → pivot to the logs for that trace/request. If you can do this in under 2 minutes, your observability is working.

Common mistakes

Most “observability failures” aren’t missing tools—they’re missing discipline: inconsistent naming, uncontrolled labels, noisy logs, and alerts that don’t map to user impact. Here are the pitfalls that show up over and over (and how to fix them).

Mistake 1 — Using logs as your primary alerting system

Log-based alerts often become noisy, expensive, and brittle. They can be useful, but they’re rarely the best default.

  • Fix: alert on RED/USE metrics tied to user impact.
  • Fix: use logs for investigation and forensics, not paging.

Mistake 2 — High-cardinality metric labels

Putting user_id or raw URLs into metric labels can explode your time series count and make the system slow and expensive.

  • Fix: keep labels bounded (service/env/route/status).
  • Fix: put high-cardinality details in logs, not metrics.

Mistake 3 — Traces without context propagation

If trace context isn’t passed across services and async jobs, your “distributed trace” becomes a set of disconnected fragments.

  • Fix: standardize propagation headers and libraries.
  • Fix: validate that a request stays in the same trace across hops.

Mistake 4 — Unstructured logs (“printf observability”)

Free-text logs are hard to query consistently. During incidents, you’ll spend time guessing the right grep pattern.

  • Fix: switch to structured JSON logs with stable keys.
  • Fix: add trace_id/request_id so you can pivot from traces.

Mistake 5 — Alert fatigue (everything pages)

If everything alerts, nothing alerts. Teams start ignoring the pager, and your real failures hide in the noise.

  • Fix: page only on user-impacting symptoms or fast error-budget burn.
  • Fix: move “interesting” signals to dashboards, not the pager.

Mistake 6 — No version/environment metadata

If you can’t slice by deploy version or environment, you’ll waste time asking “did this start after the deploy?”

  • Fix: add service, env, version, region everywhere.
  • Fix: treat naming conventions as part of your platform, not a suggestion.

The litmus test

If a new engineer can’t answer “is it broken, where, why” using your dashboards within 10 minutes, the problem is usually consistency and correlation, not missing features.

FAQ

Do I really need logs, metrics, and traces?

Not on day one. Most teams should start with metrics for alerting and trends, then add traces for root-cause localization, and keep logs structured for explanations and edge cases. You can ship meaningful observability with metrics alone—but you’ll hit a ceiling when debugging cross-service latency or intermittent errors.

What’s the simplest “good” metric set for web services?

Use RED for requests: Rate, Errors, Duration. Add a small saturation signal (CPU, queue depth, connection pool usage) and a few dependency checks (DB latency, external API error ratio). Keep labels bounded: service, env, route, status.

When should I use distributed tracing instead of logs?

Use tracing when you need to understand a single request across boundaries: service-to-service calls, queues, caches, and databases. Logs tell you what happened in one process; traces show you the path and timing across the whole request.

What does “cardinality” mean and why does it matter?

Cardinality is the number of unique values a label/dimension can take. High-cardinality labels (user IDs, request IDs, raw URLs) can create huge numbers of time series and make metrics storage and queries expensive and slow. Put high-cardinality context in logs/traces instead.

How long should I retain logs?

It depends on compliance, cost, and how often you debug older issues. A common pattern is short hot retention (7–30 days in a fast store) plus longer cold retention (object storage) for audits. If you’re starting out, optimize for the period where you actually investigate incidents.

Is OpenTelemetry required for observability?

No—but it’s a practical standard that makes instrumentation and exporting telemetry more consistent across languages and vendors. The key idea is using a common data model and a common export protocol so your apps don’t become tied to one backend.

How do I keep observability costs under control?

Control volume at the source: limit metric labels, sample traces (especially successes), and keep logs structured with clear retention policies. Make “collect everything” a temporary incident mode, not a permanent default.

Cheatsheet

Print this mentally: pick the right signal, keep labels sane, correlate everything, and let metrics page you (not logs).

Which signal do I use?

If your question is… Start with… Then pivot to…
“Is the service healthy right now?” Metrics (RED/USE, SLO burn) Traces/logs for examples
“Where is latency coming from?” Traces (span breakdown) Logs for error details
“What happened to this one request/user?” Logs (structured, correlated) Trace for request path
“Is this getting worse over weeks?” Metrics (trends) Traces/logs to explain regressions

Metric label rules (keep it cheap)

  • Labels should be bounded (service/env/region/route/status)
  • Never label by user/session/request IDs
  • Template routes (/orders/:id, not raw URLs)
  • Use logs for high-cardinality context

Trace sampling rules (keep it useful)

  • Sample a small % of successes
  • Prefer sampling 100% of errors/timeouts
  • Increase sampling temporarily during incidents
  • Don’t store PII in span attributes

Structured log rules (keep it searchable)

  • JSON logs with stable keys
  • Always include service, env, version
  • Include trace_id / request_id per request
  • Redact secrets/PII; don’t log tokens
  • Throttle noisy errors; avoid duplicate spam

Alert rules (keep it actionable)

  • Page on user impact or fast error-budget burn
  • Dashboards for “interesting,” pager for “urgent”
  • Every page should link to the investigation view
  • Review alerts monthly; delete the ones you ignore

Shortcut for teams under pressure

If you have to pick one improvement this week: make sure every request has a trace ID and that ID appears in logs. That single change often halves investigation time.

Wrap-up

Observability isn’t a dashboard collection. It’s a workflow: detect fast, localize precisely, and explain confidently. The simplest reliable setup is:

  • Metrics for detection and alerting (RED/USE + SLOs)
  • Traces for request-level localization (where time/failure happens)
  • Logs for explanation and forensics (structured + correlated)

Next actions (pick one)

  • Instrument RED metrics for your top endpoint and create one “health” dashboard
  • Enable tracing on the same endpoint and verify context propagation across one downstream call
  • Switch one service’s logs to structured JSON and ensure trace/request IDs are present
  • Delete or downgrade one noisy alert that nobody acts on

How to know you’re winning

Your next incident should feel different: fewer guesses, fewer “who owns this?”, and a faster path from symptom to cause. If that’s true, your observability is improving—even if you have fewer dashboards than before.

If you want to go deeper, continue with the related posts below—especially centralized logging and practical deployment patterns that keep your signals consistent.

Quiz

Quick self-check. This quiz covers logs vs metrics vs traces and common observability decisions.

1) You want to page on-call when users are impacted. Which signal is the best default for alerting?
2) You need to understand where time is spent across multiple services for one slow request. What do you reach for?
3) What is “high cardinality” in metrics, and why is it a problem?
4) What makes logs dramatically more useful during incidents?