“Silent” data bugs are the worst kind: your dashboards still render, pipelines still run, and decisions quietly drift. This post shows how to make data quality stick by combining three practical guardrails: tests (fast checks close to the code), contracts (shared definitions between producers and consumers), and alerts (actionable signals with runbooks, not noise).
Quickstart
If you only have an hour, do these in order. The goal is not “perfect data” — it’s preventing the expensive failures (wrong joins, missing days, broken schemas, and quietly duplicated rows).
1) Pick your “critical data products”
Start small: 3–5 tables/models that power revenue, finance, customer comms, or exec dashboards. Everything else can follow later.
- List the owner (team + Slack/Email)
- List consumers (dashboards, services, ML jobs)
- Define what “broken” means (freshness, completeness, duplicates)
2) Add 5 baseline tests per model
These catch 80% of “why is the metric weird today?” incidents. Keep them cheap and automated.
- Uniqueness: primary keys are unique
- Not null: required fields never missing
- Relationships: foreign keys resolve
- Accepted values: enums don’t drift
- Basic bounds: no negative revenue, no impossible dates
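The five baseline tests above can be sketched as plain Python over a slice of rows. This is a minimal, dependency-free sketch; the table shape, column names (`order_id`, `customer_id`, `status`, `revenue`), and allowlists are illustrative, not from a real schema.

```python
# Minimal sketch of the five baseline tests over a list of row dicts.
# Column names and allowlists are illustrative assumptions.

ALLOWED_STATUS = {"paid", "refunded", "chargeback", "pending"}
KNOWN_CUSTOMERS = {10, 11}
REQUIRED_FIELDS = ("order_id", "revenue")

def baseline_failures(rows: list[dict]) -> list[str]:
    """Return human-readable failures for the five baseline checks."""
    failures = []
    ids = [r["order_id"] for r in rows]
    if len(ids) != len(set(ids)):  # uniqueness
        failures.append("uniqueness: duplicate order_id")
    if any(r.get(f) is None for r in rows for f in REQUIRED_FIELDS):  # not null
        failures.append("not_null: required field missing")
    if any(r["customer_id"] not in KNOWN_CUSTOMERS for r in rows):  # relationships
        failures.append("relationships: orphan customer_id")
    if any(r["status"] not in ALLOWED_STATUS for r in rows):  # accepted values
        failures.append("accepted_values: unknown status")
    if any(r["revenue"] is not None and r["revenue"] < 0 for r in rows):  # bounds
        failures.append("bounds: negative revenue")
    return failures

good = [{"order_id": 1, "customer_id": 10, "status": "paid", "revenue": 25.0}]
bad = [{"order_id": 1, "customer_id": 99, "status": "paid", "revenue": -5.0}]
```

In a warehouse these become SQL assertions or dbt tests, but the logic is exactly this small.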
3) Define a freshness SLA + alert
Freshness is the fastest signal that something upstream broke. It’s also easy to explain to stakeholders.
- Pick an SLA per table (e.g., 2h for hourly, 26h for daily)
- Alert only when it matters (SLA breach, not minor delay)
- Include the last successful timestamp in the message
4) Make alerts actionable (one runbook link)
Alerts should answer: what broke, what it affects, who owns it, what to do next. Otherwise they become background noise.
- Put a “next step” checklist in the runbook
- Add a severity level (FYI vs page)
- Route to the owning channel/team
A small set of strict checks on critical data beats a huge set of checks nobody trusts. Start with high-impact tables, tighten them, then expand.
Overview
Data quality fails in predictable ways: a column changes type, a late upstream job shifts timestamps, a join duplicates rows, a backfill creates gaps, or an upstream system emits new “status” values. The hard part isn’t knowing these exist — it’s building a system that catches them every time.
A simple framework: prevent → detect → respond
| Layer | What it does | Examples | When it pays off most |
|---|---|---|---|
| Prevent (contracts) | Stop breaking changes from landing silently | Schema & semantics, versioning, ownership | Many producers/consumers, fast iteration |
| Detect (tests) | Catch wrong or missing data quickly | Uniqueness, not-null, relationships, bounds | Transformations, joins, incremental loads |
| Respond (alerts) | Turn failures into actions, not chaos | Freshness SLA breaches, volume anomalies, runbooks | On-call, business-critical reporting |
This post focuses on pragmatic guardrails you can implement with the tools you already have (warehouse, orchestrator, CI/CD, dbt-style transformations, or lightweight validation code). You’ll learn what to test, where to put the checks, how to design data contracts that don’t rot, and how to alert without creating alert fatigue.
Data quality sticks when it is owned (someone is responsible), automated (checks run without manual effort), and actionable (failures point to a fix, not a mystery).
Core concepts
Before tools and dashboards, it helps to agree on a few definitions and mental models. Most teams struggle with data quality because they treat it as a one-time cleanup instead of an engineering system.
1) Data quality is a set of guarantees (not a vibe)
“Quality” becomes operational when you express it as guarantees consumers can rely on. Think in terms of: schema, freshness, uniqueness, completeness, and validity (values make sense).
Quality dimensions you can actually test
| Dimension | Practical definition | Typical check | Failure symptom |
|---|---|---|---|
| Schema | Columns exist with expected types | Contract/schema validation | Downstream jobs error or mis-cast |
| Freshness | Data arrives within an SLA | Max(timestamp) vs now() | Dashboards “freeze” or lag |
| Uniqueness | Identifiers are not duplicated | count(*) = count(distinct id) | Metrics inflate, joins explode |
| Completeness | Required fields are present | not-null + missing-rate | Null-heavy reports, broken segments |
| Validity | Values fall in sane ranges/sets | bounds + accepted values | Negative revenue, impossible dates |
2) Tests are “unit tests for data”
A good data test is cheap, deterministic, and close to the logic that produces the data. That’s why many teams put tests next to transformations (e.g., model definitions), and run them in CI and on schedule.
What tests are great at
- Catch breaking changes immediately
- Prevent “quietly wrong joins”
- Document assumptions as code
- Keep refactors safe
Where tests struggle
- Slow queries on huge tables (avoid expensive scans)
- Gradual drift (needs anomaly monitoring)
- Ambiguous business logic (needs clear definitions/contracts)
3) Data contracts are shared agreements
A data contract turns “we think this table looks like…” into “this is what the table guarantees”. Contracts work best when they include: ownership, schema, semantic meaning (what columns represent), and a change policy (how breaking changes are handled).
A contract that takes a week to approve will be ignored. Keep the contract small, version it, and make “breaking change” explicit (rename column, change type, redefine meaning).
4) Alerts should be about impact, not curiosity
A “good” alert is one that a human can act on quickly. A “bad” alert is one that teaches humans to ignore alerts. Use severity levels and only page for breaches that affect SLAs or critical consumers.
An actionable alert contains
- What broke: test name, dataset/model, failing condition
- How bad: counts, thresholds, last good timestamp
- Impact: which dashboards/services are affected
- Owner: who is responsible, where to ask
- Runbook: one link with next steps
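As a sketch, the checklist above maps to a small payload builder. All field values here (channel name, runbook URL, dashboard names) are placeholders you would replace with your own.

```python
# Sketch: assemble one actionable alert following the checklist above.
# Channel, runbook URL, and dataset names are placeholder assumptions.

def build_alert(test_name: str, dataset: str, detail: str,
                impact: str, owner: str, runbook: str,
                severity: str = "page") -> dict:
    """Bundle what broke, how bad, impact, owner, and runbook into one message."""
    return {
        "severity": severity,  # "fyi" goes to a channel; "page" wakes someone
        "title": f"[{severity.upper()}] {test_name} failed on {dataset}",
        "body": "\n".join([
            f"What broke: {detail}",
            f"Impact: {impact}",
            f"Owner: {owner}",
            f"Runbook: {runbook}",
        ]),
    }

alert = build_alert(
    test_name="freshness_sla",
    dataset="analytics.orders_daily",
    detail="last load 185 min ago (SLA 120 min)",
    impact="Finance KPI dashboard is stale",
    owner="#data-oncall",
    runbook="https://wiki.example.com/runbooks/orders-daily",
)
```

The severity field is what keeps this from becoming noise: the same builder serves both the FYI channel and the pager.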
Step-by-step
This is a practical blueprint you can implement incrementally. You don’t need a giant platform rewrite — you need a consistent approach: contracts for expectations, tests for correctness, and alerts for response.
Step 1 — Inventory critical assets and define SLAs
If everything is critical, nothing is. Pick a small set of tables/models that your business would notice if they went wrong. Then define SLAs that match reality.
Pick your first 3–5 assets
- Exec dashboards (revenue, retention, growth)
- Billing/invoicing exports
- Customer messaging lists
- Core product metrics tables
Define minimal SLAs
- Freshness: “data is updated within X minutes/hours”
- Completeness: “we have data for all expected partitions”
- Correctness proxies: “IDs unique; key fields not null”
Step 2 — Write a lightweight data contract (one page, versioned)
Your contract should be human-readable and machine-checkable. Don’t start with a 40-page spec. Start with schema, ownership, and a few semantic checks that matter.
If a change could silently alter a business metric, it belongs in the contract (or the tests derived from it).
```yaml
version: 1
dataset: analytics.orders_daily
owner:
  team: data-platform
  channel: "#data-oncall"
description: "Daily order rollup used by Finance and KPI dashboards."
columns:
  - name: order_id
    type: bigint
    nullable: false
    description: "Stable unique identifier for an order."
  - name: order_date
    type: date
    nullable: false
    description: "Business date in UTC (not event timestamp)."
  - name: revenue
    type: numeric
    nullable: false
    description: "Gross revenue in reporting currency."
checks:
  - name: primary_key_unique
    expr: "count(*) = count(distinct order_id)"
  - name: revenue_non_negative
    expr: "min(revenue) >= 0"
sla:
  freshness_minutes: 120
  expected_partitions: "daily"
change_policy:
  breaking_changes:
    - "rename or remove a column"
    - "change column type"
    - "change meaning of a column"
  process: "open PR + notify consumers + bump contract version"
```
The contract is your shared source of truth. You can keep it in the same repo as your transformations and review it like code. The key is versioning: when meaning changes, bump the version so consumers can react.
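To make the contract machine-checkable, the schema section can be enforced with a few lines of code. This is a sketch: in practice you would load the YAML contract file with a parser; here it is inlined as a dict to stay dependency-free, and the "observed schema" would come from your warehouse's information schema.

```python
# Sketch: enforce the contract's schema section against an observed schema.
# The contract is inlined as a dict; in practice, load it from the YAML file.

CONTRACT = {
    "version": 1,
    "columns": [
        {"name": "order_id", "type": "bigint", "nullable": False},
        {"name": "order_date", "type": "date", "nullable": False},
        {"name": "revenue", "type": "numeric", "nullable": False},
    ],
}

def schema_violations(observed: dict[str, str]) -> list[str]:
    """Compare an observed column->type mapping against the contract."""
    violations = []
    for col in CONTRACT["columns"]:
        name, expected = col["name"], col["type"]
        if name not in observed:
            violations.append(f"missing column: {name}")
        elif observed[name] != expected:
            violations.append(
                f"type drift: {name} is {observed[name]}, expected {expected}"
            )
    return violations

# Example: revenue silently became a string after an upstream change.
drifted = {"order_id": "bigint", "order_date": "date", "revenue": "varchar"}
```

Run this in CI against a staging deployment and the "silent type change" class of incident stops landing silently.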
Step 3 — Convert the contract into tests close to the transformations
Contracts define guarantees; tests enforce them. Keep the enforcement near the logic that produces the data (models, transformations, or warehouse views) so failures are easy to trace.
The “baseline test pack” for every important model
| Test type | Why it matters | Typical threshold |
|---|---|---|
| unique(id) | Prevents duplication and metric inflation | 0 duplicates |
| not_null(required_field) | Stops missing keys from breaking joins | 0 nulls (for required) |
| relationships(fk -> dim.pk) | Ensures referential integrity | 0 orphan rows (or explicitly allow) |
| accepted_values(status) | Detects upstream enum drift | strict allowlist |
| bounds(revenue) | Catches impossible values fast | min/max sanity bounds |
```yaml
version: 2
models:
  - name: orders_daily
    description: "Daily order rollup used by KPI dashboards."
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: order_date
        tests:
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ["paid", "refunded", "chargeback", "pending"]
      - name: revenue
        tests:
          - not_null
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id
sources:
  - name: app
    tables:
      - name: orders
        loaded_at_field: created_at
        freshness:
          warn_after: {count: 90, period: minute}
          error_after: {count: 120, period: minute}
```
Notes: keep tests small and fast. For large tables, prefer tests on recent partitions or incremental slices (e.g., last 1–3 days) unless you truly need full history scans.
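One cheap way to scope tests to recent data is to generate the assertion query with a date filter baked in. A sketch, assuming Postgres-style interval syntax; the table and column names are illustrative.

```python
# Sketch: scope an expensive uniqueness check to recent partitions instead of
# scanning full history. Table/column names and SQL dialect are assumptions
# (Postgres-style interval syntax shown).

def recent_duplicate_check_sql(table: str, key: str,
                               date_col: str, days: int = 3) -> str:
    """Build a query returning keys duplicated within the last `days` days."""
    return (
        f"select {key}, count(*) as n "
        f"from {table} "
        f"where {date_col} >= current_date - interval '{days} days' "
        f"group by {key} "
        f"having count(*) > 1"
    )

sql = recent_duplicate_check_sql("analytics.orders_daily", "order_id", "order_date")
```

A zero-row result means the test passes; any returned row is a duplicated key in the recent slice, which is exactly the window where new bugs appear.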
Step 4 — Add data observability checks (freshness + volume + anomaly)
Tests catch “hard” violations. Observability checks catch “soft” drift: sudden drops in row counts, missing partitions, or changes in key distributions. These are often early indicators of upstream issues.
Freshness checks
- Compute max(event_time) or max(loaded_at)
- Compare against SLA (not “as fast as possible”)
- Escalate after SLA breach, not on minor delays
Volume & shape checks
- Row count vs trailing average (weekday-aware if needed)
- Null rate spikes in key columns
- New values in low-cardinality columns
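The row-count check above can be sketched as a comparison against a trailing median. Making it weekday-aware is just a matter of which history you feed in; the counts and drop threshold below are illustrative.

```python
import statistics

# Sketch: row-count anomaly vs a trailing median. To be weekday-aware,
# pass only same-weekday history (Mondays vs Mondays). Numbers are illustrative.

def volume_anomaly(history: list[int], today: int, max_drop: float = 0.30) -> bool:
    """True when today's count falls more than `max_drop` below the trailing median."""
    baseline = statistics.median(history)
    return today < baseline * (1 - max_drop)

# e.g. row counts for the last four same-weekday loads
same_weekday_counts = [10_400, 9_900, 10_100, 10_250]
```

The median (rather than the mean) keeps one bad day in the history from dragging the baseline down and masking the next incident.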
If you alert on every fluctuation, people stop trusting the system. Prefer thresholds tied to impact (SLA, business-critical metrics) and add “FYI” channels for low-severity signals.
Step 5 — Wire alerts with routing and runbooks
Alerts are the “last mile.” Make them useful. The best data quality programs win not by having the most checks, but by having the most fixes per alert.
A minimal runbook template
- Symptoms: what consumers see (which dashboard/job)
- Likely causes: top 3 upstream candidates
- First checks: last successful run, upstream freshness, schema diffs
- Rollback/fallback: use previous partition, disable downstream job, or mark data stale
- Owner escalation: who to page if unresolved
Step 6 — Add a tiny “quality gate” in CI and a scheduled monitor
Run tests in two places: (1) CI on code changes (prevents regressions), and (2) scheduled in your orchestrator (catches upstream changes and late data). The code below shows a lightweight pattern for validating a dataset slice and sending one actionable alert.
```python
import os
import datetime as dt

import requests

# Keep thresholds simple and tied to impact.
FRESHNESS_MINUTES_SLA = 120
MIN_ROWS_TODAY = 1000  # example: expected daily volume lower bound
SLACK_WEBHOOK = os.environ.get("SLACK_WEBHOOK_URL", "")


def send_alert(title: str, body: str) -> None:
    if not SLACK_WEBHOOK:
        raise RuntimeError("Missing SLACK_WEBHOOK_URL")
    payload = {"text": f"*{title}*\n{body}"}
    requests.post(SLACK_WEBHOOK, json=payload, timeout=10)


def minutes_since(ts: dt.datetime) -> int:
    # Expects a timezone-aware UTC timestamp.
    return int((dt.datetime.now(dt.timezone.utc) - ts).total_seconds() // 60)


def quality_check(last_loaded_at: dt.datetime, rows_today: int) -> None:
    freshness = minutes_since(last_loaded_at)
    freshness_ok = freshness <= FRESHNESS_MINUTES_SLA
    volume_ok = rows_today >= MIN_ROWS_TODAY
    if freshness_ok and volume_ok:
        return
    problems = []
    if not freshness_ok:
        problems.append(
            f"- Freshness SLA breached: last load {freshness} min ago "
            f"(SLA {FRESHNESS_MINUTES_SLA} min)"
        )
    if not volume_ok:
        problems.append(f"- Volume too low: rows_today={rows_today} (min {MIN_ROWS_TODAY})")
    send_alert(
        title="Data quality alert: analytics.orders_daily",
        body="\n".join(problems)
        + "\n\nOwner: #data-oncall\nRunbook: (link your internal runbook here)",
    )


# Example usage: in real life, fetch these from your warehouse query.
if __name__ == "__main__":
    simulated_last_loaded = dt.datetime.now(dt.timezone.utc) - dt.timedelta(minutes=185)
    simulated_rows_today = 420
    quality_check(simulated_last_loaded, simulated_rows_today)
```
This pattern is intentionally boring: deterministic thresholds, one message, one owner, one runbook. You can scale it up later — but this alone prevents many “we found out from the CEO” incidents.
Step 7 — Iterate with an incident-driven backlog
The easiest way to prioritize data quality work is to treat each real incident as a backlog item: what check would have caught this earlier? Add that check, then move on.
After every incident, capture
- Root cause (schema change, late data, logic bug, backfill)
- Blast radius (which consumers were wrong)
- Detection gap (why it wasn’t caught)
- New guardrail (test/contract/alert) to prevent repeat
A healthy quality program looks like
- Fewer repeat incidents over time
- Faster time-to-detect (TTD) and time-to-resolve (TTR)
- Smaller blast radius (issues contained)
- More trust in metrics (less manual validation)
Common mistakes
Data quality doesn’t fail because teams don’t care. It fails because teams build checks that are too expensive, too vague, or too noisy. Here are the patterns that show up again and again — and the fixes.
Mistake 1 — “We’ll clean it later” (no guardrails)
Without automation, quality work becomes a periodic cleanup — and then silently regresses.
- Fix: add baseline tests to critical models (unique, not null, relationships).
- Fix: run tests in CI and on schedule.
Mistake 2 — Treating dashboards as monitors
If your first alert is “someone noticed the chart looks weird,” your detection is too late.
- Fix: freshness SLAs with alerts before business hours.
- Fix: volume and null-rate anomaly checks.
Mistake 3 — Tests that scan full history (slow & brittle)
Expensive tests get skipped, throttled, or disabled — which defeats the point.
- Fix: test recent partitions (last N days) for large tables.
- Fix: keep a small set of strict checks; add more only when needed.
Mistake 4 — Alerting on everything (noise wins)
When alerts are constant, nobody responds. Your monitoring becomes a decoration.
- Fix: only page on SLA breaches or high-impact failures.
- Fix: split severity (FYI channel vs on-call page).
Mistake 5 — No ownership (everyone’s problem = nobody’s problem)
If failures don’t route to a clear owner, they bounce around and stay broken longer.
- Fix: put owners in contracts and alert routing.
- Fix: define escalation paths in runbooks.
Mistake 6 — Contract drift (definitions change silently)
A column’s meaning can change without the name changing — that’s how “correct” data becomes wrong.
- Fix: version contracts when meaning changes.
- Fix: communicate breaking changes before they land.
If a failure can cost money or damage trust, enforce it with a test or SLA. If it’s merely “interesting,” log it as an FYI signal.
FAQ
What’s the difference between data tests and data contracts?
Contracts define the promise (schema + meaning + change policy). Tests enforce the promise by checking data and transformations automatically. A contract without tests becomes documentation; tests without a contract often miss the most important semantic guarantees.
Where should data quality checks live: in the warehouse, the pipeline, or CI?
Use all three, but keep it simple: run fast regression tests in CI (so code changes don’t break outputs), and run scheduled checks in your orchestrator (so upstream delays and backfills are detected). Warehouse-side assertions are great when they’re cheap and close to the dataset being validated.
How many tests should I start with?
Start with 5 baseline tests per critical model (unique, not null, relationships, accepted values, bounds), plus a freshness SLA. Expand only when you have a real incident that suggests a new check.
How do I avoid alert fatigue?
Tie alerts to impact. Page only on SLA breaches or critical correctness failures, and send “FYI” drift signals to a non-paging channel. Also ensure every alert has a clear owner and a runbook link.
What should a “good” data contract include at minimum?
Owner, schema (columns/types/nullability), semantic notes for tricky columns, freshness expectations, and a change policy defining what counts as breaking and how it’s communicated.
Do I need a dedicated data quality tool to do this?
No. You can implement durable guardrails with your existing stack: transformation tests, a few warehouse queries, and an alerting channel. Dedicated tools can help with lineage, anomaly detection, and UI — but the core system is definitions + automation + ownership.
Cheatsheet
A scan-fast checklist for building data quality that sticks. Copy this into your team wiki and iterate as you learn.
Baseline for every critical model
- Owner + alert channel defined
- Primary key uniqueness test
- Not-null tests on required fields
- Relationships test to key dimensions
- Accepted values for enums/status
- Bounds/sanity checks on amounts & timestamps
Monitoring & alerting basics
- Freshness SLA per dataset
- Row count anomaly (vs trailing avg)
- Null-rate anomaly on key columns
- Severity levels (FYI vs page)
- Runbook link in every alert
- Routing to dataset owner
Suggested thresholds (starter defaults)
| Signal | Starter threshold | Action |
|---|---|---|
| Freshness | Warn at 75% of SLA, error at 100% of SLA | Notify owner; page only on SLA breach |
| Row count | > 30% drop vs trailing 14-day median (weekday-aware if needed) | Investigate upstream extract/backfill |
| Null rate | Any spike above agreed baseline (e.g., +2–5%) | Check schema changes, parsing issues |
| New enum values | Any value not in allowlist | Decide: add value, map value, or reject |
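The null-rate row in the table above is simple enough to sketch directly; the baseline and tolerance values here are the illustrative starter defaults, not recommendations for your data.

```python
# Sketch: flag a null-rate spike on a key column against an agreed baseline.
# The baseline and tolerance values are illustrative starter defaults.

def null_rate_spike(null_count: int, total: int,
                    baseline_rate: float, tolerance: float = 0.02) -> bool:
    """True when observed null rate exceeds baseline + tolerance (e.g. +2%)."""
    if total == 0:
        return True  # an empty slice is itself suspicious
    return (null_count / total) > (baseline_rate + tolerance)
```

Agree on the baseline with the dataset owner (it lives naturally in the contract), then route breaches to the FYI channel until they coincide with real incidents.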
Maintain a single list: incident → guardrail. Every real incident should produce a new contract clause, test, or monitor that prevents repeats.
Wrap-up
Data quality that sticks isn’t about building the biggest rule set — it’s about building a system people trust: contracts define the promises, tests enforce the promises, and alerts make failures actionable.
If you want a simple next-week plan
- Day 1: pick 3–5 critical data products + define freshness SLAs
- Day 2: add baseline tests (unique/not-null/relationships/accepted values/bounds)
- Day 3: write a lightweight contract for each critical dataset (owner + schema + semantics)
- Day 4: wire one alert per dataset with routing + runbook
- Day 5: review the first failures and convert them into permanent guardrails
From here, the best improvements are incident-driven: every time the business sees “weird numbers,” turn it into a new automated check. After a few cycles, the number of repeat incidents drops — and trust rises.
Perfect data is not the goal. Predictable, explainable data is. Build guardrails that keep meaning stable, catch breaks early, and tell humans what to do next.