dbt is the “missing middle” between raw warehouse tables and analytics you can trust. It turns SQL transformations into a versioned, tested, documented project—so pipelines scale with your team instead of turning into a tangled folder of queries. This guide is a practical, opinionated walkthrough of dbt in practice: how to structure models, add the right tests (not too many), keep docs useful, and deploy without breaking downstream dashboards.
Quickstart
If you want immediate wins, do these in order. You can apply them to a new dbt project or an existing one that’s getting messy.
1) Create three layers (and stick to them)
This is the simplest structure that scales: staging → intermediate → marts, built on top of your raw sources. It reduces duplicated logic and makes PR reviews easier.
- staging: clean + type cast + rename (no business logic)
- intermediate: joins and reusable transformations
- marts: business-ready models (facts/dimensions or reporting tables)
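On disk, the three layers above typically map to a folder tree like this (folder and model names are illustrative, not prescribed by dbt):

```text
models/
├── staging/
│   ├── app/
│   │   ├── stg_orders.sql
│   │   └── stg_users.sql
│   └── billing/
│       └── stg_invoices.sql
├── intermediate/
│   └── int_orders_enriched.sql
└── marts/
    ├── fct_orders.sql
    └── dim_users.sql
```

Grouping staging by source system (app, billing) keeps each raw system's quirks in one place.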
2) Replace SELECT * with explicit columns
Explicit columns prevent silent schema drift and make model diffs readable when upstream sources change.
- Rename columns once (in staging)
- Cast types once (in staging)
- Use consistent timestamps (UTC) and keys
3) Add 5 tests that catch 80% of breakages
Start small. Tests are powerful, but only if you maintain them.
- not_null on primary keys
- unique on primary keys (or surrogate keys)
- relationships between facts → dims
- One or two accepted_values checks for critical enums
- One custom test for a core business rule (e.g., “amount >= 0”)
4) Adopt one command: dbt build
Use dbt build locally and in CI to run models + tests as one pipeline. It encourages “ship only when green.”
- Local dev: dbt build --select +my_model
- PR checks: dbt build --select state:modified+
- Production job: dbt build (full DAG or scheduled subsets)
Put “meaning” in marts, not in staging. Staging is for standardization; marts are for decisions. When you separate the two, refactors stop being scary.
Overview
“Transformations” are where data projects succeed or fail. Raw ingestion can be automated, and BI tools are fast to click through—but transformations are where you define business logic, manage schema changes, and enforce quality. dbt makes those transformations reproducible by treating them like software: modular code, reviews, tests, and documentation.
What this post covers
- The mental model: dbt compiles SQL + builds a DAG of dependencies
- Clean project structure: staging/intermediate/marts and naming conventions
- Models that scale: materializations (views/tables/incremental) and performance habits
- Testing strategy: the small set of checks that actually prevent incidents
- Docs that stay alive: descriptions, exposures, and “how to use this model” notes
- Deployment patterns: CI for PRs, environments, and safe releases
| Problem | How it looks in the wild | dbt habit that fixes it |
|---|---|---|
| Inconsistent logic | Two dashboards disagree on the same metric | Single “golden” mart model + refs |
| Silent schema drift | Columns appear/disappear upstream and break reports later | Explicit columns + source tests + docs |
| Slow transformations | Queries get slower as data grows | Incremental models + partition/cluster-aware design |
| Fear of change | No one wants to touch the “core SQL” | Layering + tests + CI for safe refactors |
dbt is primarily the T in ELT: data is loaded into the warehouse first, then transformed inside the warehouse. That’s why dbt projects often start paying off as soon as you have multiple consumers (dashboards, reverse ETL, ML features, audits).
Core concepts
dbt is simple on the surface—“SQL models + Jinja”—but the power comes from a few core ideas. If you internalize these, your projects stay clean as they grow.
Models + DAG (dependency graph)
A dbt model is a SQL select statement that dbt materializes as a view/table/incremental table.
The DAG is built from references: {{ ref('some_model') }} and {{ source('raw', 'table') }}.
dbt uses the DAG to run things in the right order, select subsets for dev, and generate documentation.
- ref: model-to-model dependency, makes renames safe
- source: defines raw inputs and enables freshness/testing
- +model selection: include parents/children in the DAG
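As a minimal sketch, assuming models named stg_orders and fct_orders, the DAG edges come straight from these calls:

```sql
-- models/staging/app/stg_orders.sql (illustrative)
-- source() points at a declared raw table; this model only standardizes it.
select id as order_id, user_id, amount_cents
from {{ source('app', 'orders') }}

-- models/marts/fct_orders.sql (illustrative)
-- ref() creates the dependency: dbt builds stg_orders before fct_orders,
-- and renaming stg_orders later only requires updating the ref.
select order_id, user_id, amount_cents
from {{ ref('stg_orders') }}
```

With these in place, dbt build --select +fct_orders runs stg_orders first, automatically.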
Materializations (how models are built)
Materializations control performance and cost. The trick is to choose the simplest option that meets latency needs.
| Type | When to use | Watch out for |
|---|---|---|
| view | Fast iteration, small datasets | Slow queries if downstream stacks too many views |
| table | Stable marts, expensive logic, dashboards | Rebuild cost; schedule appropriately |
| incremental | Large fact tables with append/update patterns | Need correct unique key + incremental filter |
| ephemeral | Small reusable subqueries (compiled inline) | Can explode query complexity if overused |
Layering: staging → intermediate → marts
The most practical “clean dbt” pattern is layering. It enforces a boundary between standardizing data and defining metrics.
| Layer | Primary job | Typical contents | Anti-pattern |
|---|---|---|---|
| staging | Make raw data usable | renames, type casts, dedupe, basic filters | business definitions (“active customer”, “net revenue”) |
| intermediate | Reusable transformations | joins, enrichment, shared calculations | one-off reporting logic for a single dashboard |
| marts | Business-ready outputs | facts/dims, metric tables, reporting models | directly referencing raw sources everywhere |
Tests + docs are part of the transformation
dbt tests aren’t an afterthought. They are the guardrails that keep transformations stable as upstream data changes. And docs aren’t “wiki work”—they are how other humans learn to use (and trust) your models.
Good tests are narrow and high-value
- Prefer tests that catch real incidents (null keys, duplicates, broken joins)
- Write one custom business test only when a failure is costly
- Keep flaky tests out of CI (or they’ll be ignored)
Good docs answer “how do I use this?”
- Describe grain (one row per what?)
- Define keys and join paths
- Explain tricky fields and edge cases
- Link to owners/exposures (dashboards, reports)
A dbt project becomes fragile when marts directly encode raw quirks (“sometimes this column is text, sometimes int”) or when tests are added everywhere without a strategy. Clean dbt is clarity + consistency, not maximalism.
Step-by-step
This walkthrough assumes you already have data landing in your warehouse (via Fivetran/Airbyte/custom pipelines). The goal is to turn that raw data into transformations you can safely iterate on: consistent naming, predictable layers, tests that catch breakage, and a deployment path that doesn’t surprise your stakeholders.
Step 1 — Define sources (raw inputs) before writing models
Start by declaring your sources. This gives you a single place to document raw tables, apply basic checks, and (optionally) run freshness monitoring. It’s also how you stop every model from hardcoding raw schema names.
- Group sources by system (app, billing, product analytics)
- Name raw schemas consistently (e.g., raw_app, raw_billing)
- Decide what “fresh enough” means per source (hourly/daily)
- Document weird columns (IDs that change format, soft deletes, etc.)
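A minimal sources.yml sketch covering the checklist above; the schema, table, and column names here are assumptions to adapt to your warehouse:

```yaml
# models/staging/app/_app__sources.yml (illustrative)
version: 2

sources:
  - name: app                     # referenced as source('app', 'orders')
    schema: raw_app               # consistent raw schema naming
    loaded_at_field: _loaded_at   # ingestion timestamp; depends on your loader
    freshness:                    # "fresh enough" thresholds for this source
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
        description: "Raw orders. Soft deletes arrive as a deleted_at timestamp."
      - name: users
```

Run dbt source freshness to check these thresholds, and dbt will also document every declared table in the generated docs site.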
Step 2 — Build staging models (standardize, don’t theorize)
Staging models are the “translation layer” between raw data and the rest of your project. They should be boring: rename columns, cast types, deduplicate, and make timestamps consistent. Avoid business logic here. This is where you win long-term maintainability.
Staging naming convention
- Folder: models/staging/<source_system>/
- Model prefix: stg_ (e.g., stg_orders)
- Columns: snake_case, explicit types, no SELECT *
- Surrogate keys only if needed for uniqueness
What belongs in staging
- Type casting (cast(... as ...))
- Timezone normalization
- Soft delete handling (flag + filter strategy)
- Dedupe rules (e.g., latest record per ID)
Below is a practical staging model example with an incremental pattern you can reuse for large tables. The dateadd and qualify syntax is Snowflake-style; adjust both for your adapter. (In real projects, most staging models are views; make them incremental only when data volume demands it.)
{{ config(
    materialized='incremental',
    unique_key='order_id',
    on_schema_change='sync_all_columns'
) }}

with src as (

    select
        id as order_id,
        user_id,
        status,
        currency,
        amount_cents,
        created_at as created_at_utc,
        updated_at as updated_at_utc
    from {{ source('app', 'orders') }}

    {% if is_incremental() %}
    -- Reprocess a small window to catch late updates
    where updated_at >= dateadd(day, -3, (select max(updated_at_utc) from {{ this }}))
    {% endif %}

),

typed as (

    select
        cast(order_id as bigint) as order_id,
        cast(user_id as bigint) as user_id,
        cast(status as varchar) as status,
        cast(currency as varchar) as currency,
        cast(amount_cents as bigint) as amount_cents,
        cast(created_at_utc as timestamp) as created_at_utc,
        cast(updated_at_utc as timestamp) as updated_at_utc
    from src

),

deduped as (

    select *
    from typed
    qualify row_number() over (partition by order_id order by updated_at_utc desc) = 1

)

select * from deduped
Incremental models are only correct if your filter matches the data’s update behavior. If updates can arrive late, use a rolling reprocessing window (like the example) or an ingestion timestamp. If you don’t, you’ll ship “quietly wrong” tables.
Step 3 — Add schema tests + model docs (where it matters)
Add tests and descriptions next to the models they protect. The goal is to encode assumptions: primary keys should not be null, they should be unique, and facts should join to dimensions. Keep the first set of tests small and focused—then expand only when failures prove they’re useful.
A starter schema.yml you can copy
This example documents grain, adds high-signal tests, and includes one business rule check. Adjust names for your warehouse adapter (types and functions may vary).
version: 2

models:
  - name: stg_orders
    description: "Staging table for orders. One row per order_id (latest version)."
    columns:
      - name: order_id
        description: "Primary key from the app database."
        tests:
          - not_null
          - unique
      - name: user_id
        description: "Foreign key to users."
        tests:
          - not_null
      - name: status
        description: "Order status emitted by the app."
        tests:
          - accepted_values:
              values: ["pending", "paid", "canceled", "refunded"]
      - name: amount_cents
        description: "Order amount in cents, non-negative."
        tests:
          - not_null

  - name: fct_orders
    description: |
      Orders fact table for analytics.
      Grain: one row per order_id.
      Use cases: revenue reporting, cohort analysis, funnel metrics.
    columns:
      - name: order_id
        description: "Primary key."
        tests:
          - not_null
          - unique
      - name: user_id
        description: "Join to dim_users on user_id."
        tests:
          - relationships:
              to: ref('dim_users')
              field: user_id
      - name: net_amount_cents
        description: "Net amount after refunds/adjustments, cents."
        tests:
          - not_null

The custom business rule belongs in a singular test: a SQL file under tests/ that returns the rows violating the rule (any returned row fails the test). dbt does not support inline SQL tests inside schema.yml.

-- tests/orders_amount_non_negative.sql
-- Net amounts should not be negative.
select order_id
from {{ ref('fct_orders') }}
where net_amount_cents < 0
Step 4 — Build marts (the “meaning layer”)
Marts are where you define business concepts: “active customer”, “net revenue”, “retention cohort”, “conversion”. Keep marts stable and easy to consume. Two practical patterns that work across teams:
Facts & dimensions
Best when analytics needs consistent join paths and a clear grain.
- fct_ tables at a clear event grain (orders, payments, sessions)
- dim_ tables for entities (users, products, accounts)
- Enforce relationships tests from facts → dims
Reporting models
Best when you have a known set of dashboards with specific needs.
- Prefix with rpt_ and name by purpose (e.g., rpt_revenue_daily)
- Keep logic readable; prefer intermediate models for reusable pieces
- Document the intended dashboard/report in the description
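A sketch of a reporting mart following these conventions; the column names are assumptions carried over from the earlier examples in this guide:

```sql
-- models/marts/rpt_revenue_daily.sql (illustrative)
-- Grain: one row per day per currency.
-- Intended consumer: the daily revenue dashboard.
select
    date_trunc('day', created_at_utc) as order_date,
    currency,
    count(*) as orders,
    sum(net_amount_cents) / 100.0 as net_revenue
from {{ ref('fct_orders') }}
group by 1, 2
```

Because it refs the fact table rather than raw sources, the metric definition stays in one place and this model remains a thin, readable aggregation.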
Step 5 — Make performance predictable (incremental + partition-aware design)
As data grows, performance becomes a design problem. dbt gives you the tools; you still need the habits: filter early, avoid huge fan-out joins, and align your fact tables with how they’re queried.
Performance checklist for large marts
- Use incremental for large append/update facts (with a correct incremental filter)
- Pick a stable unique_key and validate uniqueness with tests
- Aggregate early for reporting models (daily/weekly grain) when possible
- Prefer intermediate models for expensive joins reused by multiple marts
- Know your warehouse features (partitioning, clustering, distribution keys)
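What “partition/cluster-aware” looks like depends on the adapter. As one hedged example, a dbt-bigquery incremental fact can declare partitioning and clustering in its config; Snowflake, Redshift, and others use different options (clustering keys, dist/sort keys), and the names below are illustrative:

```sql
-- Illustrative dbt-bigquery config; adjust or remove for other adapters.
{{ config(
    materialized='incremental',
    unique_key='order_id',
    partition_by={'field': 'created_at_utc', 'data_type': 'timestamp', 'granularity': 'day'},
    cluster_by=['user_id']
) }}

select * from {{ ref('stg_orders') }}
{% if is_incremental() %}
where updated_at_utc > (select max(updated_at_utc) from {{ this }})
{% endif %}
```

Aligning the partition column with how dashboards filter (usually by date) is what actually makes queries cheap.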
Step 6 — Deploy safely: CI for PRs + scheduled jobs for production
“Clean dbt” isn’t only code structure—it’s also a release process that keeps breakage out of production. The simplest safe pattern: run a subset in pull requests (what changed + dependencies), then run full jobs on a schedule.
Example GitHub Actions workflow (PR checks)
This runs dbt build for modified models (plus downstream) using state comparison. State comparison needs the manifest.json from a previous production run, so the workflow must fetch that artifact into the runner before the build step. Store your warehouse credentials as repository secrets and configure your dbt profile via environment variables.
name: dbt-ci

on:
  pull_request:
    branches: [ "main" ]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dbt
        run: |
          python -m pip install --upgrade pip
          pip install dbt-core dbt-postgres

      - name: dbt deps
        run: dbt deps

      - name: dbt debug
        env:
          DBT_PROFILES_DIR: ./.dbt
          DBT_USER: ${{ secrets.DBT_USER }}
          DBT_PASSWORD: ${{ secrets.DBT_PASSWORD }}
          DBT_HOST: ${{ secrets.DBT_HOST }}
          DBT_DBNAME: ${{ secrets.DBT_DBNAME }}
        run: dbt debug

      # state:modified compares against a previous production run, so the
      # production manifest.json must be downloaded (e.g., from your artifact
      # store) into ./prod-artifacts before this step.
      - name: dbt build (state: modified)
        env:
          DBT_PROFILES_DIR: ./.dbt
          DBT_USER: ${{ secrets.DBT_USER }}
          DBT_PASSWORD: ${{ secrets.DBT_PASSWORD }}
          DBT_HOST: ${{ secrets.DBT_HOST }}
          DBT_DBNAME: ${{ secrets.DBT_DBNAME }}
        run: |
          dbt build --select state:modified+ --defer --state ./prod-artifacts
- Environment parity: dev/staging/prod should differ in data volume and credentials, not in “how dbt runs.”
- Release notes: when you change a mart’s meaning (not just columns), write it down in the PR description.
Step 7 — Keep the project clean over time (review standards)
Your dbt project stays clean if you enforce a few PR conventions. They’re lightweight, but they prevent “SQL sprawl.”
PR checklist (minimal but strict)
- New models have descriptions and basic tests
- Staging models do not contain business logic
- Columns are explicit; no SELECT *
- Model grain is stated (“one row per …”)
- Performance impact considered (joins, incremental needs)
When refactors happen
- Move shared logic into intermediate models
- Add relationships tests before changing join keys
- Deprecate columns with a comment window (don’t yank instantly)
- Use docs to announce meaning changes (especially for metrics)
Common mistakes
Most dbt “pain” comes from a small set of patterns. The good news: each has a simple fix that makes the project cleaner and more reliable.
Mistake 1 — Putting business logic in staging
Staging is for standardization. Business logic in staging spreads meaning everywhere and makes changes risky.
- Fix: keep staging boring: renames, casts, dedupe, basic filters.
- Fix: define metrics and “what things mean” in marts (facts/dims/reporting models).
Mistake 2 — Overusing SELECT *
It feels faster, but it hides schema drift and makes diffs unreadable.
- Fix: list columns explicitly (especially in staging and marts).
- Fix: set on_schema_change intentionally for incrementals (don’t let it surprise you).
Mistake 3 — Too many tests, too early
Teams add dozens of tests, then ignore CI because it’s noisy or slow.
- Fix: start with 5–10 high-signal tests (keys, relationships, core constraints).
- Fix: promote tests to CI only when they’re stable and valuable.
Mistake 4 — Incremental models without a safe filter
This produces tables that look correct but miss late updates (the worst kind of incorrect).
- Fix: pick a reliable watermark (updated_at, ingested_at) and reprocess a rolling window.
- Fix: test uniqueness of the incremental key and monitor row counts over time.
Mistake 5 — No grain documentation
If “one row per what?” is unclear, joins multiply rows and metrics explode.
- Fix: document grain in model descriptions (especially marts).
- Fix: add relationship tests and sanity checks for join paths.
Mistake 6 — CI runs everything on every PR
Full builds slow down development and encourage skipping checks.
- Fix: run only modified models + dependencies in PRs (state-based selection).
- Fix: run full jobs on schedule in production.
If you can’t confidently answer: “Where does this metric come from?” in under 60 seconds, your project needs clearer marts, better docs, or fewer duplicate definitions.
FAQ
What’s the difference between dbt and ETL tools?
dbt focuses on transformations inside your warehouse (the “T” in ELT). It doesn’t replace ingestion tools; it replaces the untracked SQL that usually grows after ingestion is solved.
Should staging models be views or tables?
Default to views for staging, then promote to tables when performance demands it. Views are great for iteration and clarity. If downstream chains of views get slow, convert the expensive layers (often intermediate/marts) to tables or incrementals.
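You can set these defaults once in dbt_project.yml instead of per-model configs; the project key here is a placeholder for your own project name:

```yaml
# dbt_project.yml (fragment)
models:
  my_project:              # replace with your project name
    staging:
      +materialized: view
    marts:
      +materialized: table
```

Individual models can still override the folder default with their own config() block when one mart needs an incremental build.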
When should I use incremental models?
Use incremental models for large tables where rebuilding is costly and the data has a reliable update pattern. You need a stable unique key and a correct incremental filter (often with a rolling window) to handle late-arriving updates.
Is dbt run enough, or should I use dbt build?
Use dbt build for the cleanest workflow. It runs models and associated tests (and can include seeds/snapshots) as a single pipeline, which encourages “only merge when green.”
How many tests should a dbt project have?
Enough to prevent incidents, not enough to create noise. Start with key integrity (not null/unique) and relationships. Add custom tests only for high-cost business rules. If CI becomes flaky, tests will be ignored—so keep signal high.
How do I keep dbt docs from becoming stale?
Document the “how to use this model” details right next to the model and enforce it in PR reviews. Focus docs on grain, keys, join paths, and tricky columns—plus link models to dashboards via exposures when you can.
What’s a good dbt deployment pattern for teams?
Use PR CI for modified models (plus dependencies), and scheduled production jobs for full refreshes. This keeps feedback fast during development while ensuring production is stable and reproducible.
Cheatsheet
A compact checklist for building clean dbt transformations, reviewing PRs, and deploying safely.
Project structure (default)
- sources declared first (raw inputs documented)
- models/staging: rename + cast + dedupe
- models/intermediate: reusable joins/transforms
- models/marts: facts/dims/reporting (business meaning)
- One “owner” per mart domain (even if informal)
High-signal tests (start here)
- not_null + unique on primary keys
- relationships from facts → dims
- accepted_values on critical enums
- One custom business rule test for high-cost failures
- Keep CI green: remove flaky tests or downgrade severity
Daily workflow (fast loop)
| Goal | dbt selection pattern | Why it’s useful |
|---|---|---|
| Build one model + deps | dbt build --select +my_model | Includes upstream parents so results are consistent |
| Validate a mart end-to-end | dbt build --select +fct_orders+ | Runs the mart plus its upstream parents and downstream dependents |
| PR safety check | dbt build --select state:modified+ | Only runs changed models plus dependencies |
| Docs sanity | dbt docs generate | Surfaces missing descriptions and broken refs |
Require every new mart model to include: (1) grain, (2) primary key tests, and (3) join guidance. That alone prevents most analytics confusion as teams scale.
Wrap-up
The “clean way” to build transformations in dbt is not about fancy macros or perfect folder trees—it’s about clear boundaries: staging standardizes, marts define meaning, and tests encode assumptions so changes don’t silently break downstream work.
Next actions (pick one)
- Refactor one messy model into staging → mart (remove business logic from staging)
- Add not_null/unique tests for your top 3 tables by importance
- Document grain + join paths for one mart that people frequently ask about
- Set up a PR check with state:modified+ so you catch breakage before merge
Want to go deeper? The related posts below pair well with this guide: choosing ETL vs ELT, writing scalable SQL, reading explain plans, and picking the right modeling approach for analytics.