Warehouse Cost Control: The Queries That Burn Money

Spot expensive patterns and keep usage predictable.

Reading time: ~8–12 min
Level: All levels

This guide shows the query shapes that quietly burn budget (full table scans, join explosions, runaway BI refreshes), how to find them fast, and how to put guardrails in place so cost stays boring.


Quickstart

If you only have an hour, do these in order. The goal is to turn “we think costs are high” into a short list of specific queries and specific fixes—then prevent repeats.

1) Confirm your billing unit

Warehouses charge in different ways (bytes scanned, compute time/credits, cluster seconds). Your “expensive query” detector should rank by the unit that drives your bill.

  • Locate the system table / query history view your warehouse provides
  • Identify the metric to sort by (credits, bytes, slot-ms, runtime)
  • Decide a simple threshold: “anything above X is review-worthy”

2) Build a “Top 20 cost queries” view

Don’t start by optimizing random queries. Start with the top offenders and fix the few that drive the majority of spend.

  • Group by query signature (same SQL shape, different parameters)
  • Include user, tool (BI/dbt), and schedule (hourly/daily)
  • Store results in a small table so you can trend over time

3) Kill the common “money burners” first

A few patterns show up everywhere. Fixing them is repeatable and usually doesn’t require a full redesign.

  • Replace SELECT * with a minimal projection
  • Add partition filters (date / shard / tenant) to enable pruning
  • Remove accidental many-to-many joins (join keys not unique)
  • Cap unbounded windows and huge DISTINCT operations

4) Add two guardrails (today)

Guardrails make cost predictable even when humans do human things (copy/paste, “quick” explorations, broken dashboards).

  • Set statement timeouts / max bytes scanned / resource monitors (vendor-specific)
  • Introduce a “heavy queries” role/warehouse and keep it separated
  • Add a daily alert when spend exceeds a baseline by N%

The 80/20 rule for warehouse cost control

Most teams can cut a meaningful chunk of spend by fixing 10–20 recurring query signatures (usually from dashboards, scheduled jobs, and “analysis notebooks”). Start there before you debate a new architecture.

Overview

“Warehouse cost control” isn’t about shaving milliseconds off every SQL statement. It’s about removing the patterns that make usage unpredictable: a dashboard refresh that scans terabytes, a join that multiplies rows by accident, or a model build that reprocesses history every run.

What this post covers

  • How to find the queries that burn money (and how to group them so you fix the root cause)
  • A mental model for cost: scan, compute, and concurrency
  • A practical optimization playbook for the most expensive SQL shapes
  • Guardrails that prevent regressions (budgets, limits, and workload separation)
  • Checklists you can copy into your team’s runbook

Where cost spikes usually come from

  • Over-scanning: bytes read jumps; same result, bigger bill. Typical fix: projection + partition filters + pruning.
  • Join explosion: rows processed >> rows returned. Typical fix: fix join keys, pre-aggregate, de-duplicate.
  • Repeat compute: the same expensive transformation runs many times. Typical fix: materialize, incrementalize, cache strategically.
  • Concurrency: dozens of queries run at once (BI refresh, backfills). Typical fix: scheduling, workload isolation, quotas.

Measure before you optimize

Warehouse optimization is easy to get wrong if you guess. Start with query history, rank by cost, and fix the top offenders. Most teams are surprised by what’s actually expensive.

Core concepts

To control warehouse spend, you need a simple mental model: cost is usually driven by some combination of data scanned, compute time, and concurrency. Almost every “money-burning query” is failing at least one of these.

1) The cost triangle: scan, compute, concurrency

Scan

How much data you read from storage (tables, partitions, micro-partitions, files).

  • Drives cost heavily in scan-priced systems
  • Highly sensitive to missing filters and SELECT *
  • Often improves with partitioning/clustering and good predicates

Compute

How much CPU/memory work the engine performs (joins, sorts, shuffles, UDFs).

  • Explodes with big joins, DISTINCT, and unbounded windows
  • Improves with better join keys, pre-aggregation, and materialization
  • Can be reduced by avoiding repeated transformation work

Concurrency (the silent multiplier)

Even efficient queries become expensive when they run too often or all at once. BI tools, scheduled backfills, and retries are common sources of accidental concurrency.

  • Schedule heavy jobs outside peak dashboard hours
  • Separate workloads (interactive vs batch)
  • Set quotas/limits so one team can’t drain the pool
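One cheap way to tame accidental concurrency is to stagger scheduled refreshes deterministically. A minimal sketch (the helper name and interval are illustrative, not any tool's API): hash the job name into a start-minute offset so identical schedules stop piling up at the top of the hour.

```python
import hashlib

def refresh_offset_minutes(job_name: str, interval_minutes: int = 60) -> int:
    """Deterministically spread job start times across the interval.

    Hypothetical helper: hashing the job name gives each job a stable
    offset, so schedules stop clustering at minute :00.
    """
    digest = hashlib.sha256(job_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % interval_minutes

# Two hourly dashboards that would both fire at :00 now start apart
# (with high probability), and each job always gets the same minute.
print(refresh_offset_minutes("revenue_dashboard"))
print(refresh_offset_minutes("ops_dashboard"))
```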

2) Query signature beats “top queries”

A single expensive query is interesting. A query that runs 1,000 times a day is a budget line item. That’s why you should group by query signature: the same SQL shape with different parameters.

How to think about “signature”

  • Normalize whitespace and literals (dates/IDs) into placeholders
  • Group by normalized SQL + tool (BI vs dbt vs notebook)
  • Track count, p95 runtime, and total cost over a window
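As a rough sketch, signature normalization can be as small as a few regex passes. Real implementations handle dialect quirks, comments, and IN-lists, but the idea is the same:

```python
import re

def query_signature(sql: str) -> str:
    """Collapse whitespace and replace literals with placeholders so
    queries that differ only in parameters share one signature.
    Illustrative normalization; production rules vary by dialect."""
    s = sql.strip()
    s = re.sub(r"'[^']*'", "'?'", s)        # string/date literals
    s = re.sub(r"\b\d+(\.\d+)?\b", "?", s)  # numeric literals
    s = re.sub(r"\s+", " ", s)              # collapse whitespace
    return s.upper()                        # case-insensitive grouping

# Same shape, different parameters -> same signature.
q1 = "select * from fact_events where event_date = '2026-01-01' and tenant_id = 42"
q2 = "SELECT *  FROM fact_events WHERE event_date = '2026-02-15' AND tenant_id = 7"
print(query_signature(q1) == query_signature(q2))  # True
```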

3) Pruning: the warehouse’s superpower

Modern warehouses avoid reading data they can prove you don’t need. This is called pruning. It only works if your query gives the engine a strong clue (usually a filter on a partition key or clustering key).

Pruning-friendly predicates

  • WHERE event_date BETWEEN '2026-01-01' AND '2026-01-31'
  • WHERE tenant_id = ? (if data is clustered by tenant)
  • WHERE created_at >= CURRENT_DATE - INTERVAL '7' DAY

Pruning killers

  • Wrapping the partition column in a function (engine can’t prune)
  • Filtering after a big join instead of before (too late)
  • “Wildcard” date ranges (e.g., last 5 years) for dashboards

4) Materialization: stop paying for the same work twice

If a complex transformation is reused (dashboards, many downstream models, repeated ad-hoc analysis), you should consider materializing it: a table, incremental model, or materialized view. The trick is to materialize the stable expensive part, not everything.

Optimization isn’t only SQL

Many “expensive queries” are expensive because they run too often. Fixing refresh schedules, caching, and workload isolation can cut spend without touching a single JOIN.

Step-by-step

This section is a practical workflow you can repeat monthly. It turns cost control into a routine: observe → rank → fix → guardrail → verify.

Step 1 — Make cost visible (query history → a small mart)

Your warehouse already logs query execution. The fastest win is to capture those logs into a small table (a “cost mart”) that your team can query like any other dataset.

Starter SQL: top query signatures by spend (generic shape)

Adjust table/column names for your warehouse. The important part is: group by a normalized query text (signature), and rank by a cost metric you trust.

-- Create a daily rollup of expensive query signatures (starter pattern).
-- Replace warehouse_query_history with your system view/table.
-- Replace "cost_units" with credits/bytes/slot_ms/etc. depending on your platform.

WITH base AS (
  SELECT
    DATE_TRUNC('day', started_at) AS day,
    user_name,
    tool_name,
    -- Very simple normalization: trim + collapse whitespace.
    -- In practice, you may also replace literals (dates/ids) with placeholders.
    REGEXP_REPLACE(TRIM(query_text), '\\s+', ' ') AS query_signature,
    total_runtime_ms,
    cost_units
  FROM warehouse_query_history
  WHERE started_at >= CURRENT_DATE - INTERVAL '14' DAY
),
agg AS (
  SELECT
    day,
    tool_name,
    query_signature,
    COUNT(*) AS runs,
    SUM(cost_units) AS total_cost_units,
    AVG(cost_units) AS avg_cost_units,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_runtime_ms) AS p95_runtime_ms
  FROM base
  GROUP BY 1,2,3
)
SELECT *
FROM agg
ORDER BY total_cost_units DESC
LIMIT 20;

What to save in your “cost mart”

  • Signature (normalized query), tool, user, role/warehouse
  • Runs, total cost, avg cost, p95 runtime
  • Rows scanned/processed (if available), plus rows returned
  • Error/timeout counts (retries can be expensive)
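If you want to prototype the mart shape before wiring it to your warehouse, a throwaway sqlite sketch is enough; every name here is illustrative:

```python
import sqlite3

# Minimal "cost mart" schema sketch (column names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE cost_mart (
    day TEXT NOT NULL,
    query_signature TEXT NOT NULL,
    tool_name TEXT NOT NULL,
    user_name TEXT,
    runs INTEGER NOT NULL,
    total_cost_units REAL NOT NULL,
    avg_cost_units REAL,
    p95_runtime_ms REAL,
    rows_scanned INTEGER,
    rows_returned INTEGER,
    error_count INTEGER DEFAULT 0,
    PRIMARY KEY (day, query_signature, tool_name)
)
""")
conn.execute(
    "INSERT INTO cost_mart VALUES (?,?,?,?,?,?,?,?,?,?,?)",
    ("2026-01-15", "SELECT ... FROM FACT_EVENTS ...", "bi", "alice",
     480, 1250.0, 2.6, 900.0, 10_000_000, 1200, 3),
)
top = conn.execute(
    "SELECT query_signature, total_cost_units FROM cost_mart "
    "ORDER BY total_cost_units DESC LIMIT 20"
).fetchall()
print(top)
```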

Step 2 — Triage: classify each offender by “why it’s expensive”

For each top signature, put it into a bucket. This avoids “random fixes” and points you at the highest-leverage change.

  • Over-scan: huge bytes read, low rows returned. First fix: project fewer columns; add a partition filter; prune early.
  • Join explosion: rows processed extremely high vs the base tables. First fix: check key uniqueness; pre-aggregate; de-duplicate.
  • Sort/shuffle heavy: big ORDER BY, DISTINCT, or wide grouping. First fix: reduce columns; aggregate earlier; avoid global sorts.
  • Repeat compute: the same query runs constantly (dashboards/jobs). First fix: materialize; cache; schedule; isolate the workload.
  • Concurrency spike: many runs start in the same minute. First fix: stagger refreshes; throttle; separate warehouses/queues.
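The triage buckets can even be encoded as a first-pass heuristic. The thresholds below are invented for illustration and should be calibrated against your own query history:

```python
def classify_offender(stats: dict) -> str:
    """Rough triage heuristic for a query signature's stats.
    Thresholds are illustrative, not a standard."""
    if stats["runs_per_day"] >= 100:
        return "repeat compute / concurrency"
    if stats["rows_processed"] > 100 * max(stats["rows_returned"], 1):
        return "join explosion"   # intermediate rows dwarf the output
    if stats["bytes_scanned"] > 50 * max(stats["bytes_returned"], 1):
        return "over-scan"        # reads far more than it returns
    return "sort/shuffle or other"

# A query that processes 5M intermediate rows to return 200 is a
# classic join-explosion suspect.
bucket = classify_offender({"runs_per_day": 5, "rows_processed": 5_000_000,
                            "rows_returned": 200, "bytes_scanned": 0,
                            "bytes_returned": 0})
print(bucket)  # join explosion
```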

Step 3 — Fix the expensive patterns (playbook)

Here are the repeatable fixes that show up across warehouses. Apply them to your top offenders first, then re-measure.

A) Stop scanning what you don’t use (projection + predicate pruning)

  • Replace SELECT * with only required columns (especially wide event tables)
  • Filter on partition keys (date, tenant, shard) as early as possible
  • Push predicates into CTEs/subqueries that feed joins (prune before join)
  • Avoid functions on partition columns in WHERE (pruning killer)

Example: refactor an over-scanning dashboard query

This is the most common cost burn: a “simple” dashboard query scanning a full fact table because it doesn’t filter on the partition key and pulls every column.

-- BEFORE: wide scan + late filtering
-- SELECT * plus no partition pruning makes this expensive on large tables.

SELECT
  *
FROM fact_events
WHERE DATE(created_at) >= CURRENT_DATE - INTERVAL '30' DAY
  AND event_name IN ('purchase', 'refund');

-- AFTER: prune first + project only needed columns + avoid function on partition key
WITH filtered AS (
  SELECT
    event_date,          -- assume this is a partition column
    user_id,
    order_id,
    event_name,
    revenue_usd
  FROM fact_events
  WHERE event_date >= CURRENT_DATE - INTERVAL '30' DAY
    AND event_name IN ('purchase', 'refund')
)
SELECT
  event_date,
  event_name,
  COUNT(*) AS events,
  COUNT(DISTINCT user_id) AS users,
  SUM(revenue_usd) AS revenue_usd
FROM filtered
GROUP BY 1,2
ORDER BY event_date DESC;

B) Prevent join explosions (cardinality control)

Join explosions happen when your join keys aren’t unique on at least one side (or when you accidentally join fact-to-fact). The result: the engine processes a huge intermediate table even if the final result is small.

Red flags

  • Rows processed is orders of magnitude larger than inputs
  • Adding a JOIN increases runtime/cost drastically
  • The query “works” but totals look too high (double-counting)
  • Someone “fixed” it with DISTINCT (usually a smell)

Fixes that actually stick

  • Enforce uniqueness in dimension tables (one row per key)
  • Pre-aggregate fact tables before joining
  • Use bridge tables intentionally (many-to-many is valid, but explicit)
  • Add constraints/tests in transformation layer (e.g., uniqueness tests)
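The mechanics are easy to see in miniature. In this toy sketch (pure Python standing in for SQL), a duplicate dimension key both multiplies rows and double-counts revenue, and enforcing one row per key fixes both:

```python
# Facts: one row per order.
facts = [{"order_id": 1, "cust": "A", "amount": 10.0},
         {"order_id": 2, "cust": "A", "amount": 20.0}]

# BROKEN dimension: customer "A" appears twice (e.g. two address versions).
dim_bad = [{"cust": "A", "region": "EU"}, {"cust": "A", "region": "US"}]

# Naive join: every fact matches both dimension rows (requires Python 3.9+ for |).
joined = [f | d for f in facts for d in dim_bad if f["cust"] == d["cust"]]
print(len(joined))                        # 4 rows from 2 facts: the explosion
print(sum(r["amount"] for r in joined))   # 60.0 -- revenue double-counted

# FIX: enforce one row per key before joining (here: last version wins).
dim_ok = {d["cust"]: d for d in dim_bad}
joined_ok = [f | dim_ok[f["cust"]] for f in facts if f["cust"] in dim_ok]
print(sum(r["amount"] for r in joined_ok))  # 30.0 -- totals correct again
```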

C) Reduce repeat compute (materialize and schedule)

If the same expensive query runs repeatedly (dashboards or downstream models), materializing can be cheaper than re-computing. Choose materialization when the query is stable, reused, and the output size is reasonable.

A simple decision rule

If a query is in your top 20 by total cost and runs multiple times per day, it’s a candidate for materialization or caching. If it runs once a month, focus on correctness and safety first.
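That decision rule is simple enough to encode directly; the exact thresholds here are illustrative, not prescriptive:

```python
def should_materialize(cost_rank: int, runs_per_day: float,
                       output_rows: int, input_rows: int) -> bool:
    """Encode the rule of thumb: top-20 by total cost, runs several
    times a day, and the output is much smaller than the input it scans.
    All cutoffs are illustrative."""
    return (cost_rank <= 20
            and runs_per_day >= 3
            and output_rows * 100 <= input_rows)

# A top-5 dashboard query refreshed every 30 minutes: strong candidate.
print(should_materialize(cost_rank=4, runs_per_day=48,
                         output_rows=5_000, input_rows=2_000_000))  # True

# Same query run once a month: fix correctness first, not cost.
print(should_materialize(cost_rank=4, runs_per_day=0.03,
                         output_rows=5_000, input_rows=2_000_000))  # False
```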

Step 4 — Add guardrails: predictable cost beats heroic debugging

Guardrails don’t replace optimization—they prevent regressions and protect you from “one bad query” incidents. The best guardrails are visible, simple, and tied to ownership.

Operational guardrails

  • Statement timeouts (interactive defaults)
  • Per-user / per-role quotas for heavy workloads
  • Separate compute pools: interactive vs batch vs backfills
  • Retry limits for scheduled jobs (avoid infinite failure loops)

Governance guardrails

  • Tag queries/jobs by team or cost center
  • Weekly “top offenders” review (30 minutes)
  • Budgets and alerts (daily/weekly spend vs baseline)
  • Require code review for scheduled queries over a threshold

A small daily spend alert (example script)

Use this pattern to notify a channel when spend deviates from normal. Replace the SQL with your warehouse’s system tables, and replace the webhook with your preferred notifier. Keep it simple: baseline + threshold + top offenders.

import os
import json
import datetime as dt
import urllib.request

# Environment variables you define in your scheduler:
# WAREHOUSE_DSN, ALERT_WEBHOOK_URL
#
# Pseudocode: replace run_query() with your warehouse client.
def run_query(sql: str):
  # return list[dict]
  raise NotImplementedError("Connect your warehouse client here.")

def post_webhook(url: str, payload: dict):
  data = json.dumps(payload).encode("utf-8")
  req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
  urllib.request.urlopen(req, timeout=10).read()

today = dt.date.today()
yday = today - dt.timedelta(days=1)
window_start = today - dt.timedelta(days=14)

# 1) Pull yesterday spend + baseline (simple rolling average)
spend_sql = f"""
WITH daily AS (
  SELECT
    DATE_TRUNC('day', started_at) AS day,
    SUM(cost_units) AS spend
  FROM warehouse_query_history
  WHERE started_at >= DATE '{window_start.isoformat()}'
    AND started_at <  DATE '{today.isoformat()}'
  GROUP BY 1
)
SELECT
  (SELECT spend FROM daily WHERE day = DATE '{yday.isoformat()}') AS spend_yday,
  (SELECT AVG(spend) FROM daily WHERE day < DATE '{yday.isoformat()}') AS baseline_avg
"""

rows = run_query(spend_sql)
spend_yday = float(rows[0]["spend_yday"] or 0.0)
baseline_avg = float(rows[0]["baseline_avg"] or 0.0)

# 2) Threshold check (tune to taste)
threshold_multiplier = float(os.getenv("ALERT_THRESHOLD_MULTIPLIER", "1.5"))
should_alert = baseline_avg > 0 and spend_yday > baseline_avg * threshold_multiplier

if should_alert:
  top_sql = f"""
  SELECT
    tool_name,
    REGEXP_REPLACE(TRIM(query_text), '\\s+', ' ') AS query_signature,
    COUNT(*) AS runs,
    SUM(cost_units) AS total_cost_units
  FROM warehouse_query_history
  WHERE started_at >= DATE '{yday.isoformat()}'
    AND started_at <  DATE '{today.isoformat()}'
  GROUP BY 1,2
  ORDER BY total_cost_units DESC
  LIMIT 5
  """
  top = run_query(top_sql)

  message_lines = [
    f"Warehouse spend alert for {yday.isoformat()}",
    f"Spend: {spend_yday:.2f} (baseline avg: {baseline_avg:.2f})",
    "",
    "Top offenders:",
  ]
  for i, r in enumerate(top, 1):
    message_lines.append(f"{i}) [{r['tool_name']}] {r['total_cost_units']:.2f} units · {r['runs']} runs")
    message_lines.append(f"   {r['query_signature'][:140]}")

  payload = {"text": "\n".join(message_lines)}
  post_webhook(os.environ["ALERT_WEBHOOK_URL"], payload)

Don’t alert on “total spend” only

Total spend is useful, but it’s not actionable by itself. Always include the top offenders list and the owner signal (tool/user/team) so the alert leads to a fix—not a blame session.

Step 5 — Verify and prevent regressions

  • Re-run the cost ranking after each fix (did the signature drop?)
  • Capture before/after: runtime, bytes, and total cost over 7 days
  • Add a test or review rule for the pattern you fixed (so it doesn’t return)
  • Document the decision: what changed, why, and who owns it
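The before/after comparison doesn't need tooling; a tiny helper over two 7-day windows of your cost mart is enough (the daily numbers below are invented):

```python
def cost_change(before: list[float], after: list[float]) -> float:
    """Percent change in average daily cost for one query signature,
    comparing a 7-day window before the fix with one after."""
    avg_before = sum(before) / len(before)
    avg_after = sum(after) / len(after)
    return (avg_after - avg_before) / avg_before * 100.0

# Hypothetical daily cost_units for one signature, 7 days each side.
before = [120, 115, 130, 118, 125, 122, 119]
after = [40, 38, 45, 41, 39, 42, 44]
print(f"{cost_change(before, after):+.1f}%")
```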

Common mistakes

These are the patterns behind “the warehouse bill jumped” and “the dashboard is slow”. The fixes are usually boring—and that’s a good thing.

Mistake 1 — Treating SELECT * as harmless

On wide tables, SELECT * pulls columns you don’t need, increases scan, and blocks some optimizations. It also makes schema changes riskier.

  • Fix: select only required columns; create curated “analytics views” with stable schemas.
  • Fix: add a lint rule or review checklist for production SQL.

Mistake 2 — Filtering on the wrong thing (no pruning)

Filtering on DATE(created_at) looks correct but often prevents pruning on a partition column.

  • Fix: filter directly on the partition column (e.g., event_date).
  • Fix: push filters into early CTEs before joins.

Mistake 3 — Accidental many-to-many joins

This is the fastest way to create huge intermediate datasets and wrong totals. It’s also hard to notice until cost spikes.

  • Fix: validate uniqueness of join keys (dimensions should be 1 row per key).
  • Fix: pre-aggregate facts before joining; avoid joining fact-to-fact unless intentional.

Mistake 4 — Using DISTINCT as a “band-aid”

DISTINCT can hide a modeling problem and often forces expensive sorts/shuffles.

  • Fix: fix the join duplication instead of de-duplicating late.
  • Fix: if you must de-duplicate, do it in a narrow, pre-filtered subquery.
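If de-duplication really is needed, keep it narrow and explicit. This toy sketch is the Python analogue of the common keep-the-newest-row-per-key pattern (ROW_NUMBER() OVER (PARTITION BY key ORDER BY ts DESC) = 1 in SQL):

```python
# De-duplicate a small, pre-filtered set: keep only the newest row per order.
rows = [
    {"order_id": 1, "ts": "2026-01-01", "status": "created"},
    {"order_id": 1, "ts": "2026-01-02", "status": "paid"},
    {"order_id": 2, "ts": "2026-01-01", "status": "created"},
]

latest: dict[int, dict] = {}
for r in sorted(rows, key=lambda r: r["ts"]):
    latest[r["order_id"]] = r  # later timestamps overwrite earlier ones

deduped = list(latest.values())
print(len(deduped))           # 2
print(latest[1]["status"])    # paid
```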

Mistake 5 — Unbounded windows and global sorts

Window functions are powerful, but unbounded frames and global ORDER BY over huge tables can be very costly.

  • Fix: limit the time range; pre-aggregate; restrict window frames.
  • Fix: only sort at the end, and only the rows you’ll actually display.

Mistake 6 — Letting dashboards refresh “as fast as possible”

A 5-minute refresh on a heavy model is rarely worth the cost. BI refresh schedules are a common hidden multiplier.

  • Fix: stagger refreshes; increase intervals; cache stable layers.
  • Fix: build a single “serving table” for dashboards instead of recomputing raw logic.

If you’re unsure where to start

Find the top query signatures by total cost over 14 days. Then ask two questions: (1) can we read less? (2) can we run this less often? Those two fixes cover most real-world savings.

FAQ

How do I find my most expensive queries?

Use your warehouse’s query history tables/views and rank by the metric your bill uses (credits, bytes scanned, slot-ms, runtime). Group by query signature (normalized SQL), not by raw query ID, so you can fix repeated patterns instead of one-offs.

What’s the fastest optimization that usually saves money?

Add pruning-friendly filters (partition keys) and remove SELECT *. These reduce data scanned and often reduce compute too. If a query is repeated (dashboards/jobs), the next fastest win is to run it less often or materialize the expensive layer.

Partitioning vs clustering: which one helps cost more?

Partitioning helps most when your queries reliably filter by the partition key (often date). Clustering helps when you filter by other columns (tenant, user, status) and want better pruning within partitions. In practice: partition by the most common “time slice”, and cluster by the most common selective filters you use in WHERE and joins.

Is caching a real cost strategy or just a performance trick?

Caching can reduce repeated compute, but it’s not a guaranteed control mechanism by itself because cache hit rates depend on query shape, parameters, and timing. Treat caching as a bonus. For predictable cost, prefer materialization + scheduling + workload isolation.

When should I materialize a query (table/view/materialized view)?

Materialize when the query is expensive, stable, reused, and the output is much smaller than the input. A good sign is a query that appears in your top offenders list and runs many times per day (dashboards, downstream models).

How do we set limits without blocking legitimate analysis?

Separate workloads: keep interactive exploration on a “default” pool with reasonable timeouts, and create an explicit “heavy” pool for big backfills/one-off analysis with approvals. Add clear error messages and a path forward (“run this in batch warehouse”), and you’ll reduce both cost and frustration.

What metrics should we monitor weekly for warehouse cost control?

Track total spend, spend by team/tool, top query signatures, p95 runtime for recurring queries, and concurrency peaks. Weekly review should result in a short list: “fix these 3 signatures” and “adjust these 2 schedules/guardrails.”

Cheatsheet

A scan-fast checklist for warehouse cost control and the queries that burn money.

Detect: find the money burners

  • Rank by cost metric your bill uses (credits/bytes/runtime)
  • Group by query signature (normalize SQL)
  • Capture tool + user + schedule (BI/dbt/notebook)
  • Look for repeated queries and concurrency spikes
  • Record before/after metrics for every fix

Diagnose: why is it expensive?

  • Over-scan: big bytes read; tiny output
  • Join explosion: intermediate rows explode
  • Sort/shuffle: DISTINCT, wide group-bys, global ORDER BY
  • Repeat compute: expensive query runs constantly
  • Concurrency: same query starts many times per minute

Fix: the highest-leverage moves

  • Project only needed columns (no SELECT *)
  • Filter on partition keys (enable pruning)
  • Push filters before joins (prune early)
  • Validate join key uniqueness; pre-aggregate facts
  • Materialize reused heavy layers; incrementalize backfills
  • Reduce refresh frequency; stagger schedules

Prevent: guardrails for predictable spend

  • Timeouts / quotas / resource monitors (vendor-specific)
  • Separate interactive vs batch compute pools
  • Daily baseline alert + top offenders list
  • Ownership: tags by team/cost center
  • Weekly “top signatures” review (30 minutes)

Print this mental model

If a query is expensive, it’s usually because it reads too much, computes too much, or runs too often. Fixing any one of those is a win. Fixing two is how budgets become predictable.

Wrap-up

Warehouse cost control is mostly about routine: measure what’s expensive, fix the patterns that repeat, and add guardrails so cost stays stable as the team and data grow. If you implement just one thing from this post, make it a weekly “top query signatures” review with clear ownership and a short optimization backlog.

A simple next-week plan

  • Build a 14-day “top signatures” report from query history
  • Pick the top 5 and label each: over-scan / join explosion / repeat compute / concurrency
  • Apply one fix per query (small, surgical changes)
  • Add one guardrail: timeout/quota or workload separation
  • Re-measure after 7 days and document the before/after

Keep optimization boring

The best outcome is not a heroic rewrite. It’s a warehouse where the bill makes sense, the dashboards are stable, and “runaway queries” are rare because the system nudges people toward safe defaults.

Quiz

Quick self-check.

1) In warehouse cost control, what most often causes a “surprise” bill spike from a single query?
2) Your warehouse spend jumped yesterday. What’s the best first move?
3) Which pattern is a classic sign of a join explosion?
4) For pruning to reduce scan cost, what usually must be true?