Change Data Capture (CDC) is how you keep systems in sync by shipping only what changed—inserts, updates, and deletes—instead of rerunning expensive “full reloads.” This post explains the practical CDC options, the failure modes teams hit in production, and the design patterns that make CDC pipelines predictable: safe cutovers, ordering rules, idempotent sinks, and schema-change strategies.
Quickstart
If you’re setting up CDC (or debugging one that’s flaky), start here. These steps are the highest-impact decisions that determine whether your pipeline behaves like a reliable stream… or like a nightly lottery.
1) Decide your “source of truth” and latency target
CDC isn’t only a technical choice. It’s a product contract: “How fresh must data be?” and “What happens when we’re behind?”
- Define freshness SLO (e.g., p95 lag < 2 minutes)
- Decide which DB/table is authoritative (and for which columns)
- Define failure behavior: pause, degrade, or fall back to batch
2) Pick a CDC approach (log > triggers > timestamps)
Log-based CDC is usually the cleanest because it captures committed transactions. Triggers and timestamp polling can work—but require stricter discipline.
- Log-based: read WAL/binlog/redo logs (best fidelity)
- Trigger-based: write to an “outbox” table (good for app-owned pipelines)
- Timestamp-based: poll updated_at (simple, but easiest to get subtly wrong)
3) Make the sink idempotent
Most CDC delivery is at-least-once. That means duplicates happen. Your sink should treat duplicates as harmless.
- Use upserts by primary key (not inserts-only)
- Deduplicate by a monotonic position (LSN/offset) or event_id
- Handle deletes explicitly (tombstones or a deleted flag)
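These rules can be sketched as a toy in-memory sink (illustrative only; the event shape and field names here are assumptions, not any specific connector's format). Duplicates and replays are harmless because each key remembers the last applied position:

```python
def apply_event(state: dict, event: dict) -> None:
    """Upsert/delete by key; ignore events at or below the stored position."""
    key, pos, op = event["key"], event["pos"], event["op"]
    current = state.get(key)
    if current is not None and pos <= current["pos"]:
        return  # duplicate or older event: skip
    if op == "d":
        # Tombstone: remember the delete so late-arriving older updates can't resurrect the row
        state[key] = {"pos": pos, "row": None, "deleted": True}
    else:
        state[key] = {"pos": pos, "row": event["after"], "deleted": False}

state: dict = {}
e = {"key": 42, "pos": 7, "op": "u", "after": {"status": "paid"}}
apply_event(state, e)
apply_event(state, e)  # retried delivery: no effect
```

A real sink stores the position in the target table (as in Step 4 below), but the decision logic is exactly this small.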
4) Use the safe cutover sequence
The common trap is starting CDC “from now” and quietly losing earlier changes. Cutovers must be consistent.
- Backfill snapshot to target
- Start CDC from a consistent point
- Apply changes until caught up, validate, then switch readers
A fast decision table
| Approach | Best for | Main risk | Rule to stay safe |
|---|---|---|---|
| Log-based CDC | Warehouses, replicas, search indexes, microservice sync | Operational setup (replication slots, retention) | Monitor lag + ensure logs retained while consumers are behind |
| Outbox (triggers/app writes) | App-owned events, reliable integration, audit trails | Outbox growth + backpressure | Use partitioning + cleanup + a strict “processed position” checkpoint |
| Timestamp polling | Small systems, low-change tables, “good enough” sync | Missed rows (ties/time drift) + expensive scans | Use a watermark with a tie-breaker (updated_at + id) and re-read a safety window |
CDC is not “copy rows.” It’s “replay a history.” The big wins come from treating changes as an ordered stream with checkpoints, not as a bunch of one-off jobs.
Overview
Full reloads are the default when teams start: “We’ll rebuild the table every night.” That works until it doesn’t: tables get bigger, SLAs get tighter, and downstream systems want near-real-time updates. CDC replaces reloads with a continuous flow of committed changes.
When CDC is worth it
- You need fresh analytics (minutes, not hours)
- Full reloads are costly (time, compute, locks)
- You replicate data into caches/search/services
- You need reliable “what changed” auditing
When you can skip CDC
- Tables are small and reloads are cheap
- Downstream consumers tolerate daily staleness
- You only need aggregated facts (batch ETL is fine)
- Your schema changes constantly and you can’t version contracts yet
What this post covers
- CDC options: log-based, outbox/triggers, and timestamp polling
- Correctness rules: ordering, duplicates, deletes, and “exactly-once” reality
- Cutover patterns: how to go from batch to CDC without losing updates
- Operations: lag monitoring, retention, backpressure, and schema evolution
Most CDC issues are not “bad tools.” They’re missing contracts: no clear checkpoint, no idempotency, and no plan for deletes or schema changes. Fix those, and your pipeline becomes boring—in a good way.
Core concepts
CDC can look complex because people mix together three different problems: capturing changes, transporting changes, and applying changes. If you separate them, the design becomes much clearer.
1) Change events
A CDC pipeline turns row-level changes into events.
- Key: the primary key (or stable business key)
- Operation: insert/update/delete
- Payload: new values (and sometimes “before” values)
- Position: an ordering token (LSN/offset/SCN)
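A change event with those four parts might look like this (a hypothetical envelope; real connectors such as Debezium use their own field names for before/after/op/source):

```python
# One row-level change, expressed as an event.
event = {
    "key": {"order_id": 42},             # primary key (or stable business key)
    "op": "u",                           # c = insert, u = update, d = delete
    "before": {"status": "pending"},     # prior values (optional, connector-dependent)
    "after": {"status": "paid"},         # new values
    "pos": "00000016/000000A0",          # ordering token (a Postgres LSN here)
}
```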
2) Checkpoints (a.k.a. “where are we?”)
Checkpoints are what make recovery possible.
- Store the last processed position per stream/partition
- Make checkpoints durable (transactional if possible)
- Be able to reprocess from a prior checkpoint safely
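The "transactional if possible" point is worth making concrete. A minimal sketch using SQLite (table and stream names are hypothetical; a real pipeline would use the sink database): the data write and the checkpoint advance commit in the same transaction, so a crash can never leave the checkpoint ahead of the sink.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_current (order_id INTEGER PRIMARY KEY, status TEXT, pos TEXT)")
conn.execute("CREATE TABLE checkpoints (stream_id TEXT PRIMARY KEY, last_pos TEXT)")

def apply_and_checkpoint(conn, stream_id, order_id, status, pos):
    with conn:  # one transaction: both writes commit together or not at all
        conn.execute(
            "INSERT INTO orders_current(order_id, status, pos) VALUES (?, ?, ?) "
            "ON CONFLICT(order_id) DO UPDATE SET status = excluded.status, pos = excluded.pos "
            "WHERE excluded.pos > orders_current.pos",  # only accept newer positions
            (order_id, status, pos),
        )
        conn.execute(
            "INSERT INTO checkpoints(stream_id, last_pos) VALUES (?, ?) "
            "ON CONFLICT(stream_id) DO UPDATE SET last_pos = excluded.last_pos "
            "WHERE excluded.last_pos > checkpoints.last_pos",  # never regress on replay
            (stream_id, pos),
        )

apply_and_checkpoint(conn, "orders", 1, "paid", "0002")
```

Replaying an older position is now a no-op on both tables, which is exactly what makes restarts safe.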
CDC approaches (and why they behave differently)
| Approach | How it captures changes | Strengths | Weak spots |
|---|---|---|---|
| Log-based | Reads the database transaction log after commit | Best fidelity, low app impact, captures deletes well | Requires DB configuration, log retention, and offsets |
| Triggers / outbox | Writes “change rows” into an outbox table | App-owned contract, easy to add metadata, audit-friendly | Triggers add write overhead; outbox can grow fast |
| Timestamp polling | Queries rows where updated_at > last watermark | Simple to build, no DB log access needed | Misses ties, heavy scans, can’t reliably detect deletes without extra work |
Ordering, duplicates, and “exactly once”
CDC delivery is typically at-least-once (messages can be retried). “Exactly once” is achievable only with extra constraints: deterministic ordering, transactional writes to the sink, and a checkpoint that advances atomically with the sink update.
What you can rely on
- Events come with a position/offset you can store
- You can replay events after failures
- You can make duplicates harmless with idempotency
What you should not assume
- Global ordering across all tables/partitions
- Exactly-once delivery “for free”
- That deletes will magically propagate without design
If you apply changes out of order (or with a weak watermark), you can overwrite newer data with older updates. The fix is to carry a monotonic position (LSN/offset) into the sink and only accept newer positions per key.
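The position guard is a few lines of logic (a toy example with illustrative field names): without it, the retried older event below would overwrite the newer state.

```python
# Out-of-order arrival: pos 9 lands before a retried pos 5.
events = [
    {"key": 1, "pos": 9, "after": {"status": "shipped"}},
    {"key": 1, "pos": 5, "after": {"status": "paid"}},  # stale retry
]

state = {}
for e in events:
    current = state.get(e["key"])
    if current is None or e["pos"] > current["pos"]:  # accept only strictly newer
        state[e["key"]] = e
# Last-write-wins by *arrival* would end at "paid"; the guard keeps "shipped".
```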
Deletes: tombstones vs soft deletes
CDC pipelines often “look correct” until the first real deletion. Choose an approach up front:
- Tombstone: emit an event indicating the key is deleted and remove it downstream
- Soft delete: keep the row but set deleted=true (useful for auditing or GDPR workflows)
- Type 2 history: keep versions with valid_from/valid_to (best for analytics history)
Step-by-step
This guide is tool-agnostic: the steps are the same whether you use a managed service, a connector framework, or a custom pipeline. The goal is a pipeline you can restart, replay, and reason about.
Step 1 — Define the contract (the “truth” you want downstream)
- Tables: which tables are replicated (and which are out of scope)
- Keys: stable primary keys; avoid mutable business keys
- Semantics: do you need “current state” or full history?
- Deletes: hard delete, soft delete, or tombstones?
- Lag budget: what’s acceptable when the pipeline is behind?
Step 2 — Backfill a snapshot (get the baseline into the sink)
CDC streams changes, not entire state. You still need an initial baseline. Keep it boring: bulk-load the current state into the sink, record the cutover point, and don’t mix “random snapshots” with “live changes.”
Snapshot checklist
- Export from the source with a consistent view (transaction or snapshot mode)
- Load into a staging table first (then swap/merge)
- Capture the stream start position (LSN/offset) for the next step
Common cutover mistake
Starting the stream “from now” after the snapshot means you can miss changes that happened during the snapshot. The fix: snapshot + stream must share a consistent boundary.
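A toy simulation makes the boundary concrete (the change log and values here are invented): the snapshot reflects everything up to a recorded start position, and streaming resumes from exactly that position, so a change that lands mid-snapshot is replayed rather than lost.

```python
# A totally ordered change log: (position, change).
log = [
    (1, {"id": "a", "v": 1}),
    (2, {"id": "b", "v": 1}),
    (3, {"id": "a", "v": 2}),  # happens while the snapshot is loading
]

start_pos = 2                # stream position captured with the snapshot
snapshot = {"a": 1, "b": 1}  # consistent as of start_pos

sink = dict(snapshot)
for pos, change in log:
    if pos > start_pos:      # replay only what the snapshot could not see
        sink[change["id"]] = change["v"]
# sink == {"a": 2, "b": 1}: the mid-snapshot update survives
```

Starting "from now" instead would mean start_pos = 3, and the update to "a" would vanish.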
Step 3 — Start streaming changes (log-based example)
In log-based CDC, a connector reads committed changes from the database log and emits structured change events to a transport (often a message broker). Even if you use a managed tool, you should understand the moving parts: connection, checkpoint storage, and how schema changes are represented.
```json
{
  "name": "cdc-postgres-orders",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres.internal",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "${CDC_PASSWORD}",
    "database.dbname": "appdb",
    "database.server.name": "appdb",
    "slot.name": "debezium_orders_slot",
    "publication.autocreate.mode": "filtered",
    "schema.include.list": "public",
    "table.include.list": "public.orders,public.order_items",
    "plugin.name": "pgoutput",
    "snapshot.mode": "initial",
    "tombstones.on.delete": "true",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "false",
    "value.converter.schemas.enable": "false",
    "topic.prefix": "cdc",
    "heartbeat.interval.ms": "10000",
    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "cdc.dlq"
  }
}
```
Without a delete signal, downstream systems slowly diverge: search indexes show removed items, caches never evict, and analytics overcounts. Decide on delete propagation early.
Step 4 — Apply changes to the sink (upsert + “only accept newer”)
The sink is where CDC correctness is won. You want a deterministic rule: “For a given primary key, apply the newest change and ignore duplicates/older arrivals.” A common pattern is to store the last-applied position per row (or per key) and only update when the incoming position is newer.
```sql
-- Example sink table (current state) stores the last applied position.
-- The position can be a log sequence number (LSN), an offset, or a strictly increasing event_id.
--
-- target.orders_current(order_id PK, status, total_cents, updated_at, source_pos)
INSERT INTO target.orders_current AS t (order_id, status, total_cents, updated_at, source_pos)
SELECT DISTINCT ON (s.order_id)  -- one row per key: keep only the newest change in the batch
  s.order_id,
  s.status,
  s.total_cents,
  s.updated_at,
  s.source_pos
FROM staging.orders_cdc AS s
WHERE s.op IN ('c', 'u')
ORDER BY s.order_id, s.source_pos DESC
ON CONFLICT (order_id) DO UPDATE
SET
  status      = EXCLUDED.status,
  total_cents = EXCLUDED.total_cents,
  updated_at  = EXCLUDED.updated_at,
  source_pos  = EXCLUDED.source_pos
WHERE EXCLUDED.source_pos > t.source_pos;  -- only accept newer positions

-- Handle deletes explicitly (either hard-delete or soft-delete).
DELETE FROM target.orders_current t
USING staging.orders_cdc s
WHERE s.op = 'd'
  AND s.order_id = t.order_id
  AND s.source_pos > t.source_pos;
```
Mini checklist: “apply” logic that survives production
- Upsert by primary key (avoid insert-only sinks)
- Carry a monotonic position into the sink (per row or per key)
- Ignore duplicates (same position) and out-of-order arrivals (older position)
- Apply deletes deterministically (delete or mark deleted)
Step 5 — Make consumers restart-safe (checkpointing + idempotency)
You should be able to stop your pipeline mid-stream, restart it, and end up in the same state. That requires: (1) storing progress (checkpoint/offset), and (2) making reprocessing safe (idempotent writes).
```python
import psycopg2
from psycopg2.extras import execute_values

# Strategy:
# - Each CDC event carries a (stream_id, position) tuple.
# - Data changes and the checkpoint advance happen in ONE transaction.
# - If we crash and retry, the batch is reapplied safely: the upserts are
#   idempotent, and cdc_processed records which positions were already seen.
DDL = """
CREATE TABLE IF NOT EXISTS cdc_checkpoints (
    stream_id  TEXT PRIMARY KEY,
    last_pos   TEXT NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Optional: track processed positions to dedupe (useful if events can repeat with the same pos).
CREATE TABLE IF NOT EXISTS cdc_processed (
    stream_id TEXT NOT NULL,
    pos       TEXT NOT NULL,
    PRIMARY KEY (stream_id, pos)
);
"""

def apply_batch(conn, stream_id: str, events: list[dict]) -> None:
    # events: [{"pos": "...", "op": "c|u|d", "key": {"order_id": 1}, "after": {...}}]
    if not events:
        return
    with conn:
        with conn.cursor() as cur:
            # 1) Record positions (safe if the batch repeats).
            execute_values(
                cur,
                "INSERT INTO cdc_processed(stream_id, pos) VALUES %s ON CONFLICT DO NOTHING",
                [(stream_id, e["pos"]) for e in events],
            )
            # 2) Apply changes (idempotent upsert/delete).
            #    In real pipelines, route per table/topic and validate schema.
            for e in events:
                order_id = e["key"]["order_id"]
                pos = e["pos"]
                if e["op"] in ("c", "u"):
                    after = e["after"]
                    cur.execute(
                        """
                        INSERT INTO target.orders_current AS t
                            (order_id, status, total_cents, updated_at, source_pos)
                        VALUES (%s, %s, %s, %s, %s)
                        ON CONFLICT (order_id) DO UPDATE
                        SET status      = EXCLUDED.status,
                            total_cents = EXCLUDED.total_cents,
                            updated_at  = EXCLUDED.updated_at,
                            source_pos  = EXCLUDED.source_pos
                        WHERE EXCLUDED.source_pos > t.source_pos
                        """,
                        (order_id, after["status"], after["total_cents"], after["updated_at"], pos),
                    )
                elif e["op"] == "d":
                    cur.execute(
                        """
                        DELETE FROM target.orders_current
                        WHERE order_id = %s AND source_pos <= %s
                        """,
                        (order_id, pos),
                    )
            # 3) Advance the checkpoint to the max position in this batch.
            #    Positions are compared as text here, so they must sort correctly
            #    as strings (fixed-width / zero-padded) for max() to be meaningful.
            last_pos = max(e["pos"] for e in events)
            cur.execute(
                """
                INSERT INTO cdc_checkpoints(stream_id, last_pos)
                VALUES (%s, %s)
                ON CONFLICT (stream_id) DO UPDATE
                SET last_pos = EXCLUDED.last_pos, updated_at = now()
                """,
                (stream_id, last_pos),
            )

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=warehouse user=etl password=secret host=127.0.0.1")
    with conn:
        with conn.cursor() as cur:
            cur.execute(DDL)
    # Demo batch (pretend it came from a broker/connector)
    demo = [
        {"pos": "00000016/000000A0", "op": "u", "key": {"order_id": 42},
         "after": {"status": "paid", "total_cents": 1299, "updated_at": "2026-01-09T12:00:00Z"}},
    ]
    apply_batch(conn, "orders_stream", demo)
```
Never advance a checkpoint unless the sink update is committed too. If you checkpoint first and crash, you’ll skip events and your sink will silently diverge.
Step 6 — Plan for schema changes (because they will happen)
CDC pipelines don’t fail because of “bad SQL.” They fail because schema changes are treated as surprises. Decide your rules:
- Additive changes: new nullable columns or new optional fields (usually safe; consumers can ignore them at first)
- Breaking changes: rename columns, change types, split/merge tables (requires versioning and coordinated rollout)
- Event contract: do you ship full-row state (“after”) or a minimal patch? Full-row tends to be easier downstream.
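An additive-tolerant consumer is mostly about reading new optional fields with defaults, so events produced before and after the column was added both parse. A small sketch (the coupon_code field is a hypothetical later addition):

```python
def parse_order(after: dict) -> dict:
    """Parse the 'after' payload; tolerate fields added after older events were produced."""
    return {
        "order_id": after["order_id"],            # required, stable key
        "status": after["status"],                # required
        "coupon_code": after.get("coupon_code"),  # added later: default to None
    }

old_event = {"order_id": 1, "status": "paid"}                          # pre-change producer
new_event = {"order_id": 2, "status": "paid", "coupon_code": "SAVE10"}  # post-change producer
```

Renames and type changes don't get this luxury; that's why they need a versioned stream, as described next.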
A practical schema-evolution rule
If a change would require reinterpreting old events, create a new versioned stream (or new topic/table) and migrate consumers deliberately. Treat the schema as a contract, not an implementation detail.
Step 7 — Operate CDC like a production system
Signals to monitor
- Lag (seconds/minutes behind)
- Throughput and backpressure (events/sec)
- Error rate + DLQ growth
- Log retention risk (WAL/binlog/redo growth)
Data correctness checks
- Daily reconciliation on counts and key aggregates
- Spot-check recent updates across source and sink
- Alerts for “no changes seen” on hot tables (often a connector stall)
- Periodic re-backfill for small reference tables (cheap and safe)
Common mistakes
CDC bugs are rarely loud. They’re usually subtle: one missed update here, one duplicate there, and eventually your numbers don’t match. Here are the most common causes—plus the straightforward fixes.
Mistake 1 — Using updated_at as the only watermark
Timestamp polling can miss rows when multiple updates share the same timestamp or when clocks drift.
- Fix: use a tie-breaker: (updated_at, id) and store both in the checkpoint.
- Fix: re-read a small safety window (e.g., last 5–10 minutes) and dedupe by key + position.
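The tie-breaker fix amounts to treating the watermark as an (updated_at, id) pair and comparing tuples, so rows sharing a timestamp can't be skipped. A minimal sketch (in-memory rows stand in for a polling query; a real implementation would push the comparison into SQL):

```python
def next_batch(rows, watermark, limit=100):
    """Return rows strictly after the (updated_at, id) watermark, in watermark order."""
    fresh = [r for r in rows if (r["updated_at"], r["id"]) > watermark]
    fresh.sort(key=lambda r: (r["updated_at"], r["id"]))
    return fresh[:limit]

rows = [
    {"id": 1, "updated_at": "2026-01-09T12:00:00Z"},
    {"id": 2, "updated_at": "2026-01-09T12:00:00Z"},  # same timestamp as id 1: a tie
    {"id": 3, "updated_at": "2026-01-09T12:00:01Z"},
]
# Last processed row was (12:00:00, id 1); the tie at the same timestamp is not skipped.
batch = next_batch(rows, ("2026-01-09T12:00:00Z", 1))
```

A timestamp-only watermark of 12:00:00 would have skipped id 2 entirely; the tuple comparison catches it.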
Mistake 2 — Ignoring deletes
Downstream systems diverge slowly: stale cache entries, zombie search results, overcounted analytics.
- Fix: emit tombstones (delete events) or soft deletes; apply them deterministically.
- Fix: run reconciliation checks that include “missing keys,” not only totals.
Mistake 3 — No idempotency (duplicates break the sink)
At-least-once delivery means retries. Without idempotent writes, duplicates become data corruption.
- Fix: upsert by primary key and keep a last-applied position per row/key.
- Fix: make checkpoint advancement atomic with sink writes (same transaction when possible).
Mistake 4 — Overwriting newer data with older events
Out-of-order arrivals happen with partitions, retries, or multi-writer setups.
- Fix: only accept updates where incoming position > stored position.
- Fix: ensure your “position” is truly monotonic for the stream you consume.
Mistake 5 — Cutover without a consistent boundary
Snapshot and streaming must meet at a known point, or you lose changes during the snapshot window.
- Fix: capture the stream start position from the snapshot and start CDC from that position.
- Fix: validate with row counts + recent-key spot checks before switching readers.
Mistake 6 — Treating schema changes as “someone else’s problem”
A renamed column can stall a connector or poison downstream parsing.
- Fix: version the contract (schema) and coordinate deployments.
- Fix: prefer additive changes; plan migrations for breaking changes.
When source and sink disagree, don’t start by “reloading everything.” Pick a small set of recently updated keys and trace their change history (positions/offsets). The first missing or misordered event is usually the root cause.
FAQ
Is CDC the same as incremental ETL?
No. Incremental ETL usually means “process new/changed rows on a schedule,” often using timestamps. CDC is specifically about capturing committed changes as a stream (often from the database log) with checkpoints, replay, and delete semantics.
Do I need Kafka to do CDC?
No. Kafka is a common transport because it’s durable and replayable, but CDC can work with managed streams, queues, cloud migration services, or even direct apply-to-sink connectors. The core requirements are durability, ordering rules, and checkpointing—not a specific product.
Can CDC be “exactly once”?
Sometimes, but don’t assume it. Most CDC systems deliver at-least-once. You get correctness by making sinks idempotent and by advancing checkpoints only when sink writes commit. “Exactly once” typically requires transactional guarantees across transport + sink or a carefully designed apply/checkpoint transaction.
How do I handle deletes in CDC pipelines?
Explicitly. Either propagate hard deletes via tombstone/delete events, or implement soft deletes with a deleted flag. If you ignore deletes, your downstream systems will slowly drift, even if updates look correct.
What’s the most common CDC cutover sequence?
Backfill → stream → validate → switch. Load a snapshot into the sink, start CDC from a consistent boundary, apply changes until caught up, validate counts and recent updates, then point consumers/readers at the new sink.
What should I store as my checkpoint?
A monotonic position from the capture system. For log-based CDC, use the log position (LSN/offset/SCN). For timestamp polling, store (updated_at, id) and re-read a safety window. Avoid using wall-clock time alone as a checkpoint.
Cheatsheet
A scan-fast CDC checklist you can keep next to your pipeline dashboard.
Design decisions
- Pick CDC mode: log-based / outbox / timestamp polling
- Define keys: stable PKs, avoid mutable business keys
- Define semantics: current state vs history (SCD2)
- Define deletes: tombstone vs soft delete
- Define schema evolution strategy: additive-first, version breaking changes
Correctness rules
- Assume at-least-once delivery (duplicates happen)
- Make sink idempotent (upsert + dedupe)
- Carry a monotonic position into the sink
- Only accept events with position > last applied
- Advance checkpoint only after sink commits
Cutover plan
- Backfill snapshot to staging
- Record stream start position
- Start CDC from that position
- Apply changes until caught up
- Validate (counts + recent keys), then switch readers
Operations
- Alert on lag and stalled streams
- Watch log retention vs lag (risk of losing changes)
- Track DLQ volume and top error types
- Run reconciliation checks (not just totals)
- Document runbooks: restart, replay, and backfill procedures
When a CDC pipeline misbehaves, it’s usually missing one of these: a stable checkpoint, idempotent apply logic, or a delete strategy. Add those and most other issues become debuggable engineering problems instead of mysteries.
Wrap-up
CDC is one of those “invisible” systems that, when done right, feels like magic: data is fresh everywhere and nobody talks about it. When done wrong, it slowly erodes trust because numbers don’t match and nobody knows why.
The practical recipe is simple: choose the right capture method, treat changes as an ordered stream with checkpoints, make your sink idempotent, plan for deletes and schema changes, and use a safe cutover sequence (backfill → stream → validate → switch).
Next actions
- Pick one table and implement CDC end-to-end (including deletes)
- Add lag + DLQ + reconciliation checks before scaling to more tables
- Write a short runbook: restart, replay, and “how to cut over”
Scale the operational maturity before you scale the table count. A CDC pipeline for one critical table is manageable. A CDC pipeline for fifty tables without monitoring and contracts is how teams end up back on full reloads.