Change Data Capture (CDC) is how you keep systems in sync by shipping only what changed—inserts, updates, and deletes—instead of rerunning expensive “full reloads.” This post explains the practical CDC options, the failure modes teams hit in production, and the design patterns that make CDC pipelines predictable: safe cutovers, ordering rules, idempotent sinks, and schema-change strategies.
Quickstart
If you’re setting up CDC (or debugging one that’s flaky), start here. These steps are the highest-impact decisions that determine whether your pipeline behaves like a reliable stream… or like a nightly lottery.
1) Decide your “source of truth” and latency target
CDC isn’t only a technical choice. It’s a product contract: “How fresh must data be?” and “What happens when we’re behind?”
- Define freshness SLO (e.g., p95 lag < 2 minutes)
- Decide which DB/table is authoritative (and for which columns)
- Define failure behavior: pause, degrade, or fall back to batch
2) Pick a CDC approach (log > triggers > timestamps)
Log-based CDC is usually the cleanest because it captures committed transactions. Triggers and timestamp polling can work—but require stricter discipline.
- Log-based: read WAL/binlog/redo logs (best fidelity)
- Trigger-based: write to an “outbox” table (good for app-owned pipelines)
- Timestamp-based: poll updated_at (simple, but easiest to get subtly wrong)
3) Make the sink idempotent
Most CDC delivery is at-least-once. That means duplicates happen. Your sink should treat duplicates as harmless.
- Use upserts by primary key (not inserts-only)
- Deduplicate by a monotonic position (LSN/offset) or event_id
- Handle deletes explicitly (tombstones or a deleted flag)
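These rules can be sketched as a toy in-memory sink (illustrative only; the event shape and field names here are assumptions, not any specific connector's format). Duplicates and replays are harmless because each key remembers the last applied position:

```python
def apply_event(state: dict, event: dict) -> None:
    """Upsert/delete by key; ignore events at or below the stored position."""
    key, pos, op = event["key"], event["pos"], event["op"]
    current = state.get(key)
    if current is not None and pos <= current["pos"]:
        return  # duplicate or older event: skip
    if op == "d":
        # Tombstone: remember the delete so late-arriving older updates can't resurrect the row
        state[key] = {"pos": pos, "row": None, "deleted": True}
    else:
        state[key] = {"pos": pos, "row": event["after"], "deleted": False}

state: dict = {}
e = {"key": 42, "pos": 7, "op": "u", "after": {"status": "paid"}}
apply_event(state, e)
apply_event(state, e)  # retried delivery: no effect
```

A real sink stores the position in the target table (as in Step 4 below), but the decision logic is exactly this small.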
4) Use the safe cutover sequence
The common trap is starting CDC “from now” and quietly losing earlier changes. Cutovers must be consistent.
- Backfill snapshot to target
- Start CDC from a consistent point
- Apply changes until caught up, validate, then switch readers
A fast decision table
| Approach | Best for | Main risk | Rule to stay safe |
|---|---|---|---|
| Log-based CDC | Warehouses, replicas, search indexes, microservice sync | Operational setup (replication slots, retention) | Monitor lag + ensure logs retained while consumers are behind |
| Outbox (triggers/app writes) | App-owned events, reliable integration, audit trails | Outbox growth + backpressure | Use partitioning + cleanup + a strict “processed position” checkpoint |
| Timestamp polling | Small systems, low-change tables, “good enough” sync | Missed rows (ties/time drift) + expensive scans | Use a watermark with a tie-breaker (updated_at + id) and re-read a safety window |
CDC is not “copy rows.” It’s “replay a history.” The big wins come from treating changes as an ordered stream with checkpoints, not as a bunch of one-off jobs.
Overview
Full reloads are the default when teams start: “We’ll rebuild the table every night.” That works until it doesn’t: tables get bigger, SLAs get tighter, and downstream systems want near-real-time updates. CDC replaces reloads with a continuous flow of committed changes.
When CDC is worth it
- You need fresh analytics (minutes, not hours)
- Full reloads are costly (time, compute, locks)
- You replicate data into caches/search/services
- You need reliable “what changed” auditing
When you can skip CDC
- Tables are small and reloads are cheap
- Downstream consumers tolerate daily staleness
- You only need aggregated facts (batch ETL is fine)
- Your schema changes constantly and you can’t version contracts yet
What this post covers
- CDC options: log-based, outbox/triggers, and timestamp polling
- Correctness rules: ordering, duplicates, deletes, and “exactly-once” reality
- Cutover patterns: how to go from batch to CDC without losing updates
- Operations: lag monitoring, retention, backpressure, and schema evolution
Most CDC issues are not “bad tools.” They’re missing contracts: no clear checkpoint, no idempotency, and no plan for deletes or schema changes. Fix those, and your pipeline becomes boring—in a good way.
Core concepts
CDC can look complex because people mix together three different problems: capturing changes, transporting changes, and applying changes. If you separate them, the design becomes much clearer.
1) Change events
A CDC pipeline turns row-level changes into events.
- Key: the primary key (or stable business key)
- Operation: insert/update/delete
- Payload: new values (and sometimes “before” values)
- Position: an ordering token (LSN/offset/SCN)
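A change event with those four parts might look like this (a hypothetical envelope; real connectors such as Debezium use their own field names for before/after/op/source):

```python
# One row-level change, expressed as an event.
event = {
    "key": {"order_id": 42},             # primary key (or stable business key)
    "op": "u",                           # c = insert, u = update, d = delete
    "before": {"status": "pending"},     # prior values (optional, connector-dependent)
    "after": {"status": "paid"},         # new values
    "pos": "00000016/000000A0",          # ordering token (a Postgres LSN here)
}
```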
2) Checkpoints (a.k.a. “where are we?”)
Checkpoints are what make recovery possible.
- Store the last processed position per stream/partition
- Make checkpoints durable (transactional if possible)
- Be able to reprocess from a prior checkpoint safely
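The "transactional if possible" point is worth making concrete. A minimal sketch using SQLite (table and stream names are hypothetical; a real pipeline would use the sink database): the data write and the checkpoint advance commit in the same transaction, so a crash can never leave the checkpoint ahead of the sink.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_current (order_id INTEGER PRIMARY KEY, status TEXT, pos TEXT)")
conn.execute("CREATE TABLE checkpoints (stream_id TEXT PRIMARY KEY, last_pos TEXT)")

def apply_and_checkpoint(conn, stream_id, order_id, status, pos):
    with conn:  # one transaction: both writes commit together or not at all
        conn.execute(
            "INSERT INTO orders_current(order_id, status, pos) VALUES (?, ?, ?) "
            "ON CONFLICT(order_id) DO UPDATE SET status = excluded.status, pos = excluded.pos "
            "WHERE excluded.pos > orders_current.pos",  # only accept newer positions
            (order_id, status, pos),
        )
        conn.execute(
            "INSERT INTO checkpoints(stream_id, last_pos) VALUES (?, ?) "
            "ON CONFLICT(stream_id) DO UPDATE SET last_pos = excluded.last_pos "
            "WHERE excluded.last_pos > checkpoints.last_pos",  # never regress on replay
            (stream_id, pos),
        )

apply_and_checkpoint(conn, "orders", 1, "paid", "0002")
```

Replaying an older position is now a no-op on both tables, which is exactly what makes restarts safe.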
CDC approaches (and why they behave differently)
| Approach | How it captures changes | Strengths | Weak spots |
|---|---|---|---|
| Log-based | Reads the database transaction log after commit | Best fidelity, low app impact, captures deletes well | Requires DB configuration, log retention, and offsets |
| Triggers / outbox | Writes “change rows” into an outbox table | App-owned contract, easy to add metadata, audit-friendly | Triggers add write overhead; outbox can grow fast |
| Timestamp polling | Queries rows where updated_at > last watermark | Simple to build, no DB log access needed | Misses ties, heavy scans, can’t reliably detect deletes without extra work |
Ordering, duplicates, and “exactly once”
CDC delivery is typically at-least-once (messages can be retried). “Exactly once” is achievable only with extra constraints: deterministic ordering, transactional writes to the sink, and a checkpoint that advances atomically with the sink update.
What you can rely on
- Events come with a position/offset you can store
- You can replay events after failures
- You can make duplicates harmless with idempotency
What you should not assume
- Global ordering across all tables/partitions
- Exactly-once delivery “for free”
- That deletes will magically propagate without design
If you apply changes out of order (or with a weak watermark), you can overwrite newer data with older updates. The fix is to carry a monotonic position (LSN/offset) into the sink and only accept newer positions per key.
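The position guard is a few lines of logic (a toy example with illustrative field names): without it, the retried older event below would overwrite the newer state.

```python
# Out-of-order arrival: pos 9 lands before a retried pos 5.
events = [
    {"key": 1, "pos": 9, "after": {"status": "shipped"}},
    {"key": 1, "pos": 5, "after": {"status": "paid"}},  # stale retry
]

state = {}
for e in events:
    current = state.get(e["key"])
    if current is None or e["pos"] > current["pos"]:  # accept only strictly newer
        state[e["key"]] = e
# Last-write-wins by *arrival* would end at "paid"; the guard keeps "shipped".
```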
Deletes: tombstones vs soft deletes
CDC pipelines often “look correct” until the first real deletion. Choose an approach up front:
- Tombstone: emit an event indicating the key is deleted and remove it downstream
- Soft delete: keep the row but set deleted=true (useful for auditing or GDPR workflows)
- Type 2 history: keep versions with valid_from/valid_to (best for analytics history)
Step-by-step
This guide is tool-agnostic: the steps are the same whether you use a managed service, a connector framework, or a custom pipeline. The goal is a pipeline you can restart, replay, and reason about.
Step 1 — Define the contract (the “truth” you want downstream)
- Tables: which tables are replicated (and which are out of scope)
- Keys: stable primary keys; avoid mutable business keys
- Semantics: do you need “current state” or full history?
- Deletes: hard delete, soft delete, or tombstones?
- Lag budget: what’s acceptable when the pipeline is behind?
Step 2 — Backfill a snapshot (get the baseline into the sink)
CDC streams changes, not entire state. You still need an initial baseline. Keep it boring: bulk-load the current state into the sink, record the cutover point, and don’t mix “random snapshots” with “live changes.”
Snapshot checklist
- Export from the source with a consistent view (transaction or snapshot mode)
- Load into a staging table first (then swap/merge)
- Capture the stream start position (LSN/offset) for the next step
Common cutover mistake
Starting the stream “from now” after the snapshot means you can miss changes that happened during the snapshot. The fix: snapshot + stream must share a consistent boundary.
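A toy simulation makes the boundary concrete (the change log and values here are invented): the snapshot reflects everything up to a recorded start position, and streaming resumes from exactly that position, so a change that lands mid-snapshot is replayed rather than lost.

```python
# A totally ordered change log: (position, change).
log = [
    (1, {"id": "a", "v": 1}),
    (2, {"id": "b", "v": 1}),
    (3, {"id": "a", "v": 2}),  # happens while the snapshot is loading
]

start_pos = 2                # stream position captured with the snapshot
snapshot = {"a": 1, "b": 1}  # consistent as of start_pos

sink = dict(snapshot)
for pos, change in log:
    if pos > start_pos:      # replay only what the snapshot could not see
        sink[change["id"]] = change["v"]
# sink == {"a": 2, "b": 1}: the mid-snapshot update survives
```

Starting "from now" instead would mean start_pos = 3, and the update to "a" would vanish.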
Step 3 — Start streaming changes (log-based example)
In log-based CDC, a connector reads committed changes from the database log and emits structured change events to a transport (often a message broker). Even if you use a managed tool, you should understand the moving parts: connection, checkpoint storage, and how schema changes are represented.
```json
{
  "name": "cdc-postgres-orders",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres.internal",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "${CDC_PASSWORD}",
    "database.dbname": "appdb",
    "database.server.name": "appdb",
    "slot.name": "debezium_orders_slot",
    "publication.autocreate.mode": "filtered",
    "schema.include.list": "public",
    "table.include.list": "public.orders,public.order_items",
    "plugin.name": "pgoutput",
    "snapshot.mode": "initial",
    "tombstones.on.delete": "true",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "false",
    "value.converter.schemas.enable": "false",
    "topic.prefix": "cdc",
    "heartbeat.interval.ms": "10000",
    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "cdc.dlq"
  }
}
```
Without a delete signal, downstream systems slowly diverge: search indexes show removed items, caches never evict, and analytics overcounts. Decide on delete propagation early.
Step 4 — Apply changes to the sink (upsert + “only accept newer”)
The sink is where CDC correctness is won. You want a deterministic rule: “For a given primary key, apply the newest change and ignore duplicates/older arrivals.” A common pattern is to store the last-applied position per row (or per key) and only update when the incoming position is newer.
```sql
-- Example sink table (current state) stores the last applied position.
-- The position can be a log sequence number (LSN), an offset, or a strictly increasing event_id.
--
-- target.orders_current(order_id PK, status, total_cents, updated_at, source_pos)
INSERT INTO target.orders_current AS t (order_id, status, total_cents, updated_at, source_pos)
SELECT DISTINCT ON (s.order_id)  -- one row per key: keep only the newest change in the batch
  s.order_id,
  s.status,
  s.total_cents,
  s.updated_at,
  s.source_pos
FROM staging.orders_cdc AS s
WHERE s.op IN ('c', 'u')
ORDER BY s.order_id, s.source_pos DESC
ON CONFLICT (order_id) DO UPDATE
SET
  status      = EXCLUDED.status,
  total_cents = EXCLUDED.total_cents,
  updated_at  = EXCLUDED.updated_at,
  source_pos  = EXCLUDED.source_pos
WHERE EXCLUDED.source_pos > t.source_pos;  -- only accept newer positions

-- Handle deletes explicitly (either hard-delete or soft-delete).
DELETE FROM target.orders_current t
USING staging.orders_cdc s
WHERE s.op = 'd'
  AND s.order_id = t.order_id
  AND s.source_pos > t.source_pos;
```
Mini checklist: “apply” logic that survives production
- Upsert by primary key (avoid insert-only sinks)
- Carry a monotonic position into the sink (per row or per key)
- Ignore duplicates (same position) and out-of-order arrivals (older position)
- Apply deletes deterministically (delete or mark deleted)
Step 5 — Make consumers restart-safe (checkpointing + idempotency)
You should be able to stop your pipeline mid-stream, restart it, and end up in the same state. That requires: (1) storing progress (checkpoint/offset), and (2) making reprocessing safe (idempotent writes).
```python
import psycopg2
from psycopg2.extras import execute_values

# Strategy:
# - Each CDC event carries a (stream_id, position) tuple.
# - Data changes and the checkpoint advance happen in ONE transaction.
# - If we crash and retry, the batch is reapplied safely: the upserts are
#   idempotent, and cdc_processed records which positions were already seen.
DDL = """
CREATE TABLE IF NOT EXISTS cdc_checkpoints (
    stream_id  TEXT PRIMARY KEY,
    last_pos   TEXT NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Optional: track processed positions to dedupe (useful if events can repeat with the same pos).
CREATE TABLE IF NOT EXISTS cdc_processed (
    stream_id TEXT NOT NULL,
    pos       TEXT NOT NULL,
    PRIMARY KEY (stream_id, pos)
);
"""

def apply_batch(conn, stream_id: str, events: list[dict]) -> None:
    # events: [{"pos": "...", "op": "c|u|d", "key": {"order_id": 1}, "after": {...}}]
    if not events:
        return
    with conn:
        with conn.cursor() as cur:
            # 1) Record positions (safe if the batch repeats).
            execute_values(
                cur,
                "INSERT INTO cdc_processed(stream_id, pos) VALUES %s ON CONFLICT DO NOTHING",
                [(stream_id, e["pos"]) for e in events],
            )
            # 2) Apply changes (idempotent upsert/delete).
            #    In real pipelines, route per table/topic and validate schema.
            for e in events:
                order_id = e["key"]["order_id"]
                pos = e["pos"]
                if e["op"] in ("c", "u"):
                    after = e["after"]
                    cur.execute(
                        """
                        INSERT INTO target.orders_current AS t
                            (order_id, status, total_cents, updated_at, source_pos)
                        VALUES (%s, %s, %s, %s, %s)
                        ON CONFLICT (order_id) DO UPDATE
                        SET status      = EXCLUDED.status,
                            total_cents = EXCLUDED.total_cents,
                            updated_at  = EXCLUDED.updated_at,
                            source_pos  = EXCLUDED.source_pos
                        WHERE EXCLUDED.source_pos > t.source_pos
                        """,
                        (order_id, after["status"], after["total_cents"], after["updated_at"], pos),
                    )
                elif e["op"] == "d":
                    cur.execute(
                        """
                        DELETE FROM target.orders_current
                        WHERE order_id = %s AND source_pos <= %s
                        """,
                        (order_id, pos),
                    )
            # 3) Advance the checkpoint to the max position in this batch.
            #    Positions are compared as text here, so they must sort correctly
            #    as strings (fixed-width / zero-padded) for max() to be meaningful.
            last_pos = max(e["pos"] for e in events)
            cur.execute(
                """
                INSERT INTO cdc_checkpoints(stream_id, last_pos)
                VALUES (%s, %s)
                ON CONFLICT (stream_id) DO UPDATE
                SET last_pos = EXCLUDED.last_pos, updated_at = now()
                """,
                (stream_id, last_pos),
            )

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=warehouse user=etl password=secret host=127.0.0.1")
    with conn:
        with conn.cursor() as cur:
            cur.execute(DDL)
    # Demo batch (pretend it came from a broker/connector)
    demo = [
        {"pos": "00000016/000000A0", "op": "u", "key": {"order_id": 42},
         "after": {"status": "paid", "total_cents": 1299, "updated_at": "2026-01-09T12:00:00Z"}},
    ]
    apply_batch(conn, "orders_stream", demo)
```
Never advance a checkpoint unless the sink update is committed too. If you checkpoint first and crash, you’ll skip events and your sink will silently diverge.
Step 6 — Plan for schema changes (because they will happen)
CDC pipelines don’t fail because of “bad SQL.” They fail because schema changes are treated as surprises. Decide your rules:
- Additive changes: new nullable columns or new optional fields (usually safe; consumers can ignore them at first)
- Breaking changes: rename columns, change types, split/merge tables (requires versioning and coordinated rollout)
- Event contract: do you ship full-row state (“after”) or a minimal patch? Full-row tends to be easier downstream.
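An additive-tolerant consumer is mostly about reading new optional fields with defaults, so events produced before and after the column was added both parse. A small sketch (the coupon_code field is a hypothetical later addition):

```python
def parse_order(after: dict) -> dict:
    """Parse the 'after' payload; tolerate fields added after older events were produced."""
    return {
        "order_id": after["order_id"],            # required, stable key
        "status": after["status"],                # required
        "coupon_code": after.get("coupon_code"),  # added later: default to None
    }

old_event = {"order_id": 1, "status": "paid"}                          # pre-change producer
new_event = {"order_id": 2, "status": "paid", "coupon_code": "SAVE10"}  # post-change producer
```

Renames and type changes don't get this luxury; that's why they need a versioned stream, as described next.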
A practical schema-evolution rule
If a change would require reinterpreting old events, create a new versioned stream (or new topic/table) and migrate consumers deliberately. Treat the schema as a contract, not an implementation detail.
Step 7 — Operate CDC like a production system
Signals to monitor
- Lag (seconds/minutes behind)
- Throughput and backpressure (events/sec)
- Error rate + DLQ growth
- Log retention risk (WAL/binlog/redo growth)
Data correctness checks
- Daily reconciliation on counts and key aggregates
- Spot-check recent updates across source and sink
- Alerts for “no changes seen” on hot tables (often a connector stall)
- Periodic re-backfill for small reference tables (cheap and safe)
Common mistakes
CDC bugs are rarely loud. They’re usually subtle: one missed update here, one duplicate there, and eventually your numbers don’t match. Here are the most common causes—plus the straightforward fixes.
Mistake 1 — Using updated_at as the only watermark
Timestamp polling can miss rows when multiple updates share the same timestamp or when clocks drift.
- Fix: use a tie-breaker: (updated_at, id) and store both in the checkpoint.
- Fix: re-read a small safety window (e.g., last 5–10 minutes) and dedupe by key + position.
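The tie-breaker fix amounts to treating the watermark as an (updated_at, id) pair and comparing tuples, so rows sharing a timestamp can't be skipped. A minimal sketch (in-memory rows stand in for a polling query; a real implementation would push the comparison into SQL):

```python
def next_batch(rows, watermark, limit=100):
    """Return rows strictly after the (updated_at, id) watermark, in watermark order."""
    fresh = [r for r in rows if (r["updated_at"], r["id"]) > watermark]
    fresh.sort(key=lambda r: (r["updated_at"], r["id"]))
    return fresh[:limit]

rows = [
    {"id": 1, "updated_at": "2026-01-09T12:00:00Z"},
    {"id": 2, "updated_at": "2026-01-09T12:00:00Z"},  # same timestamp as id 1: a tie
    {"id": 3, "updated_at": "2026-01-09T12:00:01Z"},
]
# Last processed row was (12:00:00, id 1); the tie at the same timestamp is not skipped.
batch = next_batch(rows, ("2026-01-09T12:00:00Z", 1))
```

A timestamp-only watermark of 12:00:00 would have skipped id 2 entirely; the tuple comparison catches it.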
Mistake 2 — Ignoring deletes
Downstream systems diverge slowly: stale cache entries, zombie search results, overcounted analytics.
- Fix: emit tombstones (delete events) or soft deletes; apply them deterministically.
- Fix: run reconciliation checks that include “missing keys,” not only totals.
Mistake 3 — No idempotency (duplicates break the sink)
At-least-once delivery means retries. Without idempotent writes, duplicates become data corruption.
- Fix: upsert by primary key and keep a last-applied position per row/key.
- Fix: make checkpoint advancement atomic with sink writes (same transaction when possible).
Mistake 4 — Overwriting newer data with older events
Out-of-order arrivals happen with partitions, retries, or multi-writer setups.
- Fix: only accept updates where incoming position > stored position.
- Fix: ensure your “position” is truly monotonic for the stream you consume.
Mistake 5 — Cutover without a consistent boundary
Snapshot and streaming must meet at a known point, or you lose changes during the snapshot window.
- Fix: capture the stream start position from the snapshot and start CDC from that position.
- Fix: validate with row counts + recent-key spot checks before switching readers.
Mistake 6 — Treating schema changes as “someone else’s problem”
A renamed column can stall a connector or poison downstream parsing.
- Fix: version the contract (schema) and coordinate deployments.
- Fix: prefer additive changes; plan migrations for breaking changes.
When source and sink disagree, don’t start by “reloading everything.” Pick a small set of recently updated keys and trace their change history (positions/offsets). The first missing or misordered event is usually the root cause.
FAQ
Is CDC the same as incremental ETL?
No. Incremental ETL usually means “process new/changed rows on a schedule,” often using timestamps. CDC is specifically about capturing committed changes as a stream (often from the database log) with checkpoints, replay, and delete semantics.
Do I need Kafka to do CDC?
No. Kafka is a common transport because it’s durable and replayable, but CDC can work with managed streams, queues, cloud migration services, or even direct apply-to-sink connectors. The core requirements are durability, ordering rules, and checkpointing—not a specific product.
Can CDC be “exactly once”?
Sometimes, but don’t assume it. Most CDC systems deliver at-least-once. You get correctness by making sinks idempotent and by advancing checkpoints only when sink writes commit. “Exactly once” typically requires transactional guarantees across transport + sink or a carefully designed apply/checkpoint transaction.
How do I handle deletes in CDC pipelines?
Explicitly. Either propagate hard deletes via tombstone/delete events, or implement soft deletes with a deleted flag. If you ignore deletes, your downstream systems will slowly drift, even if updates look correct.
What’s the most common CDC cutover sequence?
Backfill → stream → validate → switch. Load a snapshot into the sink, start CDC from a consistent boundary, apply changes until caught up, validate counts and recent updates, then point consumers/readers at the new sink.
What should I store as my checkpoint?
A monotonic position from the capture system. For log-based CDC, use the log position (LSN/offset/SCN). For timestamp polling, store (updated_at, id) and re-read a safety window. Avoid using wall-clock time alone as a checkpoint.
Cheatsheet
A scan-fast CDC checklist you can keep next to your pipeline dashboard.
Design decisions
- Pick CDC mode: log-based / outbox / timestamp polling
- Define keys: stable PKs, avoid mutable business keys
- Define semantics: current state vs history (SCD2)
- Define deletes: tombstone vs soft delete
- Define schema evolution strategy: additive-first, version breaking changes
Correctness rules
- Assume at-least-once delivery (duplicates happen)
- Make sink idempotent (upsert + dedupe)
- Carry a monotonic position into the sink
- Only accept events with position > last applied
- Advance checkpoint only after sink commits
Cutover plan
- Backfill snapshot to staging
- Record stream start position
- Start CDC from that position
- Apply changes until caught up
- Validate (counts + recent keys), then switch readers
Operations
- Alert on lag and stalled streams
- Watch log retention vs lag (risk of losing changes)
- Track DLQ volume and top error types
- Run reconciliation checks (not just totals)
- Document runbooks: restart, replay, and backfill procedures
When a CDC pipeline misbehaves, it’s usually missing one of these: a stable checkpoint, idempotent apply logic, or a delete strategy. Add those and most other issues become debuggable engineering problems instead of mysteries.
Wrap-up
CDC is one of those “invisible” systems that, when done right, feels like magic: data is fresh everywhere and nobody talks about it. When done wrong, it slowly erodes trust because numbers don’t match and nobody knows why.
The practical recipe is simple: choose the right capture method, treat changes as an ordered stream with checkpoints, make your sink idempotent, plan for deletes and schema changes, and use a safe cutover sequence (backfill → stream → validate → switch).
Next actions
- Pick one table and implement CDC end-to-end (including deletes)
- Add lag + DLQ + reconciliation checks before scaling to more tables
- Write a short runbook: restart, replay, and “how to cut over”
Scale the operational maturity before you scale the table count. A CDC pipeline for one critical table is manageable. A CDC pipeline for fifty tables without monitoring and contracts is how teams end up back on full reloads.