Data Engineering & Databases · Schema Changes

Schema Migrations Without Downtime

Safe patterns for evolving schemas in production.

Reading time: ~8–12 min
Level: All levels

“Schema changes are easy” is only true on an empty database. In production, a migration can cause downtime because it blocks writes, rewrites large tables, or fights with long-running transactions. This guide shows the battle-tested patterns behind schema migrations without downtime—so you can evolve your database while traffic stays up.


Quickstart

If you need a safe path today, use this. It’s the shortest route to “no surprises” migrations, even if your schema changes are big.

1) Default to the Expand → Backfill → Contract pattern

Most downtime comes from trying to “change the schema and app at the same time.” Expand/contract avoids that: you add new things first, migrate data gradually, then remove old things later.

  • Expand: add new columns/tables/indexes (backwards compatible)
  • Backfill: copy/transform data in small batches
  • Contract: flip reads, then remove legacy paths and schema

2) Put a “lock budget” on every migration

A migration that waits forever for a lock is a production incident waiting to happen. Decide how long you’re willing to wait, and fail fast if you can’t get the lock.

  • Set a lock timeout (seconds, not minutes)
  • Set a statement timeout for long DDL/DML
  • Run migrations with a clear rollback plan
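As a sketch of what a lock budget can look like in a migration runner: fail fast when the lock isn't available, back off, and retry a bounded number of times. The exception class and retry counts below are illustrative, not any particular driver's API.

```python
import time

class LockNotAcquired(Exception):
    """Stand-in for your driver's lock-timeout error (illustrative;
    with psycopg2 this would be a lock_not_available error)."""

def run_with_lock_budget(execute, ddl, attempts=3, base_delay=1.0):
    """Run a DDL statement, failing fast on lock contention and retrying
    with exponential backoff instead of queueing behind traffic."""
    for attempt in range(1, attempts + 1):
        try:
            execute(ddl)    # e.g. cursor.execute, after SET lock_timeout
            return attempt  # report how many tries it took
        except LockNotAcquired:
            if attempt == attempts:
                raise  # give up loudly; retry off-peak instead
            time.sleep(base_delay * 2 ** (attempt - 1))
```

The key property is that a busy table produces a fast, visible failure you can reschedule, not a migration silently blocking production writes.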

3) Make the app backward compatible before the risky step

The simplest safety trick: deploy application code that can read/write both the old and new shape before you fully switch.

  • Reads: prefer new field, fallback to old
  • Writes: dual-write during transition (or trigger-based)
  • Use a feature flag for the final cutover

4) Backfill like a background job, not a single SQL blast

Big updates are how you create replication lag, lock contention, and dead tuples. Backfill incrementally and throttle.

  • Chunk by primary key or timestamp
  • Keep transactions small and idempotent
  • Monitor lag/CPU/IO and slow down if needed

A practical definition of “no downtime”

“No downtime” doesn’t mean “no locks ever.” It means you avoid long blocking locks and you keep user-facing requests succeeding while the migration runs. In practice: short locks + retry-safe app behavior.

Preflight checklist (2 minutes)

  • Is this change backward compatible?
  • Will it rewrite a big table or rebuild indexes?
  • Can it be done as “add new + migrate” instead of “change in place”?
  • Do you have a safe rollback (code + schema)?
  • Are there long-running transactions right now?
  • Is replication lag low and stable?
  • Is your migration tool running in the same region/network?
  • Do you have monitoring on errors, latency, locks, and lag?

Overview

Schema migrations touch your most sensitive shared resource: the database. When they go wrong, they go wrong loudly: blocked writes, slow queries, deadlocks, replication lag, and rollbacks that take longer than the migration itself. The goal of schema migrations without downtime is to make changes boring—even when your dataset is large.

What this post covers

  • The mental model: Expand → Backfill → Contract
  • Compatibility rules: what “backward compatible” means in practice
  • How to backfill safely (batching, idempotency, throttling)
  • How to cut over reads/writes with minimal risk
  • Common “looks safe, breaks prod” pitfalls and fixes
  • How to think about locks and table rewrites
  • When to use online/ghost migration tools vs built-in DDL
  • Migration runbooks: preflight, execution, monitoring, rollback
  • FAQ for the migrations people actually search for
  • A cheatsheet you can keep open during deploys

This guide is database-agnostic, but examples lean toward common production setups (PostgreSQL/MySQL and ORM-backed apps). The patterns apply broadly because they’re about compatibility and operational safety, not one vendor’s syntax.

Why downtime happens during migrations

  • Blocking locks: DDL often requires locks that stop reads/writes.
  • Table rewrites: some changes rewrite the whole table (slow + lock-heavy).
  • Long transactions: they prevent schema changes and vacuum/cleanup.
  • Big backfills: huge updates create contention, bloat, and lag.

Core concepts

Most teams don’t need “fancier migrations.” They need a shared vocabulary and a couple of default patterns. Once everyone agrees on the mental model, migrations become repeatable.

1) Backward vs forward compatibility

Backward compatible change

Old application code keeps working after the schema change. This is what you want during “Expand.”

  • Adding a nullable column
  • Adding a new table (unused until code deploy)
  • Adding an index (careful: can still be heavy)

Forward compatible change

New application code works even if the schema change isn’t present yet. This is useful for safe rollouts and rollback.

  • Reads tolerate missing columns (feature-flagged)
  • Writes don’t require the new schema until enabled
  • Dual-read logic (new preferred, old fallback)

2) The Expand → Backfill → Contract pattern

This is the default approach for most “risky” changes: rename a column, split a table, change data type, introduce a new relationship, or add a non-trivial constraint.

Why it works

  • You avoid a single “big bang” step where schema + code must switch together.
  • You reduce lock duration by using additive changes first.
  • You can pause or roll back at multiple points without data loss.
  • You migrate data at the pace production can afford.

3) Locks, lock budgets, and the “blast radius”

The operational risk of a migration is mostly determined by which locks it needs and how long it holds them. You can’t avoid locks entirely, but you can avoid waiting forever.

A simple lock budget rule

  • Additive DDL (add column / add table): a short lock only is acceptable. If it can’t get the lock, fail fast and retry later (off-peak).
  • Index build: prefer online/concurrent builds. Run off-peak, monitor IO, and throttle.
  • Backfill writes: small transactions, low contention. If the database struggles, reduce batch size and pause on lag.
  • Contract (drop column / drop old index): a short lock, reversible if delayed. Defer cleanup until a stability window.

4) “Safe vs risky” schema operations

What’s “safe” depends on the database engine and version, but these categories are practical for planning. When in doubt, treat “risky” as “needs expand/contract.”

  • Add nullable column (low risk): do it as Expand; backfill later.
  • Add NOT NULL / a default on a large existing table (high risk): add nullable → backfill → validate → enforce.
  • Rename a column used by the app (high risk): add new column → dual-write → cutover → drop old.
  • Change a column type, e.g. int → uuid or text → json (high risk): shadow column + backfill + dual-read → switch.
  • Add a foreign key constraint on a big table (medium to high risk): add the constraint in a form you can validate incrementally; plan for the validation load.
  • Drop column / drop table (medium risk): defer to the Contract phase after a stability period.

The biggest hidden risk: long-lived transactions

Even a “fast” schema change can hang if a long-running transaction is open (idle-in-transaction, a slow report, a stuck job). Make “check for long transactions” part of your preflight and make migrations time out rather than wait forever.
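On PostgreSQL, the preflight check can be a query against pg_stat_activity plus a simple go/no-go rule. The 5-minute budget below is an illustrative default; tune it to your traffic.

```python
# Preflight sketch for PostgreSQL: list transactions older than a budget.
LONG_TX_SQL = """
SELECT pid, state, now() - xact_start AS xact_age, left(query, 80) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '5 minutes'
ORDER BY xact_start;
"""

def safe_to_migrate(open_tx_ages_seconds, budget_seconds=300):
    """Go/no-go rule: abort if any open transaction exceeds the budget,
    because DDL queued behind it would block everything else."""
    return all(age < budget_seconds for age in open_tx_ages_seconds)
```

Run the query right before the migration; if it returns rows, either wait, kill the offending sessions deliberately, or reschedule.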

Step-by-step

Below is a practical runbook you can adapt to most production systems. The example change we’ll use is common: you want to replace users.full_name with users.first_name/users.last_name, without breaking old app versions and without pausing writes.

Step 0 — Classify the change and choose a strategy

Use a “strategy picker”

  • Additive only? You can usually do it directly (still with lock timeouts).
  • In-place rewrite likely? Use expand/contract or an online schema tool.
  • Requires transforming existing rows? Plan a backfill job with batching + idempotency.
  • Touches hot tables? Run off-peak and keep locks extremely short.

Decide your success criteria

  • Max acceptable added latency (p95/p99)
  • Max acceptable error increase
  • Max acceptable replication lag
  • Rollback threshold (what triggers an abort)

Step 1 — Expand: add new schema in a backward compatible way

Expand changes should be add-only. They should not require the application to change immediately, and they should be safe to deploy even if you need to roll back the app.

Example: expand with lock/statement timeouts

This example uses common PostgreSQL-style safeguards. Even if you use another database, the operational idea is the same: fail fast if you can’t get the lock, and avoid long-running DDL during peak traffic.

-- Expand phase (example)
-- Keep migrations from waiting forever if the table is busy.
SET lock_timeout = '3s';
SET statement_timeout = '5min';

-- 1) Add new nullable columns (backward compatible)
ALTER TABLE users
  ADD COLUMN first_name text,
  ADD COLUMN last_name  text;

-- 2) Add an index for the future query pattern (use the online/concurrent form if available)
-- In PostgreSQL, this must be run outside a transaction block:
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_last_name ON users (last_name);

-- 3) Optional: add a trigger to keep columns in sync during the transition
-- (Some teams prefer dual-write in the app instead.)

Triggers vs dual-write

Dual-write in the app is easier to version-control and test. Triggers are useful if you have multiple writers (many services) and want consistency without coordinating releases. Pick one and keep the transition window short.
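If you choose the trigger route, a PostgreSQL (PL/pgSQL) sketch might look like this, stored here as a string you would apply during Expand. The name-splitting rule is an assumption; adapt it to your data before using it.

```python
# PL/pgSQL sketch (PostgreSQL): keep first_name/last_name in sync when a
# legacy writer still sets only full_name.
SYNC_NAME_TRIGGER_SQL = r"""
CREATE OR REPLACE FUNCTION users_sync_name() RETURNS trigger AS $$
BEGIN
  IF NEW.first_name IS NULL AND NEW.full_name IS NOT NULL THEN
    NEW.first_name := split_part(NEW.full_name, ' ', 1);
    -- Everything after the first word; NULL for single-word names.
    NEW.last_name  := NULLIF(regexp_replace(NEW.full_name, '^\S+\s*', ''), '');
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_sync_name_trg
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION users_sync_name();
"""
```

Remember to drop the trigger and function in the Contract phase along with the old column.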

Step 2 — Deploy application code that supports both schemas

Compatibility rules for the app

  • Reads: prefer new columns, fall back to old
  • Writes: write both (or write new + trigger old) during transition
  • Validation: don’t require new fields until after backfill
  • Feature flag: gate the final switch (and keep a rollback path)
  • Be tolerant of nulls on new columns at first
  • Keep old queries working until Contract
  • Run a canary release before full rollout
  • Add metrics: % reads using new path, % rows backfilled
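A minimal sketch of the dual-read/dual-write rules, using plain dicts in place of your ORM models (the column names mirror the running full_name → first_name/last_name example):

```python
def read_display_name(row: dict) -> str:
    """Dual-read: prefer the new columns, fall back to legacy full_name."""
    if row.get("first_name"):
        return " ".join(p for p in (row["first_name"], row.get("last_name")) if p)
    return row.get("full_name") or ""

def write_name(row: dict, first: str, last=None) -> dict:
    """Dual-write during the transition: keep both shapes consistent so
    old app versions (and a rollback) still work."""
    row["first_name"] = first
    row["last_name"] = last
    row["full_name"] = " ".join(p for p in (first, last) if p)  # legacy shape
    return row
```

Once reads are fully on the new path and the backfill is done, `write_name` stops setting `full_name` and `read_display_name` drops its fallback; both changes belong behind the cutover flag.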

Step 3 — Backfill data in batches (throttled + idempotent)

Backfill is where “no downtime” is won or lost. The safe approach: update a small chunk, commit, sleep if needed, and repeat until complete. You want backfill to behave like a background job, not a single huge transaction.

Backfill checklist

  • Chunk by primary key (stable ordering)
  • Keep each transaction small (hundreds/thousands of rows)
  • Make it restartable (idempotent updates)
  • Throttle based on load/replication lag
  • Log progress and estimate completion time

What to monitor during backfill

  • DB CPU and IO utilization
  • Lock waits and deadlocks
  • Replication lag
  • p95/p99 latency on critical endpoints
  • Row bloat / vacuum pressure (engine dependent)

Example: a safe batched backfill loop

This pattern updates only rows that still need backfill, uses a stable paging key, and commits each chunk. It’s designed to be safe to re-run (it won’t keep rewriting already-migrated rows).

import os
import time
import psycopg2

DSN = os.environ["DATABASE_URL"]

BATCH_SIZE = 2000
SLEEP_SECONDS = 0.10  # small pause to reduce contention

def split_name(full_name: str):
    parts = (full_name or "").strip().split()
    if not parts:
        return None, None
    if len(parts) == 1:
        return parts[0], None
    return parts[0], " ".join(parts[1:])

with psycopg2.connect(DSN) as conn:
    conn.autocommit = False
    with conn.cursor() as cur:
        last_id = 0

        while True:
            cur.execute(
                """
                SELECT id, full_name
                FROM users
                WHERE id > %s
                  AND (first_name IS NULL OR last_name IS NULL)
                ORDER BY id
                LIMIT %s
                """,
                (last_id, BATCH_SIZE),
            )
            rows = cur.fetchall()
            if not rows:
                break

            for user_id, full_name in rows:
                first, last = split_name(full_name)
                cur.execute(
                    """
                    UPDATE users
                    SET first_name = COALESCE(first_name, %s),
                        last_name  = COALESCE(last_name,  %s)
                    WHERE id = %s
                    """,
                    (first, last, user_id),
                )
                last_id = user_id

            conn.commit()
            time.sleep(SLEEP_SECONDS)

print("Backfill complete.")

Don’t backfill under “default isolation assumptions”

If your backfill reads and writes the same hot rows as production traffic, you can create contention or deadlocks. Use small batches, index the selection predicate if needed, and consider running backfill at a lower priority/off-peak.
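One way to implement throttling is an adaptive sleep between batches: back off while replication lag is over budget, and speed back up once the database recovers. All thresholds below are illustrative.

```python
def next_sleep(current_sleep: float, replication_lag_s: float,
               lag_budget_s: float = 5.0,
               min_sleep: float = 0.05, max_sleep: float = 5.0) -> float:
    """Return the pause before the next backfill batch: double it while
    replication lag exceeds the budget, halve it once lag recovers."""
    if replication_lag_s > lag_budget_s:
        return min(current_sleep * 2, max_sleep)  # database is struggling
    return max(current_sleep / 2, min_sleep)      # room to go faster
```

Call it after each commit in the backfill loop, feeding it your lag metric (on PostgreSQL, for example, from pg_stat_replication).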

Step 4 — Cut over reads (then writes) with a feature flag

The safest cutover is a controlled, observable switch. Flip reads first (so you can validate behavior without changing data), then stop writing to the old shape.

A conservative cutover sequence

  1. Canary: enable “read new columns” for a small % of traffic.
  2. Validate: monitor errors, latency, and mismatches (old vs new values).
  3. Ramp up: move to 25% → 50% → 100% if stable.
  4. Stop dual-write: once reads are on the new path and backfill is complete.
  5. Stability window: keep old schema for a while to allow fast rollback.
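The canary and ramp steps assume a deterministic flag: each user gets a stable decision, so raising the percentage only ever adds users to the new read path. A hash-bucket sketch (the flag name is illustrative):

```python
import hashlib

def in_rollout(user_id: str, percent: int, flag: str = "read_new_name") -> bool:
    """Deterministic percentage rollout: hash user+flag into one of 100
    buckets; buckets below `percent` take the new path. Ramping
    5 -> 25 -> 50 -> 100 never flips a user back to the old path."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:2], "big") % 100 < percent
```

Keying the hash on the flag name as well as the user means different flags get independent cohorts, so one rollout's canary users don't absorb all the risk of every experiment.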

Step 5 — Contract: remove old paths and clean up the schema

Contract is where you remove the old column/index/table, but only after you’ve proven the system runs on the new shape. Treat cleanup as its own deploy: it’s lower urgency and shouldn’t be rushed.

Example: Contract with safety checks

Before dropping anything, verify no code path still depends on it. If you can’t prove it, wait longer. In many teams, Contract happens days later, not minutes later.

#!/usr/bin/env bash
set -euo pipefail

# Contract phase (example)
# 1) Verify usage is gone (application metrics/logs should show 0 reads/writes to old field)
# 2) Apply cleanup with short lock timeouts and clear failure behavior.

export PGOPTIONS="-c lock_timeout=3s -c statement_timeout=5min"

# Drop old column (after a stability window)
psql "$DATABASE_URL" -v ON_ERROR_STOP=1 <<'SQL'
ALTER TABLE users
  DROP COLUMN full_name;
SQL

echo "Contract complete."

When to use online schema change tools

If you must change a hot table in-place (e.g., certain type changes or large table rewrites), consider “online schema change” approaches: create a shadow table, sync changes, then swap. Many ecosystems have tools for this (often called “ghost/online” migrations). They add complexity, but they can reduce lock time in exchange for extra disk/CPU and careful orchestration.

Common mistakes

These are the patterns behind “the migration looked fine in staging.” The fixes are usually simple—if you plan them in advance.

Mistake 1 — Combining schema and app cutover into one step

“Deploy code + rename column + backfill + drop old column” is a single point of failure.

  • Fix: split into Expand → Deploy compatibility → Backfill → Cutover → Contract.
  • Fix: add a feature flag for the cutover, not for the schema change.

Mistake 2 — Adding NOT NULL or heavy defaults on a big table

Some engines rewrite the table or take strong locks to verify or fill defaults.

  • Fix: add nullable → backfill → validate → enforce.
  • Fix: enforce in the app first, then in the database.

Mistake 3 — Building indexes in the most blocking way

Index builds can saturate IO and block writes depending on the method.

  • Fix: use concurrent/online index build options when available.
  • Fix: run off-peak and monitor IO and query latency.

Mistake 4 — Running a massive backfill in one transaction

Big updates create long locks, replication lag, and painful rollbacks.

  • Fix: batch by primary key, commit each batch.
  • Fix: throttle and pause automatically when lag/latency rises.

Mistake 5 — Forgetting about long-running transactions

A single long transaction can make DDL wait indefinitely (and you won’t notice until it’s too late).

  • Fix: check for long/idle-in-transaction sessions before migrating.
  • Fix: set lock timeouts so the migration fails fast instead of hanging.

Mistake 6 — Dropping the old column too soon

Rollback becomes hard if you remove the old schema before you’re confident in the new path.

  • Fix: keep a stability window (hours/days) before Contract cleanup.
  • Fix: measure “old path usage” and require it to be zero.

Mistake 7 — Changing “meaning” without versioning

Schema changes are often semantic changes: enums, states, or business rules shift.

  • Fix: version the meaning in code (new enum value, new column) and migrate gradually.
  • Fix: keep old meanings readable until you complete the transition.

Mistake 8 — No rollback plan (or a rollback that makes it worse)

“We’ll just roll back” doesn’t work if the rollback requires another risky migration.

  • Fix: plan rollback at each phase: Expand is easy, Contract is harder.
  • Fix: design cutovers so rollback is a feature flag flip, not DDL.

The simplest anti-incident rule

If a migration step can block production traffic, it must have: (1) a timeout, (2) monitoring, and (3) a safe abort path. No exceptions.

FAQ

Can I rename a column without downtime?

Treat “rename” as “create a new name.” Add a new column, dual-write (or sync via trigger), backfill, then switch reads. Only drop the old column during the Contract phase after a stability window. This avoids breaking old app versions and avoids risky in-place operations.

How do I add a NOT NULL constraint safely?

Do it in stages: add the column as nullable, backfill existing rows, update the application to always write a value, then enforce NOT NULL once you can prove there are no nulls left. This sequence avoids heavy locks and avoids blocking writes.
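On PostgreSQL specifically, the final "enforce" step can itself be staged so that no single statement scans the table under a strong lock. A sketch of the statement sequence, generated here as strings (verify lock behavior on your engine and version):

```python
def not_null_stages(table: str, column: str) -> list:
    """Staged NOT NULL enforcement (PostgreSQL-flavored sketch)."""
    c = f"{table}_{column}_not_null"
    return [
        # 1) NOT VALID: brief lock, no table scan yet.
        f"ALTER TABLE {table} ADD CONSTRAINT {c} "
        f"CHECK ({column} IS NOT NULL) NOT VALID;",
        # 2) Validate existing rows under a weaker lock; writes continue.
        f"ALTER TABLE {table} VALIDATE CONSTRAINT {c};",
        # 3) Promote to a real NOT NULL (on PostgreSQL 12+ the validated
        #    check lets this skip another full scan).
        f"ALTER TABLE {table} ALTER COLUMN {column} SET NOT NULL;",
        # 4) The check constraint is now redundant.
        f"ALTER TABLE {table} DROP CONSTRAINT {c};",
    ]
```

Run each stage with the usual lock/statement timeouts, and only after the backfill has removed every NULL.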

What’s the safest way to change a column’s type?

Use a “shadow column” of the new type. Write to both (or compute new from old), backfill historical data in batches, switch reads, then remove the old column. Type changes that rewrite a large table are a common source of downtime.

Should I run migrations as part of the app deploy?

It depends. Additive, low-risk migrations can run automatically. Risky migrations (type changes, big constraints, large backfills) should be run as a controlled operation with monitoring and clear abort rules. A good compromise is: run “Expand” automatically, run “Backfill” as a job, and run “Contract” as a separate cleanup deploy.

How do I avoid breaking older app versions during rollout?

Make Expand changes backward compatible, and make your app forward compatible with feature flags and tolerant reads. During the transition, ensure the DB contains both representations (old and new) so any running version can operate safely.

What if I have multiple services writing to the same table?

You need a coordination strategy. Either (1) update all services to dual-write during the transition, or (2) use database-side synchronization (trigger) so writers don’t need to change simultaneously. Keep the transition window short and document ownership of the cutover decision.

When is “maintenance window” still the right answer?

If the change is fundamentally disruptive, you can’t tolerate extra complexity, or your system has low traffic and high safety requirements, a small maintenance window can be rational. The key is to be honest: if the migration requires a long lock or a table rewrite, schedule it intentionally rather than pretending it’s “online.”

Cheatsheet

Keep this open during deploys. It’s the “do the boring safe thing” checklist for schema migrations without downtime.

The default runbook

  • Preflight (know the risks): check long transactions, lag, and peak traffic; define a lock budget and rollback triggers.
  • Expand (add new schema): additive changes only; keep locks short; prefer online index builds.
  • Compatibility deploy (app works with both shapes): dual-read, new then old; dual-write or trigger; feature flag for the cutover.
  • Backfill (migrate old rows): batch and commit; throttle; stay idempotent; monitor lag/latency; pause safely.
  • Cutover (switch to new): canary first; observe; ramp; stop dual-write only after you’re confident.
  • Contract (cleanup): wait for stability; prove old-path usage is zero; drop old columns/indexes last.

Safety switches

  • Lock timeout set (seconds)
  • Statement timeout set (minutes)
  • Migration can be safely retried
  • Feature flag exists for cutover
  • Backfill has pause/resume and progress logs

“Stop the migration” signals

  • p95/p99 latency spikes beyond budget
  • Error rates rise above threshold
  • Replication lag grows continuously
  • Lock waits or deadlocks increase
  • Database CPU/IO is saturated

If you only memorize one pattern

Prefer additive schema changes, keep old behavior working until the new path is proven, and do data movement in batches. That’s 90% of “schema migrations without downtime.”

Wrap-up

Reliable schema evolution is a skill—and it’s mostly operational discipline, not fancy SQL. If you adopt one habit, make it this: never rely on a single “big switch” in production. Expand safely, migrate gradually, cut over with observation, then clean up later.

What to do next

  • Pick one upcoming migration and rewrite it into Expand → Backfill → Contract steps.
  • Add a lock timeout/statement timeout policy to your migration runner.
  • Build a standard backfill job template (batching + throttling + idempotency).
  • Add “old path usage” metrics so Contract cleanup is safe and automatic.

A quick self-audit

  • Do we have a documented migration runbook?
  • Can we safely stop a migration mid-way?
  • Can we roll back a cutover without DDL?
  • Do we delay Contract cleanup until after stability?

If you want more depth, topics worth exploring next are SQL performance, change data capture (CDC), and the Postgres features that often show up in migration planning.

Quiz

Quick self-check. Each question is answered in the sections above.

1) Which sequence best matches the safest pattern for schema migrations without downtime?
2) What is the purpose of a “lock budget” (e.g., lock timeout) in production migrations?
3) Why is batched backfill usually safer than a single massive UPDATE?
4) When is it safest to drop the old column in an expand/contract migration?