Database Migrations in Production: A Zero-Downtime Approach

Author David Park

Published on: February 3, 2023

It’s 3 PM on a Tuesday. Your migration is halfway through altering a 200-million-row table. The application is throwing 500 errors. Your CEO just messaged the engineering Slack channel with a screenshot of a blank page. The database is locked. You can’t roll back because the migration has been running for 40 minutes and canceling it would leave the table in a half-altered state.

Article Overview

Database Migrations in Production: A Zero-Downtime Approach

11 sections · Reading flow

01
Why Migrations Are Dangerous

→

02
The Cardinal Rule: Separate Deployment from…

→

03
Expand and Contract: The Core Pattern

→

04
Large Table Operations: The Techniques That…

→

05
Advisory Locks: Preventing Concurrent Migrations

→

06
Testing Migrations Before Production

→

07
Monitoring During Migrations

→

08
The Migration Runbook

→

09
Common Anti-Patterns We've Eliminated

→

10
Tooling That Helps

→

11
Multi-Database and Cloud-Managed Considerations

HARBOR SOFTWARE · Engineering Insights

We’ve been there. Not hypothetically — literally. This exact scenario happened to us in 2024, and it’s why we spent the next six months developing a rigorous zero-downtime migration practice that we now use on every Harbor Software project. This article documents the approach: the strategies, the tooling, the failure modes, and the hard-earned rules we follow to make sure nobody ever has to explain a production outage caused by a database migration again.

Why Migrations Are Dangerous

Database migrations are uniquely dangerous because they combine three properties that individually would be concerning and together are terrifying:

They’re irreversible in practice. Yes, you can write a “down” migration, but if your “up” migration took 40 minutes and corrupted interim data, the “down” migration is a theoretical comfort, not a practical safety net.
They hold locks. Most DDL operations in PostgreSQL and MySQL acquire locks that block reads, writes, or both. A lock that lasts milliseconds on your development database with 1,000 rows lasts minutes or hours on production with 200 million rows.
They run against live data. Unlike application deployments where you can blue-green or canary, a migration runs against the single source of truth. There’s no “deploy to 5% of the database.”

Understanding these properties is the foundation of zero-downtime migrations. Every technique we use is designed to mitigate one or more of them.

The Cardinal Rule: Separate Deployment from Migration

The most impactful change you can make to your migration practice is to stop running migrations as part of your deployment pipeline. When your CI/CD runs rails db:migrate or npx prisma migrate deploy as a deployment step, you’ve coupled application deployment with database changes. This means:

A slow migration blocks your deployment
A failed migration blocks your deployment
You can’t roll back the application without also rolling back the database
You can’t test the migration independently

Instead, treat migrations as a separate operational concern with their own testing, approval, and execution pipeline. At Harbor, we run migrations manually (or via a dedicated migration service) before deploying the application code that depends on them. The application code must be compatible with both the pre-migration and post-migration database state.

This is the expand-and-contract pattern, and it governs everything else in this article.

Expand and Contract: The Core Pattern

Every schema change follows a three-phase lifecycle:

Phase 1: Expand

Add the new structure alongside the old one. Do not remove or modify anything existing. After this phase, both old and new application code work correctly.

Phase 2: Migrate

Deploy application code that uses the new structure. Backfill data if needed. Verify everything works. This phase can last hours, days, or even weeks.

Phase 3: Contract

Remove the old structure once you’re confident the new one is correct and all consumers have been updated. This is the only phase where you remove something, and it happens long after the risky work is done.

Here’s a concrete example. Say you need to rename a column from user_name to display_name:

-- WRONG: This breaks the application instantly
ALTER TABLE users RENAME COLUMN user_name TO display_name;

-- RIGHT: Expand-and-contract approach

-- Phase 1: Expand (safe, no locks on reads)
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);

-- Phase 1b: Backfill existing data
UPDATE users SET display_name = user_name WHERE display_name IS NULL;
-- (For large tables, do this in batches — see below)

-- Phase 1c: Add trigger to keep both columns in sync
CREATE OR REPLACE FUNCTION sync_display_name()
RETURNS TRIGGER AS $$
BEGIN
  IF NEW.display_name IS NULL AND NEW.user_name IS NOT NULL THEN
    NEW.display_name := NEW.user_name;
  END IF;
  IF NEW.user_name IS NULL AND NEW.display_name IS NOT NULL THEN
    NEW.user_name := NEW.display_name;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tr_sync_display_name
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_display_name();

-- Phase 2: Deploy app code that reads/writes display_name
-- Wait. Verify. Monitor.

-- Phase 3: Contract (days or weeks later)
DROP TRIGGER tr_sync_display_name ON users;
DROP FUNCTION sync_display_name();
ALTER TABLE users DROP COLUMN user_name;

Yes, this is more work than a single RENAME COLUMN. That’s the trade-off. More work upfront, zero downtime, zero risk of data loss, and the ability to roll back at any phase by simply deploying the previous application version.

Large Table Operations: The Techniques That Actually Work

The strategies above handle structural changes safely. But what about operations on large tables — adding an index, altering a column type, or backfilling millions of rows? These are where most production outages happen.

Adding Indexes: CONCURRENTLY or Don’t Bother

A standard CREATE INDEX in PostgreSQL acquires a SHARE lock on the table, which blocks all writes for the duration of index creation. On a 200-million-row table, that can be hours.

-- DANGEROUS: Blocks writes for the entire duration
CREATE INDEX idx_users_email ON users (email);

-- SAFE: Doesn't block writes (takes longer but who cares)
CREATE INDEX CONCURRENTLY idx_users_email ON users (email);

CREATE INDEX CONCURRENTLY takes roughly 2-3x longer than a regular index creation, but it doesn’t hold any locks that block DML. The trade-off is worth it every single time in production.

One critical caveat: CONCURRENTLY cannot run inside a transaction. If your migration framework wraps everything in a transaction (many do), you need to run this migration outside the framework or configure it to skip the transaction wrapper for this specific migration.

# In a Rails migration, for example:
class AddEmailIndexToUsers < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!  # Critical: allows CONCURRENTLY
  
  def change
    add_index :users, :email, algorithm: :concurrently
  end
end

Altering Column Types: The Shadow Column Strategy

Changing a column’s type (e.g., INT to BIGINT, or VARCHAR(100) to TEXT) often requires a full table rewrite. In PostgreSQL, ALTER TABLE ALTER COLUMN TYPE rewrites every row and holds an ACCESS EXCLUSIVE lock for the duration — the most restrictive lock possible, blocking even reads.

The shadow column strategy avoids this entirely:

-- Step 1: Add a new column with the target type
ALTER TABLE orders ADD COLUMN amount_new BIGINT;

-- Step 2: Backfill in batches (see batched backfill section)
-- Step 3: Add trigger to sync writes to both columns
-- Step 4: Deploy app code that reads from amount_new
-- Step 5: Verify data consistency
-- Step 6: Rename columns (fast metadata-only operation in PG 12+)
ALTER TABLE orders RENAME COLUMN amount TO amount_old;
ALTER TABLE orders RENAME COLUMN amount_new TO amount;
-- Step 7: Clean up trigger and old column

Batched Backfills: The Only Way to Update Millions of Rows

Never run an unbounded UPDATE on a large table. It generates a massive write-ahead log, holds row locks for the entire duration, and can cause replication lag, vacuum issues, and OOM kills.

Instead, update in batches with explicit pauses:

-- Batched backfill script
DO $$
DECLARE
  batch_size INT := 5000;
  affected INT;
  total_updated INT := 0;
BEGIN
  LOOP
    UPDATE users
    SET display_name = user_name
    WHERE id IN (
      SELECT id FROM users
      WHERE display_name IS NULL
      LIMIT batch_size
      FOR UPDATE SKIP LOCKED  -- Don't block on locked rows
    );
    
    GET DIAGNOSTICS affected = ROW_COUNT;
    total_updated := total_updated + affected;
    
    RAISE NOTICE 'Updated % rows (% total)', affected, total_updated;
    
    EXIT WHEN affected = 0;
    
    -- Pause between batches to let replication catch up
    -- and avoid overwhelming the WAL
    PERFORM pg_sleep(0.1);
  END LOOP;
END $$;

The FOR UPDATE SKIP LOCKED is important — it prevents the backfill from blocking application writes on the same rows. If a row is currently locked by an application transaction, the backfill skips it and picks it up in a later batch.

We typically use a batch size of 5,000-10,000 rows with a 100ms pause between batches. This gives us a throughput of about 50,000-100,000 rows per second while keeping replication lag under 1 second and WAL generation manageable.

Advisory Locks: Preventing Concurrent Migrations

If two developers accidentally run migrations at the same time against the same database, the results are unpredictable and usually catastrophic. We use PostgreSQL advisory locks to ensure only one migration runs at a time:

-- At the start of every migration script
SELECT pg_advisory_lock(42);  -- 42 is our "migration lock" key

-- ... migration operations ...

-- At the end
SELECT pg_advisory_unlock(42);

Our migration runner wraps this automatically. If the lock can’t be acquired (because another migration is running), it fails immediately with a clear error message instead of proceeding and creating a race condition.

Testing Migrations Before Production

Every migration runs through three environments before touching production:

Local development: Against a database with seed data. This catches syntax errors and logic bugs.
Staging with production-scale data: We maintain a staging database that’s a recent anonymized snapshot of production. This catches performance issues — a migration that takes 2 seconds on 10,000 rows might take 2 hours on 200 million.
Production dry run: For high-risk migrations, we run a dry run on a read replica. This gives us exact timing and resource consumption metrics without affecting production.

# Our pre-migration checklist (enforced by CI)
- [ ] Migration tested locally
- [ ] Migration tested on staging (with timing logged)
- [ ] No ACCESS EXCLUSIVE locks held for more than 1 second
- [ ] All index creations use CONCURRENTLY
- [ ] All large updates use batched backfill
- [ ] Rollback procedure documented and tested
- [ ] Application code compatible with pre-AND-post-migration schema
- [ ] Monitoring alerts configured for migration-related metrics

Monitoring During Migrations

When a migration is running in production, we monitor these metrics in real time:

Lock wait time: Are application queries waiting for locks held by the migration? If so, the migration is too aggressive.
Replication lag: Is the replica falling behind? If lag exceeds 5 seconds, we pause the backfill.
Connection pool utilization: Is the migration consuming too many connections?
Query latency percentiles: p50 and p99 for the application’s most critical queries. Any degradation signals that the migration is affecting production performance.
WAL generation rate: A sudden spike in WAL write rate can indicate a runaway backfill that will eventually fill up disk space.

-- Real-time lock monitoring during migrations
SELECT
  blocked_locks.pid AS blocked_pid,
  blocked_activity.usename AS blocked_user,
  blocked_activity.query AS blocked_query,
  blocking_locks.pid AS blocking_pid,
  blocking_activity.usename AS blocking_user,
  blocking_activity.query AS blocking_query,
  now() - blocked_activity.query_start AS wait_time
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
  ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
  AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
  AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
  AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
  AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
  AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
  AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
  AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
  AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
  AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
  AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
  ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

The Migration Runbook

Every migration at Harbor Software follows this runbook. No exceptions, no shortcuts, no “this one is simple enough to skip the process.” The simple ones are where complacency causes incidents.

Before the Migration

Write the migration with expand-and-contract phases clearly separated
Write the rollback procedure for each phase
Test on local and staging environments
Log timing and resource usage on staging
Get a review from a second engineer
Schedule the migration during low-traffic hours (we use Tuesday/Wednesday 10 AM — never Fridays, never before holidays)
Notify the team in Slack with the migration plan, expected duration, and rollback procedure

During the Migration

Take a logical backup of affected tables (not a full pg_dump — just the relevant tables or a snapshot if using cloud-hosted PostgreSQL)
Open the monitoring dashboard on a second screen
Run the expand phase
Verify: table structure is correct, new columns exist, no errors in application logs
Run the backfill (if needed) — monitor lock waits and replication lag
Deploy application code that uses the new structure
Monitor for 24 hours minimum

After the Migration (Days Later)

Verify data consistency between old and new structures
Run the contract phase (remove old columns, triggers, temporary structures)
Update documentation and schema diagrams
Run ANALYZE on affected tables so the query planner has accurate statistics

Common Anti-Patterns We’ve Eliminated

“It’s a small table, we don’t need the process.” Small tables have foreign keys to large tables. A schema change on a 100-row lookup table can lock a 200-million-row transaction table if there’s a foreign key constraint being validated.

Running migrations during deployment. We covered this earlier, but it bears repeating. Deployment should deploy code. Migrations should run independently, with their own approval and monitoring.

NOT NULL constraints on existing columns. Adding NOT NULL to an existing column requires PostgreSQL to scan every row to verify the constraint. On large tables, this acquires an ACCESS EXCLUSIVE lock for the scan duration. Instead, add a CHECK constraint with NOT VALID, then validate it separately:

-- DANGEROUS: Full table scan with ACCESS EXCLUSIVE lock
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;

-- SAFE: Two-step approach
ALTER TABLE orders ADD CONSTRAINT orders_status_not_null
  CHECK (status IS NOT NULL) NOT VALID;  -- Instant, no scan

ALTER TABLE orders VALIDATE CONSTRAINT orders_status_not_null;
  -- Scans table but only holds SHARE UPDATE EXCLUSIVE lock
  -- (allows reads AND writes during validation)

Changing enum types in PostgreSQL. Adding a value to a PostgreSQL enum is fast and safe. Removing or renaming a value requires a full rewrite of every table that uses the enum. If you need flexible value sets, use a text column with a check constraint instead of an enum.

Tooling That Helps

We’ve evaluated and used several tools that make zero-downtime migrations easier:

pg_repack: Rebuilds tables online without exclusive locks. Useful for CLUSTER operations and removing bloat.
pgroll: A schema migration tool specifically designed for zero-downtime changes. It automates the expand-and-contract pattern with shadow columns and views.
strong_migrations (Ruby): A gem that detects dangerous migration patterns and suggests safe alternatives. We use it as a CI check — any migration with an unsafe operation fails the build.
squawk (SQL linter): Analyzes SQL migration files for operations that could cause downtime. Language-agnostic.

Multi-Database and Cloud-Managed Considerations

Everything above assumes you’re running PostgreSQL on infrastructure you control. If you’re using a managed service like AWS RDS, Google Cloud SQL, or Azure Database, some techniques need adjustment.

RDS and Cloud SQL impose connection limits that are lower than what a self-managed PostgreSQL would allow. Our batched backfill approach already accounts for this, but you also need to be aware that managed databases often have automated maintenance windows where they apply patches and may briefly restart. Never schedule a long-running migration during or near a maintenance window.

Read replicas can save you. Most managed PostgreSQL services offer read replicas with minimal setup. Use them as dry-run targets. Run your migration against the replica first, measure the exact timing and resource consumption, then run against the primary with confidence. The replica will resync automatically after you’re done.

Point-in-time recovery is your nuclear option. Before any high-risk migration, note the current timestamp. If the migration causes catastrophic data corruption that your backups don’t cover, you can restore the entire database to a point in time before the migration started. This is a last resort — it means losing all data written between the restore point and now — but knowing it exists lets you take calculated risks you wouldn’t otherwise take.

# RDS point-in-time recovery command (note the timestamp BEFORE migration)
aws rds restore-db-instance-to-point-in-time 
  --source-db-instance-identifier myapp-production 
  --target-db-instance-identifier myapp-recovery 
  --restore-time 2026-03-30T13:00:00Z

Zero-downtime migrations require more planning, more testing, and more patience than the alternative. But “the alternative” is production outages, data corruption, and late-night incident responses. We’ll take the extra planning every time.

If you’re dealing with a large database and worried about your migration strategy, or if you’re recovering from a migration-caused outage and want to make sure it doesn’t happen again, reach out. We’ve done this enough times to have opinions about all the edge cases.

Database Migrations in Production: A Zero-Downtime Approach

Why Migrations Are Dangerous

The Cardinal Rule: Separate Deployment from Migration

Expand and Contract: The Core Pattern

Phase 1: Expand

Phase 2: Migrate

Phase 3: Contract

Large Table Operations: The Techniques That Actually Work

Adding Indexes: CONCURRENTLY or Don’t Bother

Altering Column Types: The Shadow Column Strategy

Batched Backfills: The Only Way to Update Millions of Rows

Advisory Locks: Preventing Concurrent Migrations

Testing Migrations Before Production

Monitoring During Migrations

The Migration Runbook

Before the Migration

During the Migration

After the Migration (Days Later)

Common Anti-Patterns We’ve Eliminated

Tooling That Helps

Multi-Database and Cloud-Managed Considerations

You may also like

Leave a comment Cancel reply