All posts

Database Migrations Without Downtime

EngineeringArchitecture

Schema migrations cause more production incidents than most teams want to admit. Not because the SQL is wrong, but because code and schema changes get deployed together in a way that leaves a window — sometimes just a few seconds — where the running application and the database are out of sync.

That window is when things break.

The root cause most teams miss

When you add a new NOT NULL column to a Postgres table with existing rows, the common instinct is to write one migration file: ALTER TABLE orders ADD COLUMN shipped_at TIMESTAMPTZ NOT NULL DEFAULT now(). Deploy it alongside the code that reads the new column. Done.

The problem appears when you have more than one app server. During a rolling deploy, old code and new code run simultaneously. Old code doesn't know about shipped_at. Depending on your ORM and how it constructs queries, that can cause errors for every request that hits an old instance — or worse, it can silently corrupt data before you notice.

And that's the easy case. Adding a NOT NULL column without a default on a large table will lock the entire table while Postgres rewrites it. Hundreds of milliseconds on a small table, minutes on a large one. Your app isn't down — it's just hanging. Same result.

The pattern: expand, migrate, contract

The fix is to treat schema and code changes as separate, ordered deployments. We call this the expand/contract pattern:

Step 1 — Expand: add the new column as nullable, with no constraints. Deploy this migration alone, before any application code changes. Postgres adds the column in milliseconds. No locks, no downtime. Old code runs against the new schema without issue because it ignores columns it doesn't know about.

Step 2 — Migrate: backfill existing rows, either in a one-time script or a background job that processes in batches. For large tables, batch writes in chunks of a few thousand rows with a short sleep between them. This keeps write pressure off the primary replica and avoids replication lag.

Step 3 — Deploy the application: now ship the code that reads and writes the new column. At this point, all rows already have a value and the schema is stable. No gap. No window.

Step 4 — Contract: once the new code has been running cleanly for a few days, add the NOT NULL constraint (using ADD CONSTRAINT ... NOT VALID followed by VALIDATE CONSTRAINT, which validates without a full table lock in Postgres 12+). Drop old columns or old indexes only after you're certain nothing reads them.

A real example

We ran this recently on a project where an orders table had grown to roughly 4 million rows. The requirement was to add a fulfillment_region column, derived from shipping address data that already existed in the row.

Doing it in one migration would have locked the table for about 90 seconds during peak hours. Instead:

  1. We added the nullable column in a 5ms migration deployed at 2am.
  2. A background worker processed the backfill in batches of 5,000 rows over the next two hours, during low traffic.
  3. The application code shipped the next morning during business hours. Zero incidents.
  4. The NOT NULL constraint was added three days later using NOT VALID + VALIDATE CONSTRAINT.

Total user-visible downtime: none.

Why this matters when evaluating a software studio

Every team will tell you they do deployments safely. Ask them specifically how they handle schema changes on tables with millions of rows. Ask how they coordinate application and database deploys across multiple running instances.

The answer tells you whether they've actually shipped production systems under real load — or whether they've only worked on products small enough that the bad patterns didn't bite them yet.

Migrations that cause downtime aren't caused by laziness. They're caused by a workflow that was never designed to handle scale. The expand/contract pattern adds a little overhead at the start: you write two migrations instead of one, and you coordinate two deployments instead of one. That overhead pays for itself the first time you don't have a midnight incident.