A practical guide to zero-downtime deployments
How to ship changes to a live service without dropping a single request — strategies, trade-offs, and a checklist that works in the real world.
Last updated: 2026-06-24
Downtime during a deploy is rarely a tooling problem — it is usually a sequencing problem. The moment you start treating a release as a series of carefully ordered steps rather than a single “flip the switch” event, most outages disappear. This guide walks through the deployment patterns we use to push changes to production services without interrupting traffic, and the practical trade-offs of each.
Start with health checks that mean something
Every zero-downtime strategy depends on the load balancer knowing, accurately, whether an instance is ready to serve. A health check that simply returns 200 from the web server is not enough: it should confirm the application has finished booting, has an open database connection, and can serve a real (cheap) request. Separate your “liveness” check (is the process alive?) from your “readiness” check (should it receive traffic yet?). Sending traffic to an instance that is still warming up is one of the most common causes of error spikes during an otherwise clean deploy.
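The liveness/readiness split can be sketched in a few lines. This is a minimal illustration, not a drop-in handler: the function names and the shape of the `checks` mapping are assumptions made for this example, and the individual checks (boot flag, database ping, cheap request) are left for you to supply as callables.

```python
from typing import Callable, Dict, Tuple


def liveness() -> Tuple[int, str]:
    # Liveness only answers "is the process running?". It never touches
    # dependencies, so a slow database cannot cause a restart loop.
    return 200, "alive"


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    # Readiness answers "should this instance receive traffic yet?".
    # Every check must pass; a check that raises counts as a failure.
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results
```

Returning the per-check results alongside the status code makes it easy to see which dependency is holding an instance out of rotation.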
Rolling deployments
The simplest approach is a rolling update: replace instances a few at a time, waiting for each new instance to pass its readiness check before draining and retiring an old one. It needs no extra infrastructure and keeps capacity roughly constant. The catch is that during the roll, two versions of your code are serving simultaneously, so the new version must be backward-compatible with the old one — especially at the database layer.
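The ordering that matters in a rolling update is: launch, wait for readiness, and only then drain the old instance. The sketch below simulates that loop with injected callables. All names here (`rolling_update`, `launch`, `is_ready`, `drain_and_retire`) are hypothetical, and a real orchestrator would poll readiness with a timeout rather than checking once.

```python
from typing import Callable, List


def rolling_update(
    old_instances: List[str],
    batch_size: int,
    launch: Callable[[str], str],
    is_ready: Callable[[str], bool],
    drain_and_retire: Callable[[str], None],
) -> List[str]:
    """Replace instances batch by batch and return the final in-service list."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    in_service = list(old_instances)
    for start in range(0, len(old_instances), batch_size):
        batch = old_instances[start:start + batch_size]
        replacements = [launch(name) for name in batch]
        failed = [name for name in replacements if not is_ready(name)]
        if failed:
            # Halt the roll. The old instances in this batch were never
            # drained, so they keep serving traffic.
            raise RuntimeError(f"not ready, roll halted: {failed}")
        for old_name, new_name in zip(batch, replacements):
            in_service.append(new_name)   # new instance joins first
            drain_and_retire(old_name)    # then the old one is drained
            in_service.remove(old_name)
    return in_service
```

Because a failed readiness check stops the loop before anything is drained, a bad release costs you at most one batch of wasted launches rather than lost capacity.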
Blue-green and canary
Blue-green keeps two complete environments. You deploy to the idle one, verify it, then switch the router over in a single step — with an equally fast switch back if something is wrong. It is the most predictable rollback you can buy, at the cost of running double the infrastructure during a release.
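As a rough model of the switch, the sketch below treats the router as a pointer to one of two environments. The `BlueGreenRouter` class and its method names are invented for illustration; in practice the "pointer" is a load balancer target, a DNS record, or a service selector.

```python
from typing import Callable, Dict


class BlueGreenRouter:
    def __init__(self, live_version: str) -> None:
        # "blue" starts as the live environment; "green" starts empty.
        self.versions: Dict[str, str] = {"blue": live_version, "green": ""}
        self.active = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.active == "blue" else "blue"

    def deploy_to_idle(self, version: str) -> None:
        # Deploying never touches the environment that is serving traffic.
        self.versions[self.idle] = version

    def switch(self, verify: Callable[[str], bool]) -> bool:
        # Flip traffic only if the idle environment passes verification.
        candidate = self.versions[self.idle]
        if not candidate or not verify(candidate):
            return False
        self.active = self.idle
        return True

    def rollback(self) -> None:
        # The previous environment is still intact, so rollback is the
        # same single flip in the other direction.
        self.active = self.idle
```

The point of the model is that `switch` and `rollback` are the same cheap operation, which is why the previous environment should stay untouched until the new one has proven itself.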
Canary releases send a small slice of traffic (say 5%) to the new version, watch error rates and latency, and only widen the rollout if the metrics stay healthy. Canary is the best early-warning system of the three, but it requires real observability — if you cannot compare the canary’s error rate to the baseline in near real time, you are just guessing.
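The comparison against the baseline can be as simple as the sketch below. The function name, the `min_requests` floor, and the `tolerance` value are assumptions chosen for illustration; sensible thresholds depend on your traffic volume and normal error rate, and a production gate would usually look at latency as well.

```python
def canary_verdict(
    baseline_errors: int,
    baseline_requests: int,
    canary_errors: int,
    canary_requests: int,
    min_requests: int = 500,
    tolerance: float = 0.005,
) -> str:
    """Return "wait", "rollback", or "promote" for the current window."""
    # Too little traffic on either side means the rates are noise.
    floor = max(min_requests, 1)
    if baseline_requests < floor or canary_requests < floor:
        return "wait"
    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests
    # Roll back if the canary is worse than baseline by more than the
    # allowed absolute margin.
    if canary_rate > baseline_rate + tolerance:
        return "rollback"
    return "promote"
```

The explicit "wait" outcome matters: widening a rollout on the strength of a few dozen requests is the guessing the canary is supposed to prevent.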
The database is where deploys actually break
Code rolls back in seconds; schema changes do not. The discipline that makes everything above safe is the expand-and-contract pattern. First expand: add the new column or table in a backward-compatible migration and deploy code that can read both shapes. Then migrate the data. Only once the old code is fully gone do you contract: drop the obsolete column. Never rename a column in a single step on a live system — add the new one, backfill, switch reads, then remove the old one across separate releases.
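Here is one way the rename could look as separate steps, using SQLite so the sketch runs anywhere. The `users` table and the `fullname` and `display_name` columns are hypothetical, and each function stands for a change that would ship in its own release. Note that `DROP COLUMN` needs SQLite 3.35 or newer.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    # Release 1: additive, backward-compatible. Old code ignores the column.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def read_name_during_transition(conn: sqlite3.Connection, user_id: int):
    # Transitional code reads both shapes: prefer the new column and
    # fall back to the old one for rows not yet backfilled.
    row = conn.execute(
        "SELECT COALESCE(display_name, fullname) FROM users WHERE id = ?",
        (user_id,),
    ).fetchone()
    return row[0] if row else None


def backfill(conn: sqlite3.Connection) -> None:
    # Release 2: copy existing data across. Safe to re-run.
    conn.execute(
        "UPDATE users SET display_name = fullname WHERE display_name IS NULL"
    )


def contract(conn: sqlite3.Connection) -> None:
    # Final release: run only once no deployed code references fullname.
    conn.execute("ALTER TABLE users DROP COLUMN fullname")
```

On a large table the backfill would normally run in batches to avoid long locks; the single `UPDATE` here is only meant to show where the step sits in the sequence.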
A checklist before you ship
Confirm readiness checks gate traffic; confirm the new release is backward-compatible with the currently-running one; run schema changes as additive migrations ahead of the code that needs them; make sure in-flight requests are allowed to drain before an instance is killed; and rehearse the rollback so it is a known, boring step rather than an improvisation. Get these five right and the deployment strategy you choose becomes a detail rather than a risk.
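Of the five items, draining is the one most often left to chance, so here is a minimal sketch of what "allowed to drain" can mean inside a process. The `RequestDrainer` class is hypothetical: it counts in-flight requests, refuses new ones once shutdown begins, and waits up to a deadline for the count to reach zero.

```python
import threading


class RequestDrainer:
    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._accepting = True

    def try_start(self) -> bool:
        # Call at the start of each request; False means "shutting down".
        with self._cond:
            if not self._accepting:
                return False
            self._in_flight += 1
            return True

    def finish(self) -> None:
        # Call when a request completes, whether it succeeded or failed.
        with self._cond:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._cond.notify_all()

    def drain(self, timeout: float) -> bool:
        # Stop accepting work, then wait for in-flight requests to end.
        # Returns False if the deadline passed with requests still running.
        with self._cond:
            self._accepting = False
            return self._cond.wait_for(
                lambda: self._in_flight == 0, timeout=timeout
            )
```

A typical use would be to call `drain` from the termination signal handler, with a timeout shorter than the grace period your platform gives the process before it is killed.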