The transition from MVP to production is the moment most engineering organizations either earn their next round of growth or quietly accumulate the debt that will define the next two years. The work is unglamorous. It is also non-negotiable. The MVP got you to product-market fit. Production keeps you there when traffic doubles, a region goes down, and a security researcher emails the CEO about an exposed API at 11pm on a Friday.
This checklist is the one we use during readiness reviews. It is opinionated about what matters, what can wait, and what is so often skipped that it has become the leading cause of preventable post-launch incidents. Treat it as a sequence, not a menu.
Observability Before Anything Else
You cannot operate what you cannot see. Before you take real customer traffic, three signals must exist for every service: structured logs with a request ID that propagates across service boundaries, metrics for the four golden signals (latency, traffic, errors, saturation), and traces that cover at least the critical user paths. The specific stack matters less than the discipline. Datadog, New Relic, Grafana Cloud, Honeycomb, and the open source combination of OpenTelemetry plus Prometheus plus Loki plus Tempo all work. What does not work is relying on print statements and hope.
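As a minimal sketch of the first signal, the snippet below emits one JSON object per log line and carries a request ID through the current request context. It uses only the Python standard library. The `X-Request-ID` header name and the helper names `start_request` and `outgoing_headers` are assumptions for illustration, not something any particular vendor requires.

```python
import contextvars
import json
import logging
import uuid

# Holds the request ID for the current request context (safe across threads and async tasks).
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": request_id_var.get(),
        })


def start_request(incoming_headers):
    """Reuse the caller's request ID if one was sent, otherwise mint a new one."""
    request_id = incoming_headers.get("X-Request-ID") or uuid.uuid4().hex
    request_id_var.set(request_id)
    return request_id


def outgoing_headers():
    """Headers to attach to downstream calls so the ID crosses service boundaries."""
    return {"X-Request-ID": request_id_var.get()}
```

Attaching `JsonFormatter` to a handler and calling `start_request` at the edge of each service is enough to make a single request searchable across every hop that forwards `outgoing_headers()`.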
The honest test of observability maturity is how long it takes a new on-call engineer to answer three questions: what changed in the last hour, where the error is coming from, and which users are affected. If the answers take more than 10 minutes, the dashboards are not real yet.
Alerting With a Signal-to-Noise Discipline
Most production failures are not about missing alerts. They are about alert fatigue. The MVP team that wires up Slack notifications for every error code is the production team that ignores the one alert that mattered. The discipline is to alert on symptoms users feel, not on every internal anomaly.
- Page-level alerts: error rate above SLO, latency above SLO at the 95th or 99th percentile, key user flow failure rate, payment processing failures, authentication failures above baseline.
- Ticket-level alerts: capacity headroom dropping, certificate expiry approaching, dependency end-of-life, anomalous spend.
- Suppress: per-instance crashes that auto-recover, transient downstream errors below threshold, log-level errors that are not user-visible, anything that fired more than three times in the last week without an action being taken.
Every page should have a runbook. Every alert that fires more than twice without action should be deleted or downgraded. The goal is that when a page wakes someone up at 2am, they trust it.
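The page-level rule above (error rate or tail latency above SLO) can be sketched as a small check over one evaluation window. This is an illustration only: the function names and the thresholds passed in are hypothetical, and a real setup would usually express the same rule in the alerting system's own query language.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def should_page(latencies_ms, error_count, slo_p99_ms, slo_error_rate):
    """Page only when users feel it: error rate or p99 latency is beyond the SLO."""
    total = len(latencies_ms)
    if total == 0:
        return False  # no traffic in the window, nothing a user could have felt
    error_rate = error_count / total
    return error_rate > slo_error_rate or percentile(latencies_ms, 99) > slo_p99_ms
```

Note what the check ignores: individual instance crashes and internal anomalies never appear in it unless they move one of the two user-facing numbers.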
On-Call That People Will Actually Honor
On-call is a contract between the company and its engineers. The contract has to be sustainable or it will be quietly broken. A workable on-call rotation in a small team has at least four engineers, ideally six. Rotation is one week at a time, with a primary and a secondary. Pages outside business hours are tracked and reviewed. If pages exceed two per week pe
PagerDuty, Opsgenie, Incident.io, and Rootly all do the job. The tooling is not the limiting factor. The limiting factor is whether leadership treats on-call as core engineering work or as something that happens after the real work. Compensation, time off, and the psychological weight of being the person responsible all need to be acknowledged explicitly.
Runbooks That Survive the First Real Incident
A runbook is not documentation about how a system works. It is a sequence of steps a tired engineer can follow at 3am to restore service. The format that survives reality includes the alert it responds to, the symptoms to confirm, the immediate mitigation steps, the diagnostic queries that confirm root cause, the rollback procedure, and the escalation contacts. Three sentences of prose at the top explaining what the system does is enough. Anything more becomes outdated and ignored.
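One way to keep runbooks in that shape is to treat the sections as required fields and check for gaps automatically. The sketch below is a hypothetical template, not a standard format: the `Runbook` class and its field names are assumptions that mirror the list in the paragraph above.

```python
from dataclasses import dataclass, fields


@dataclass
class Runbook:
    alert: str                  # the alert this runbook responds to
    summary: str                # a few sentences on what the system does
    symptoms: list              # what to confirm before acting
    mitigation_steps: list      # immediate steps to restore service
    diagnostic_queries: list    # queries that confirm root cause
    rollback_procedure: list    # how to undo the last change
    escalation_contacts: list   # who to call when the steps do not work


def missing_sections(runbook):
    """Return the names of sections that are empty, in declaration order."""
    return [f.name for f in fields(runbook) if not getattr(runbook, f.name)]
```

Running `missing_sections` over every runbook in CI is a cheap way to catch the runbook that amounts to "check the logs and figure it out."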
The honest test is whether someone who has never seen the system can follow the runbook to recovery. If the runbook reads “check the logs and figure it out,” it is not a runbook.
Capacity Planning Without a Spreadsheet Cult
You do not need a formal capacity model in the first year. You do need to know three numbers: peak traffic in the last 30 days, headroom on every tier of the stack, and the time it takes to scale each tier when you need to. For most cloud-native architectures, this means knowing your database connection limits, your worker pool sizes, your rate limits to downstream APIs, and the cold-start time of your autoscaling groups or container platforms.
The two failure modes to avoid are scaling everything to the moon (expensive and hides design problems) and assuming autoscaling will save you (it does not when the bottleneck is a relational database, a third-party API, or a queue depth limit). A 30-minute monthly review of headroom is enough discipline at this stage.
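The monthly headroom review can be as small as the sketch below, which compares 30-day peak against the known limit on each tier. The tier names, the numbers in the usage note, and the 20 percent threshold are made-up assumptions for illustration.

```python
def headroom(peak, limit):
    """Fraction of capacity still unused at peak (0.25 means 25 percent spare)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return 1 - peak / limit


def tiers_below(tiers, min_headroom):
    """Names of tiers whose headroom at peak is under the threshold, sorted."""
    return sorted(
        name
        for name, (peak, limit) in tiers.items()
        if headroom(peak, limit) < min_headroom
    )
```

For example, `tiers_below({"db_connections": (180, 200), "workers": (40, 100)}, 0.2)` flags only the database connections, which is exactly the kind of bottleneck autoscaling does not fix.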
Security Review With Real Coverage
Pre-production security is the step teams skip most often and pay for most dearly. The minimum viable review covers authentication and session handling, authorization on every endpoint that touches customer data, secret management with no credentials in source control, dependency scanning for known CVEs, container image scanning, network egress controls, encryption at rest and in transit, audit logging for sensitive actions, and a documented disclosure policy with a real inbox.
External penetration testing is worth the cost before any launch that involves payment data, healthcare data, or anything regulated. For everything else, an internal review with someone who knows OWASP cold and a tool like Snyk, Semgrep, or Trivy in CI catches most of the preventable issues. SOC 2 readiness is a separate workstream and does not belong in the launch checklist unless customers are explicitly asking.
Backup, Restore, and Data Recovery That You Have Actually Tested
Backup
For object storage, versioning and cross-region replication are cheap and worth enabling. For databases, point-in-time recovery should be enabled by default on RDS, Cloud SQL, Aurora, or whatever managed offering you use. Self-managed Postgres without tested PITR is an outage waiting to happen.
Blue-Green or Progressive Deployment
Production deploys should not be a stop-the-world event. The minimum acceptable pattern is rolling deployments with health checks and an automatic rollback trigger. The better pattern is blue-green or canary deployments where new code receives a fraction of traffic before full cutover. Argo Rollouts, Flagger, AWS CodeDeploy, and the deployment primitives in Cloud Run, ECS, and Kubernetes all support this. Choose one, automate it, and make rollback a single command. If a senior engineer cannot roll back a bad deploy in under three minutes, the deployment story is not production-ready.
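The automatic rollback trigger behind a canary can be reduced to one comparison: is the new version failing noticeably more often than the old one? The sketch below shows that decision in isolation. It is not how Argo Rollouts, Flagger, or CodeDeploy implement their analysis, and the parameter names and default thresholds are assumptions chosen for illustration.

```python
def canary_decision(baseline_errors, baseline_total, canary_errors, canary_total,
                    min_requests=500, max_ratio=2.0, floor=0.001):
    """Return 'wait', 'promote', or 'rollback' for one canary evaluation."""
    if canary_total < min_requests:
        return "wait"  # too little canary traffic to judge either way
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    # Tolerate up to max_ratio times the baseline error rate, with a floor so a
    # near-zero baseline does not trigger a rollback on a single stray error.
    allowed = max(baseline_rate * max_ratio, floor)
    return "rollback" if canary_rate > allowed else "promote"
```

Whatever tool runs the deployment, the useful property is the same: the decision is made by a rule that was written down before the deploy, not by whoever happens to be watching the dashboard.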
What to Defer Without Apology
Equally important is the list of things that look mature but actively waste time at this stage. A formal SRE function with error budget policies, SLO documents, and capacity simulations is overhead before you have meaningful traffic. A unit test suite at 90 percent coverage is overhead when integration tests on critical paths catch most regressions. Multi-region active-active is overhead when a single region with backups satisfies your SLA. A service mesh is overhead when your service count is in the single digits. A custom internal developer platform is overhead when a well-curated CI template does the job.
The principle is to invest in the operations that pay off when the system is under stress and to defer the operations that pay off when the system is large. Conflating the two is how engineering teams build infrastructure that looks like Google’s and supports the traffic of a county fair website.
When This Checklist Applies
This checklist is sized for SaaS products in the seed-to-Series-B range with engineering teams between 5 and 50 people, taking real customer traffic for the first time. It assumes a cloud-native stack on AWS, GCP, or Azure with managed databases. It assumes you are not yet handling regulated data at scale.
When It Does Not
It does not apply to regulated workloads where compliance is the gating constraint. It does not apply to embedded systems or hardware-adjacent products where the deployment model is fundamentally different. It does not apply to internal tools with five users where most of this is overkill. And it does not apply to the rare engineering organization that has done this before and already has the muscle memory. For everyone else, this is the work that turns a launch into a business.