99.9% vs 99.99% SLAs explained (with downtime math)

Uptimera team · 7 min read

"We're targeting four nines" sounds great in a board deck. It sounds different when you do the math: over a 30-day month, 99.99% uptime allows 4 minutes 19 seconds of downtime in total. That includes every deploy, every database failover, every misconfigured DNS change, every cert renewal that didn't go cleanly.

This post walks through what each "nine" actually buys you, why each one is exponentially harder than the last, and how to pick an SLA you can actually keep.

The math: nines to minutes

Per 30-day month, here's what each tier allows in total downtime:

  • 99% (two nines): 7 hours, 12 minutes of downtime allowed per month.
  • 99.5%: 3 hours, 36 minutes.
  • 99.9% (three nines): 43 minutes, 12 seconds.
  • 99.95%: 21 minutes, 36 seconds.
  • 99.99% (four nines): 4 minutes, 19 seconds.
  • 99.999% (five nines): 25.9 seconds.
  • 99.9999% (six nines): 2.6 seconds. Effectively zero on any meaningful timescale.
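The figures above are all the same calculation: the period length multiplied by the fraction of time you are allowed to be down. A minimal sketch, assuming a 30-day month as in the list (the function name `allowed_downtime` is ours, purely for illustration):

```python
from datetime import timedelta


def allowed_downtime(sla_percent: float,
                     period: timedelta = timedelta(days=30)) -> timedelta:
    """Return the downtime budget for an SLA percentage over a period.

    Assumes a 30-day month by default; pass a different period for
    quarterly or yearly budgets.
    """
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period * (1 - sla_percent / 100)


if __name__ == "__main__":
    for sla in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999, 99.9999):
        budget = allowed_downtime(sla)
        print(f"{sla}%: {budget.total_seconds():.1f} seconds per 30 days")
```

Swapping the period for `timedelta(days=365)` gives the yearly budget, which is often the number that appears in contracts.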

What counts as downtime, anyway?

This is the question that turns SLAs from marketing copy into contractual landmines. Read any production SLA carefully and you'll find downtime is defined by what gets excluded:

  • Planned maintenance windows. Excluded if announced in advance — usually 48–72 hours' notice.
  • Customer-caused issues. Misconfigured DNS pointing at your service, hitting your own rate limits, etc.
  • Force majeure and upstream failures. An outage at a provider you depend on (AWS us-east-1 going down, for example) is usually treated as outside your control.
  • Below a duration threshold. Many SLAs only count outages lasting more than 5 minutes.

Your monitoring tool, by contrast, sees raw availability — every timeout, every retried request, every flap. The number you publish to customers is almost always more generous than the number your monitors report. That's OK as long as the definition of "down" is documented and consistent.
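To make the gap between the two numbers concrete, here is a rough sketch that computes raw availability and SLA availability from the same list of outages. The `Outage` record, the field names, and the 5-minute threshold are illustrative assumptions, not the terms of any particular contract:

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Iterable


@dataclass
class Outage:
    duration: timedelta
    planned: bool = False          # fell inside an announced maintenance window
    customer_caused: bool = False  # e.g. the customer's own DNS misconfiguration


def raw_downtime(outages: Iterable[Outage]) -> timedelta:
    """Everything the monitor saw, with no exclusions."""
    return sum((o.duration for o in outages), timedelta())


def sla_downtime(outages: Iterable[Outage],
                 min_duration: timedelta = timedelta(minutes=5)) -> timedelta:
    """Only outages that count under the (assumed) SLA exclusions."""
    return sum(
        (o.duration for o in outages
         if not o.planned
         and not o.customer_caused
         and o.duration > min_duration),
        timedelta(),
    )


def availability(downtime: timedelta,
                 period: timedelta = timedelta(days=30)) -> float:
    """Availability as a percentage of the period."""
    return 100 * (1 - downtime / period)


if __name__ == "__main__":
    month = [
        Outage(timedelta(minutes=20)),                # real incident
        Outage(timedelta(minutes=30), planned=True),  # announced maintenance
        Outage(timedelta(minutes=3)),                 # short flap
        Outage(timedelta(minutes=2)),                 # short flap
    ]
    print(f"raw: {availability(raw_downtime(month)):.3f}%")
    print(f"SLA: {availability(sla_downtime(month)):.3f}%")
```

In this made-up month the monitor sees 55 minutes of downtime, while only 20 minutes count against the SLA. Both numbers are legitimate as long as the exclusion rules are written down.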

Why each nine is exponentially harder

Roughly speaking, the operational cost of each tier:

  • 99% — "basically up." One server, decent monitoring, alert on death. Most side projects hit this without trying.
  • 99.9% — "competent single-cloud." Load balancer in front of multiple app instances, managed database with automated backups, monitoring with on-call rotation. This is the realistic ceiling for most SaaS on a single cloud region.
  • 99.99% — "active multi-region or multi-AZ." Failover paths actually tested. Database replicas. Deploys that don't cause noticeable downtime. Chaos engineering. An on-call team large enough to cover 24/7 without burnout. This is a different category of operational investment.
  • 99.999% — "the telco tier." Active/active across regions, dual vendors for critical dependencies, change-management processes that approach regulated industries. Almost no consumer SaaS actually delivers this in practice, regardless of what they claim.

What major providers actually commit to

For reference, published commitments range from 99.5% to 100%, and some of the four-nines figures are targets rather than contractual guarantees:

  • AWS EC2 Region SLA: 99.99% for a region (across multi-AZ); 99.5% for a single instance.
  • AWS S3: 99.9% SLA for S3 Standard; 99.99% is the design target, not the commitment.
  • Stripe API: 99.99% target (very high in practice).
  • GitHub Enterprise Cloud: 99.9%.
  • Cloudflare: 100% for paid Enterprise plans, with service credits when availability falls short; SLAs vary by service.

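If a 99.9% dependency sits in your critical path, your own availability is capped at roughly 99.9%, and hard dependencies multiply: two of them in series give about 99.8%. Committing to more than that means either engineering around the dependency (redundancy, fallbacks, graceful degradation) or accepting the credit risk.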
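The arithmetic behind this is short enough to sketch. The example below assumes failures are independent, which real outages often are not, so treat the results as rough bounds rather than predictions. The function names are ours:

```python
import math


def serial(*availabilities: float) -> float:
    """Availability when every component must be up (hard dependencies).

    Inputs and output are fractions between 0 and 1.
    Assumes independent failures.
    """
    return math.prod(availabilities)


def redundant(*availabilities: float) -> float:
    """Availability when any one component being up is enough.

    Assumes independent failures and perfect failover, both optimistic.
    """
    return 1 - math.prod(1 - a for a in availabilities)


if __name__ == "__main__":
    # Your app (99.95%) behind two hard 99.9% dependencies.
    print(f"serial:    {serial(0.9995, 0.999, 0.999):.4%}")
    # The same 99.9% dependency, duplicated across two providers.
    print(f"redundant: {redundant(0.999, 0.999):.4%}")
```

Chaining dependencies pulls the total below the weakest link, while redundancy is the only term in this model that pushes it above.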

How to pick an SLA you can keep

Three rules:

  • Measure first; promise second. Run multi-region uptime monitoring for at least a quarter before you put a number in a contract. If your current measured uptime is 99.7%, don't promise 99.99%.
  • Promise less than you deliver. Customers remember outages, not SLA credits. Publishing 99.9% while actually running at 99.97% is a competitive advantage that compounds.
  • Tie SLA to the things you control. Define the scope (which endpoints, which regions), define exclusions (maintenance, customer-caused), and define how it's measured. Vague SLAs become disputes.
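The first two rules can be turned into a quick sanity check on monitoring data. This is a sketch under our own assumptions: the list of boolean check results stands in for whatever your monitor exports, and the 2x headroom factor is a hypothetical rule of thumb, not an industry standard:

```python
from typing import Sequence


def measured_availability(checks: Sequence[bool]) -> float:
    """Percentage of successful checks (True means the check passed)."""
    if not checks:
        raise ValueError("need at least one check result")
    return 100 * sum(checks) / len(checks)


def safe_to_promise(measured: float, candidate_sla: float,
                    headroom_factor: float = 2.0) -> bool:
    """True if measured downtime leaves headroom under the candidate SLA.

    headroom_factor is an assumed safety margin: with 2.0, you only
    promise a tier whose downtime budget is at least twice what you
    actually measured.
    """
    measured_down = 100 - measured
    budget = 100 - candidate_sla
    return measured_down * headroom_factor <= budget


if __name__ == "__main__":
    # Hypothetical quarter of probe results: 3 failures in 10,000 checks.
    checks = [True] * 9997 + [False] * 3
    measured = measured_availability(checks)
    for candidate in (99.9, 99.95, 99.99):
        verdict = "ok" if safe_to_promise(measured, candidate) else "too risky"
        print(f"measured {measured:.2f}%, promise {candidate}%: {verdict}")
```

Counting checks treats every probe equally. If your checks run at uneven intervals, weighting by time between checks would give a fairer number.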

Why this means you need a monitor

You can't hit an SLA you don't measure. The whole point of uptime monitoring is to give you the raw data — separate from your customers' complaints, separate from your incident channel — to know whether the number you're publishing is actually true.

That's the entire reason Uptimera exists: independent, multi-region uptime measurement that you can point at when the customer asks "are you really at 99.9%?"