Why “100% Uptime” is the wrong goal and how to build systems that embrace failure.
Most organizations treat reliability like insurance: a policy you buy after the house is already built to protect against disaster. This is a fundamental architectural flaw.
In modern distributed systems, reliability is not an operational afterthought—it is a product feature, just like your search bar, your checkout button, or your dark mode. If you are waiting for the SRE team to “fix” your uptime after deployment, you have already lost.
The difference between a fragile platform and a resilient one isn’t the quality of their on-call engineers; it’s the quality of their design decisions.
The “Digital Janitor” Trap
In many companies, the relationship between Development and Operations follows a predictable, toxic pattern:
- Developers are incentivized to ship features as fast as possible.
- Code is thrown “over the wall” to production.
- SREs catch the bugs, patch the memory leaks, and wake up at 3 AM to restart services.
In this model, SREs become Digital Janitors. We are expected to clean up the mess so the business can keep moving. But this creates a cycle of technical debt that eventually grinds velocity to a halt. You cannot “ops” your way out of a bad architecture. No amount of Kubernetes autoscaling will fix a database schema that wasn’t designed for concurrency.
The Hierarchy of Reliability
Before we can talk about “Chaos Engineering” or “AIOps,” we have to respect the hierarchy of needs. Just as you cannot reach self-actualization if you are starving, you cannot build a self-healing system if you don’t have basic monitoring.
- Monitoring: If you can’t see it, you can’t fix it.
- Incident Response: When it breaks, is there a process, or just panic?
- Post-Mortem: Do we learn from failure, or do we just blame the intern?
- Testing & Release: Are we preventing defects before they reach production?
- Capacity Planning: Do we know when we will hit the wall?
Too many teams try to jump straight to the top of the pyramid without solidifying the base.
The Error Budget: Innovation’s Currency
The most controversial opinion I hold is this: 100% uptime is a terrible goal.
If you are aiming for 100% reliability, you are over-engineering. You are slowing down feature velocity to protect against a failure that users might not even notice.
Instead, we should view reliability through the lens of an Error Budget. Think of your system’s stability like a bank account:
- 100% Uptime is not the goal; 99.9% (or whatever your users actually need) is the goal.
- The remaining 0.1% is your Budget.
If your service is running perfectly, you have a “surplus” in your budget. This means you should take risks! Push that experimental feature. Update that legacy backend. If it breaks and you burn a little error budget, that’s fine—that’s what the budget is for.
However, if you are constantly crashing and have “spent” your budget, you are bankrupt. The penalty is a Code Freeze. You stop shipping features and focus 100% of your engineering time on stability until the budget recovers.
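To make the arithmetic concrete, here is a minimal Python sketch, assuming a 99.9% SLO measured over a rolling 30-day window (the numbers and the hard code-freeze rule are illustrative, not a prescription):

```python
from datetime import timedelta

SLO = 0.999                  # the availability your users actually need
WINDOW = timedelta(days=30)  # rolling window the budget is measured over

def error_budget_report(observed_downtime: timedelta) -> dict:
    """Compare downtime burned against the budget implied by the SLO."""
    budget = WINDOW * (1 - SLO)            # 0.1% of 30 days is roughly 43 minutes
    remaining = budget - observed_downtime
    return {
        "budget_minutes": budget.total_seconds() / 60,
        "remaining_minutes": remaining.total_seconds() / 60,
        "freeze_releases": remaining <= timedelta(0),  # bankrupt: stop shipping
    }

# Example: 20 minutes of downtime so far this month against a 99.9% SLO
print(error_budget_report(timedelta(minutes=20)))
# -> about 43.2 minutes of budget, about 23.2 remaining, no freeze yet
```

The exact numbers matter less than the rule they encode: when the remaining budget goes negative, the riskiest thing you can do is ship another feature.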
This shifts the conversation from “Dev vs. Ops” to a shared business decision: Can we afford to take this risk right now?
Three Ways to “Shift Left”
If we agree that reliability is a feature, how do we build it earlier?
1. Observability-Driven Development. Developers usually write code, then tests, and then maybe think about logs. This is backward. “How will I know if this breaks?” should be one of the first questions asked. Metric emission should be part of the definition of done. If a function fails silently in the woods and no dashboard catches it, it’s a ticking time bomb. (A sketch of what this looks like follows this list.)
2. Paved Paths (Golden Paths). Don’t expect every developer to be a Terraform expert. Platform teams should build “Paved Paths”: pre-configured, secure-by-default infrastructure modules. If a developer uses the standard template, they get logging, alerting, and auto-scaling for free. Make the right way the easiest way. (See the second sketch below.)
3. Design for Graceful Degradation. Assume the database will fail. Assume the third-party API will time out. When these things happen, does your user see a white screen of death, or do they see a “cached” version of the page? Resilient systems don’t just stay up; they break partially without collapsing entirely. (See the third sketch below.)
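For the first point, here is a minimal sketch of a handler written “observability first.” The function, metric names, and logger are illustrative assumptions; the shape is what matters: every outcome, including failure, shows up somewhere a dashboard can see it.

```python
import logging
import time

logger = logging.getLogger("payments")

# Hypothetical in-process counters standing in for a real metrics client
# (statsd, Prometheus, etc.); the metric names are illustrative assumptions.
metrics = {"charge.success": 0, "charge.failure": 0}

def charge_card(order_id: str, amount_cents: int) -> bool:
    """Every outcome is counted and logged; nothing fails silently."""
    start = time.monotonic()
    try:
        # ... call the payment provider here ...
        metrics["charge.success"] += 1
        return True
    except Exception:
        # Count the failure and log enough context to debug it later.
        metrics["charge.failure"] += 1
        logger.exception("charge failed order_id=%s amount_cents=%d",
                         order_id, amount_cents)
        return False
    finally:
        logger.info("charge latency_ms=%.1f order_id=%s",
                    (time.monotonic() - start) * 1000, order_id)
```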
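For the second point, a sketch of what a paved path can look like. The helper below, its name, and its defaults are entirely hypothetical, not a real library; the idea is that a product team gets structured logging, graceful shutdown, and sane defaults by calling one function instead of copying boilerplate.

```python
import logging
import signal
from dataclasses import dataclass

@dataclass
class ServiceDefaults:
    """Settings every service gets for free by using the standard template."""
    log_level: int = logging.INFO
    request_timeout_s: float = 2.0   # a sane outbound timeout instead of "none"
    alert_channel: str = "#oncall"   # wired into the standard alerting pipeline
    graceful_shutdown_s: int = 30

def new_service(name: str, defaults: ServiceDefaults | None = None) -> logging.Logger:
    """Hand a team structured logging and graceful shutdown without asking."""
    defaults = defaults or ServiceDefaults()
    logging.basicConfig(
        level=defaults.log_level,
        format=f'service={name} level=%(levelname)s msg="%(message)s"',
    )
    logger = logging.getLogger(name)

    def _drain(signum, frame):
        logger.info("SIGTERM received, draining for %ss", defaults.graceful_shutdown_s)

    signal.signal(signal.SIGTERM, _drain)
    logger.info("booted with paved-path defaults: %s", defaults)
    return logger

# A product team's entire "ops setup" is one line:
log = new_service("checkout")
```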
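And for the third, a sketch of graceful degradation as a fallback chain, assuming a toy in-process cache and a stand-in database call that times out: fresh data when possible, stale data when necessary, a minimal placeholder as the last resort.

```python
# The cache here is a plain dict; in a real system it would be Redis or a CDN edge.
_page_cache: dict[str, dict] = {}

def fetch_from_database(product_id: str) -> dict:
    # Stand-in for the real query; illustrative only. Assume it can time out.
    raise TimeoutError("database did not answer in time")

def get_product_page(product_id: str) -> dict:
    """Serve fresh data when we can, stale data when we must."""
    try:
        page = fetch_from_database(product_id)
        _page_cache[product_id] = page
        return page
    except Exception:
        cached = _page_cache.get(product_id)
        if cached is not None:
            # Degraded but useful: mark it stale so the UI can say so.
            return {**cached, "stale": True}
        # Last resort: a minimal placeholder instead of a white screen of death.
        return {"product_id": product_id, "available": False, "stale": True}

print(get_product_page("sku-123"))  # falls back because the "database" times out
```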
Conclusion
Reliability isn’t a tool you buy from a vendor. It is a culture. It is the acceptance that failure is inevitable in complex systems, and that our job is not to prevent every failure, but to manage it so gracefully that the user never loses trust.