What To Do When You Blow Your Service Level Objectives
What do you do when you've had a few too many incidents and blown your error budget? Or had a pile of near-misses that burned the team out even though the user-facing SLO wasn't violated? What if the incident trigger was the infrastructure refactoring meant to improve, not harm, reliability & maintainability?
In this talk, two senior SREs describe the context for two sets of outages that caused our medium-sized startup to pause and re-evaluate our infrastructure plans. In one outage, we experienced a significant event before an immovable external deadline, and found a creative way to push the launch-related risk to a separate shard of our infrastructure and de-risk the rest of the SLOs. In the other outage, we scaled back the ambitions of a refactor of our Kafka cluster to give the team a break from incident fatigue, even though our SLOs had only partially burned.
Session Outline:
- A tale of two incidents
- What we learned