What To Do When You Blow Your Service Level Objectives

What do you do when you've had a few too many incidents and blown your error budget? Or had a pile of near-misses that burned the team out even though the user-facing SLO wasn't violated? What if the incident trigger was the infrastructure refactoring meant to improve, not harm, reliability & maintainability?

In this talk, two senior SREs describe the context for two sets of outages that caused our medium-sized startup to pause and re-evaluate our infrastructure plans. In one outage, we experienced a significant event before an immovable external deadline, and found a creative way to push the launch-related risk to a separate shard of our infrastructure and de-risk the rest of the SLOs. In the other outage, we scaled back the ambitions of a refactor of our Kafka cluster to give the team a break from incident fatigue, even though our SLOs had only partially burned.

Session Outline:

  • A tale of two incidents
  • What we learned