Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 15+ years of experience. She is an advocate at Honeycomb.io for the SRE and Observability communities, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights. She lives in Brooklyn with her wife, metamours, and a Samoyed/Golden Retriever mix, and in San Francisco and Seattle with her other partners. She plays classical piano, leads an EVE Online alliance, and advocates for transgender rights as a board member of the National Center for Transgender Equality.
Danyel Fisher's work centers on data analytics and information visualization. He focuses on how users can better make sense of their data; he supports data analysts, end users, and people who just happen to have had a lot of information dumped in their laps. His research perspective starts from a background in human-computer interaction and brings together cutting-edge data management techniques with new visualization algorithms.
All feature flags are set in our variables file, so we don't need to track, change, and rip out code every single time we want to turn something on or off. Similarly, we can also quarantine bad traffic. If one user's traffic is anomalously slow, or is crashing or causing issues for other users, we can segment it to its own set of servers, which we've spun up with a simple feature flag. That feature flag controls which paths are routed, how many servers are allocated, and even what build IDs are assigned. So we can set up a special debug instance for the traffic that's causing problems, and really investigate and get to the bottom of it without impacting other users. We also use continuous integration for our infrastructure code to have confidence that what's running in production actually matches what's in our config, so we can feel free to delete things that are not in our config-as-code, as well as remove any unused bits of config-as-code, knowing that there are no hidden dependencies. But sometimes this doesn't go entirely according to plan, so Danyel is going to tell us really quickly about an outage that we had and what lessons we learned from it.

- Back in July of this year, well, here's a graph of Honeycomb's performance on July 9th, 2019. As you can see, there was some sort of blip a little after 15:00, but that can't have been a very big deal, right? Well, let's zoom in a little and see how that actually looks. It looks like at about 3:50 PM, things started going badly, and by 3:55, whatever we're measuring here was down to zero traffic. It stayed down for a good 10 minutes, until about 4:05, when we were finally able to bring it back up. This is clearly bad, but how bad was it? Was this just a few-minute blip, or is this a terrible company-wide disaster? It's worth evaluating the notion of: how broken is too broken?

We quantify that with the idea of service level objectives. Service level objectives are a way of defining what it means for a system to be as successful as you want it to be. They're a common language that engineers can share with managers. For example, management might set a goal for what they want the reliability of the system to be, and engineers can figure out how they want to deploy their effort to make sure they maintain that level of reliability. 100% is an unrealistic number; no system will be able to stay there. So if you can come to an agreement on how close you need to get, then you can build a much more powerful and successful system.

SLO math, in the end, is actually super simple. We count the number of eligible events we've seen, how many things we're interested in. For example, we might decide that we're really interested in how our system serves HTTP requests, so we'll filter ourselves to only looking at HTTP requests. Then, of those, we'll define successful events. For example, we want events that were served successfully, with a code of 200 and in less than 100 milliseconds. Once you've got a pool of successful events and a pool of eligible events, you can simply compute the ratio. We define availability as the ratio of good events to eligible events. That's fairly straightforward math. And the wonderful thing is that we can use a time window and a target percentage to describe how well we're doing. So, for example, our target was 99% over the last month. Now, the wonderful thing about combining those two is that it gives us the idea of an error budget.
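To make that arithmetic concrete, here is a minimal sketch in Go. It is only an illustration of the math just described, not Honeycomb's actual code; the Event type and its fields are assumptions, while the 200-status, 100-millisecond success criterion and the 99% target come from the example above.

```go
package main

import (
	"fmt"
	"time"
)

// Event is a simplified record of one HTTP request. The type and its
// fields are illustrative assumptions, not Honeycomb's actual schema.
type Event struct {
	StatusCode int
	Duration   time.Duration
}

// availability returns the ratio of successful events to eligible events.
// Here every event passed in is treated as eligible, and "successful"
// means an HTTP 200 served in under 100 milliseconds, as in the example.
func availability(events []Event) float64 {
	eligible, good := 0, 0
	for _, e := range events {
		eligible++
		if e.StatusCode == 200 && e.Duration < 100*time.Millisecond {
			good++
		}
	}
	if eligible == 0 {
		return 1.0 // no eligible events: nothing has failed
	}
	return float64(good) / float64(eligible)
}

func main() {
	events := []Event{
		{StatusCode: 200, Duration: 40 * time.Millisecond},
		{StatusCode: 200, Duration: 250 * time.Millisecond}, // too slow
		{StatusCode: 500, Duration: 30 * time.Millisecond},  // server error
		{StatusCode: 200, Duration: 80 * time.Millisecond},
	}

	target := 0.99 // e.g. 99% over the last month
	avail := availability(events)

	// The error budget is the gap between the target and perfection;
	// what remains is that allowance minus the failures actually seen.
	budgetRemaining := (1 - target) - (1 - avail)

	fmt.Printf("availability %.2f against target %.2f, budget remaining %+.2f\n",
		avail, target, budgetRemaining)
}
```

With two of the four sample events failing the success criterion, availability comes out at 0.50 and the budget is far overspent, which is exactly the "over budget, prioritize stability" case described next.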
We can subtract the fraction of events that have actually gone wrong, that is to say, the number of unsuccessful events divided by the total number we've had, from the fraction we're allowed to have go wrong, to figure out how much flexibility we have. Sometimes we might be very close to, or over, our budget, in which case we really should prioritize stability and making sure that systems are reliable. But sometimes we have some error budget left over, and that actually allows us to experiment or to have a higher velocity of change, because when you've got an error budget, you can describe how acceptable it is for your system to not quite always succeed. We use the notion of SLOs to drive alerting on our system. When we see that an SLO is about to burn down, we'll page engineers so they can act before we run out of budget.

At Honeycomb, we'd done an exercise where we estimated what we wanted our SLOs to be, and we realized that we have three major sets of features. We want to store all incoming user telemetry; we in fact have a 99.99% target on that, because it's really important to us that user data not be lost. In contrast, we want our UI to be responsive, but we're willing to be a little bit more forgiving about that: our default dashboards should load in one second. We're even more forgiving of our query engine, because sometimes users do execute particularly complex or difficult queries, so we'll place that at 99%.

So now, to evaluate how that 12-minute outage looks, we really need to understand what sort of data we were seeing. Unfortunately, this is a graph of user data throughput, which means that this 12-minute gap was not only a gap for us; it actually shows up on every one of our users' dashboards, because for those 12 minutes we didn't accept their data. We dropped customer data. We were able to catch that this had happened, and we rolled back. Liz said earlier that it takes about 10 minutes to roll back, and that's precisely what it took here. During that time, we communicated with our customers, first to notify them that there was an outage, and then that it had been repaired. Over that time, we burned triple our error budget.

What do you do in this sort of situation? Well, first, we halted deploys. We stopped making any more changes until we felt that we were reliable. And then we stepped back to look at how it had happened. It turns out, when we traced this down, an engineer had checked in code that didn't build. Having successfully found the root cause, we fired them on the spot and washed our hands... (person laughing) Okay, fine. Checking in code that doesn't build shouldn't, of course, be a big deal, because, as I've talked about, we have a CI system. Unfortunately, at the time, we were playing with experimental CI build wiring, which happened to be willing to show a green button even for code that crashed. Of course, that's not a big deal, because it was generating zero-byte binaries, which of course would get stopped. It turns out that our scripts weren't watching for that condition and were very happy to deploy empty binaries. And at the time, we didn't have a health check or an automatic rollback, so when this happened, our system just very happily went down. That put us on a mission to reprioritize stability.
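As a sketch of the kind of guard that was missing, here is a minimal pre-deploy check in Go. This is not Honeycomb's deploy tooling; the binary path, the health-check URL, and the decision to simply abort (where a real pipeline would trigger an automatic rollback) are all hypothetical placeholders.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// Hypothetical placeholders for illustration only.
const (
	binaryPath     = "/srv/app/bin/server"
	healthCheckURL = "http://localhost:8080/healthz"
)

// checkBinary refuses to proceed if the artifact is missing or empty,
// the zero-byte-binary failure mode described above.
func checkBinary(path string) error {
	info, err := os.Stat(path)
	if err != nil {
		return fmt.Errorf("stat %s: %w", path, err)
	}
	if info.Size() == 0 {
		return fmt.Errorf("%s is a zero-byte binary, refusing to deploy", path)
	}
	return nil
}

// checkHealth runs a simple end-to-end check against the newly started
// instance before the rollout is allowed to continue.
func checkHealth(url string) error {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return fmt.Errorf("health check failed: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("health check returned %d", resp.StatusCode)
	}
	return nil
}

func main() {
	if err := checkBinary(binaryPath); err != nil {
		fmt.Fprintln(os.Stderr, "aborting deploy:", err)
		os.Exit(1) // a real pipeline would roll back automatically here
	}
	if err := checkHealth(healthCheckURL); err != nil {
		fmt.Fprintln(os.Stderr, "aborting deploy:", err)
		os.Exit(1)
	}
	fmt.Println("pre-deploy checks passed, continuing rollout")
}
```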
And over the next few weeks, we mitigated every one of those key risks: making sure that our CI system was catching all of these situations, making sure that it never deployed zero-byte files, and making sure that end-to-end checks would succeed before a deployment would continue. Feeling secure and reliable, we were able to resume building.

- So what's ahead for us? Because that clearly isn't the end of our mission. Well, what's ahead for us is continuing to be reliable and scalable, and to lead the observability industry, by giving customers high confidence in us and giving them the features that they need. That means that we need to launch services easily. For instance, for that Refinery service that Danyel talked about, we needed to not just scale up existing microservices but provision new microservices while maintaining confidence in our systems. We also needed to spend less money in order to pass savings on to our customers with a new pricing model. This meant that we needed to adopt spot instances in order to scale up without increasing cost dramatically, as well as introduce ARM64 instances, which offer a lower cost and therefore enable us to offer a good service to our users at reduced prices. We're also going to continue modernizing and refactoring, because continuous integration and delivery are things that all of us are still learning, and new best practices emerge all the time.

But above all else, what we prioritize at Honeycomb is our employees. We want our employees to be able to sleep easily at night, and that means doing retrospectives every time we wake someone up, to make sure that it's not going to happen again in the exact same way. This isn't just something that startups can do. You can do this too, step by step, if you start measuring the right things and improving them where they matter. So we'd encourage you to read more on our blog at honeycomb.io/blog, where we talk about many of these things, including some of the lessons that we learned, and give you peeks behind the scenes at how Honeycomb runs and what our engineering practices are.

So do what we do: understand and control your production environments so you can go faster while maintaining stable infrastructure. Don't eschew risk; instead, manage it, iterate, and always make your systems better. Learn from the past and make your future better. If you're interested in learning more, you can go to honeycomb.io/liz and get a copy of these slides. And as always, thank you for your attention. - Thank you for joining us. (upbeat electronic music)