Austin Parker has been solving - and creating - problems with computers and technology for most of his life. He is the Principal Developer Advocate at LightStep and a maintainer on the OpenTracing and OpenTelemetry projects. His professional dream is to build a world where we're able to create and run more reliable software. In addition to his professional work, he's taught college classes, spoken about all things DevOps and distributed tracing, and even found time to start a podcast. Austin is also the co-author of the forthcoming book Distributed Tracing in Practice, available in early 2020 from O'Reilly Media.
(upbeat music) - Hi, this is Understanding Feature Flag Performance with Observability. My name is Austin Parker. I'm principal developer advocate at LightStep. If you would like to find me on Twitter, and I recommend you do, it's @austinlparker. You can also email me firstname.lastname@example.org if that's how you like to get in touch. Finally, if you're interested in learning more about some of the things I'm gonna talk about today, like distributed tracing, I wrote a book with a couple of my friends called "Distributed Tracing in Practice." You can find it published by O'Reilly wherever books are sold. Let's start out today by talking about observability. It's in the title of the talk, so it might be good to understand a little bit about what this is about. And when I say observability, people will hear a lot of different things. They'll think, oh, I've heard of that, it's when you have logs and metrics and traces. Or, oh, observability, that's the new pizza place down on the corner. Well, what I'm gonna tell you is that observability is a lot of different things, but fundamentally it's two things. One, it's a process, it's a set of processes that you can bring into your organization, bring into how you do software engineering, that help you understand your system. The second thing about observability is that it's not something that you can buy. You can't go pay money and have someone back a truck of observability up to your loading dock and now you've got it. There are a lot of tools that are useful to help you on your observability journey. But really, when you get down to brass tacks, observability is about the people, both the people that are building your system and also the people that are using your system. So a lot of times it's easier to see if you have observability not by the tools you have or the data you're getting, but by looking at the results of what you have. If you can take some arbitrary effect in your application and your distributed system, and then you can navigate backwards from that effect to what caused it, then you have observability. That's the way you can tell if you have the right processes and the right practices. It's a really, really amazing thing once you get your feet wet in it. One question I get a lot, though, is what's the difference between observability and monitoring? Now, there's a quote I like from The New Stack and it goes like this: monitoring as a discipline is a way to predefine normal and sort of freeze everything you're doing around that idea of normal. And I think that's really accurate, because if you think about what you're doing when you're monitoring, you're basically taking a bunch of signals about your application, about your system, and you're saying, well, I think this is what's normal and this is what's not normal. I think that if I have CPU utilization over a certain amount for a certain amount of time, that's bad. Or if I have requests on this API that are over or under a certain threshold for too long, that's also bad. So I'm gonna set an alert, and if that alert goes off, then oops, I'm out of compliance, I need to go remediate something. But if you think about your software and you think about how your software runs, what you'll find is that there are a lot of things that you wanna know. And all of those things blend together and interact in strange ways that maybe you hadn't thought of when you were sitting there building your dashboards. The data that you're looking for is hidden everywhere.
It's hidden in your application logs. It's hidden in application metrics, it's in your infrastructure. It's down in Kubernetes or in your virtual machine log directories. It's also in things that matter maybe to people outside of engineering: it's in the analytics and the session data about how people are moving through your product and using it. It's also in things like feature flags. As you create these feature flags that let you do permutations of your application state, that's also really important information. And as those flags change, and as they turn on or off, that's going to cause unexpected and interesting outcomes in how your software works that you'll need to be aware of. Observability helps with all of this, not by giving you another tool or another dashboard, but by providing a comprehensive and holistic approach to understanding your system. All of your system, not just individual parts. It helps you focus on the necessary cultural and process changes required in order to implement observability across your entire engineering stack. Now, I'm gonna talk about a lot of different things in this presentation, but one thing that's gonna keep coming up as I talk about observability is telemetry. So I want to define a couple of things. The first thing I want to define is tracing. Now, you might've heard of this as distributed tracing or distributed request tracing. And if you look at the slides here, you'll see that a trace is a way to model a request from stem to stern, basically from client to server. Each trace is made up of multiple spans. Spans are a unit of work. They represent the work being done by a service, or part of a service, as it contributes to that overall request. Spans can contain attributes and events, or tags and logs, depending on which particular tracing system you're talking about. The tags are used to help you filter, sort through, and curate your traces. The logs are there to tell you what happened during that trace. They're like a log in any other sort of logging system: they give you information about what happened. I'll show a quick sketch of what this looks like in code in just a moment. So let's keep this in mind, because I'm gonna keep talking about using traces, using spans, things like that throughout the rest of the presentation. And distributed tracing is really a core part of observability, which is why I bring it up. If you wanna have observability, one thing that you're required to have is telemetry data. And that telemetry data can take the form of traces. It can also take the form of metrics or logs. And all of these different forms of telemetry data are really convertible into the other formats. You can take a log file and you can make it into traces. You can take traces and make them into metrics. At LightStep we've built a system that uses distributed tracing to help you understand your system, and we use trace data as a core part of that. We build really powerful analysis features on top of it. And since we're pretty good at reusing what we do, we use LightStep to monitor LightStep. And we also use LightStep to monitor LightStep's usage of feature flags. So we have observability into our feature flags. So let me tell you a little bit about how all that works together.
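Before getting into the feature-flag side, here's a minimal sketch of that span model in code, using the OpenTracing Go API. The span name, tags, and log fields are made up for illustration; the point is just that tags describe the span so you can filter and group traces, while logs record what happened inside it.

```go
package main

import (
	"github.com/opentracing/opentracing-go"
	otlog "github.com/opentracing/opentracing-go/log"
)

func main() {
	// Start a span representing one unit of work inside a request.
	span := opentracing.StartSpan("render-histogram")
	defer span.Finish()

	// Tags describe the span so you can filter, sort, and curate traces later.
	span.SetTag("service.version", "1.4.2")
	span.SetTag("customer.tier", "early-access")

	// Logs record events that happened while the span was active,
	// like a log line in any other logging system.
	span.LogFields(
		otlog.String("event", "cache miss"),
		otlog.String("key", "histogram:latency:p99"),
	)
}
```

In a real service, each incoming request would produce a span like this, and it would be stitched together with the spans from every other service the request touches to form the full trace.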
The key thing about feature flags, as it relates to observability, is I think best captured by a phrase from the modern poet Pusha T: "If you know, you know." And if you know what complexity does to a software system, and you know what unbounded permutations of complexity do to your understanding of a software system, you'll quickly realize, I think, the value of distributed tracing and observability for understanding your feature flags. When you add feature flags, at first you have one or two, and it's not that bad. You've got A and B, and some people are gonna get A, and some people are gonna get B. But as they get more popular and you start adding more and more and more, you get into the situation where now I've got A and B, not C, but yes D, and then also E, G, H, F, foo, bar, baz, so on and so forth. And there are all these complex permutations of your system state that now become extremely unmanageable. With just ten boolean flags you already have over a thousand possible combinations of system state. So let's look at this very simple example where I have a flag foo and a flag bar, and I have this adorable dog. Something has happened to a user, and that user is being served foo and bar. Why is it happening? Maybe I'm getting paged and I need to figure this out at three in the morning. The problem could be in some strange change that was made under the foo flag. But it could also be because something was changed under the bar flag. It could also be some strange interaction between foo and bar that I didn't anticipate, because I was developing foo and bar independently of each other. However, there's more than just foo, bar, and our adorable dog to deal with. There's also Kubernetes, or whatever deployment system I'm using, and whatever underlying infrastructure I'm using. Did something go weird there? I don't know. But that's information that I'm probably gonna need to find out really quickly to understand why I'm having an outage. Beyond even my deployment architecture, the thing that's running my service, I could have the whole ding dang dong cloud to contend with. And not even just the resources that I control, but external APIs, managed services, things that are completely outside of my purview. All those people are probably on the cloud too. They could be having problems, and I could be experiencing the effects of that. One way we like to picture this division between the things you can control and the things you're responsible for is as a pyramid. The stuff that you can control is way up at the top, but you're actually responsible for everything else in that pyramid, because all the things below your service, which sits at the tip top of the pyramid, are things that can impact you. So you have to be aware of all this. You have to be able to really easily say: where in this pyramid of interconnected services, this deep system, is the problem occurring? Observability helps you answer those questions and more. If I'm a developer, observability is gonna help me answer questions like: I'm making changes to an API and I'm putting them behind a feature flag, and I wanna understand what the performance difference is. Am I using more memory or less memory? Did latency go up or down? Did it stay the same? Is it different depending on some other factor, like a particular user or a particular time of day or a particular region that's being accessed? If I'm an SRE, or my job is to maintain reliable systems, then I'm getting paged. I really need to know, in under a minute, where is this problem coming from? How can I cut through all the noise and find the signal that really matters to help me understand why my system has gone out of its intended state?
And if I have feature flags, how can I pin it down to just the change that matters, just the flag that matters? One thing that gets lost a lot of times when we talk about observability, though, is that traditionally people will think, oh, it's just dev and ops, those are the people that care. But there's a third person that cares, or a third group of people that should care, and it's sort of the businessy people. Observability isn't just for programmers, it's not just for SREs, it's also for your PMs and your CTOs and the people that are planning and trying to understand: how much are we using in terms of cloud resources, so we can save money later? Or, hey, we implemented this new feature flag, what's the result? How is this impacting other stuff? Are we getting more conversions? Are we having a better end user experience? Are people happy? Observability maybe can't answer the question of whether people are happy, but it can probably help you answer whether they're happy using your website. So as an observability company, and as a user of feature flags, I wanna actually show you how we're doing this, and how you can start to do this too in your own software, and it's actually pretty straightforward. We started using LaunchDarkly just over a year ago, and on average we have about 50 feature flags in production. We use those in a variety of ways. For features that are in development, in staging, or in pre-production, we will often use feature flags to target just the developers that are working on them. For features that we wanna roll out to our customers, but maybe only to people that are in early access first, we will target a group of customers and roll the feature out to them using feature flags. We've also used it a lot for A/B testing and making sure that things work the way you expect... Like, hey, we rolled out a new tutorial, let's see if it works better with this language or that language. These are just some of the uses of feature flags. Now, one of the challenges of integrating feature flags into your software is, again, needing to know what's happening. At LightStep, we were very fortunate: we already have this shared tracing layer for all of our backend and frontend components. And I wanna go into this a little bit and explain how it all works before I show you some code. So if you look at my little architecture diagram here, I've got this feature flags client, and this client is just a pretty thin wrapper around the LaunchDarkly Go client. The vast majority of our backend is written in Go. In that wrapper, we also bring in a tracing, tagging, and logging library. Then, whenever a service, like our histogram service, which is called Live View, or our API layer, which is called Cruton, wants to evaluate a feature flag, they're actually calling into our client wrapper rather than directly into the LaunchDarkly client. And the reason why is that whenever a flag gets evaluated, we're able to look at the span that is currently happening in Cruton, let's say, and add in appropriate tags and logs to tell us what's happening later on. So this is an example of getting a Boolean flag in our wrapper. I've highlighted some of the details here just to make it fit on a slide, but I wanna point out a couple of things. You should see instances of trace.Log and trace.SetTag. So we're able to do this automatically: the developer doesn't really have to know what's going on, they're just programming like normal.
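To make this concrete, here's a minimal sketch of what a wrapper like that might look like. This isn't LightStep's actual code; it assumes the OpenTracing Go API and the v4-era LaunchDarkly Go server SDK, and the FlagClient and BoolFlag names are just illustrative.

```go
package flags

import (
	"context"

	"github.com/opentracing/opentracing-go"
	otlog "github.com/opentracing/opentracing-go/log"
	ld "gopkg.in/launchdarkly/go-server-sdk.v4"
)

// FlagClient is a thin wrapper around the LaunchDarkly client. Services call
// this instead of LaunchDarkly directly, so every flag evaluation gets
// recorded on whatever span is currently in flight.
type FlagClient struct {
	ld *ld.LDClient
}

// BoolFlag evaluates a boolean flag and annotates the active span with the result.
func (c *FlagClient) BoolFlag(ctx context.Context, key, userKey string, defaultVal bool) bool {
	value, err := c.ld.BoolVariation(key, ld.NewUser(userKey), defaultVal)

	// If there's a span on the context, tag it with the flag's evaluated value
	// and log the evaluation so it shows up when we look at the trace later.
	if span := opentracing.SpanFromContext(ctx); span != nil {
		span.SetTag("feature_flag."+key, value)
		span.LogFields(
			otlog.String("event", "feature_flag_evaluated"),
			otlog.String("flag", key),
			otlog.Bool("value", value),
		)
		if err != nil {
			span.LogFields(otlog.Error(err))
		}
	}
	return value
}
```

The payoff of a pattern like this is that every trace carries the flag state it was served under, so you can filter and group traces by those flag tags when you're trying to work out whether a regression lines up with a flag change.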