(upbeat music) - Hi, this is Understanding Feature Flag Performance with Observability. My name is Austin Parker. I'm principal developer advocate at LightStep. If you would like to find me on Twitter, and I recommend you do, it's @austinlparker. You can also email me firstname.lastname@example.org if that's how you like to get in touch. Finally, if you're interested in learning more about some of the things I'm gonna talk about today, like distributed tracing, I wrote a book with a couple of my friends called "Distributed Tracing in Practice." You can find it published by O'Reilly wherever books are sold.
Let's start out today by talking about observability. It's in the title of the talk, so it might be good to understand a little bit about what this is about. And when I say observability, people will hear a lot of different things. They'll think, oh, I've heard of that. It's when you have logs and metrics and traces. Or, oh, observability, that's the new pizza place down on the corner. Well, what I'm gonna tell you is that observability is a lot of different things. But fundamentally it's two things. One, it's a process; it's a set of processes that you can bring into your organization, bring into how you do software engineering, that help you understand your system. The second thing is that observability is not something that you can buy. You can't go pay money and have someone back a truck of observability up to your loading dock and now you've got it. There are a lot of tools that are useful to help you on your observability journey. But really, when you get down to brass tacks, observability is about the people, both the people that are building your system and also the people that are using your system. So a lot of times it's easier to see if you have observability, not by the tools you have or the data you're getting, but by looking at the results of what you have. So if you can take some arbitrary effect in your application and your distributed system, and then you can navigate backwards from that effect to what caused it, then you have observability. That's the way you can tell if you have the right processes and the right practices. It's a really, really amazing thing once you get your feet wet in it. One question I get a lot, though, is what's the difference between observability and monitoring?
Now, there's a quote I like from The New Stack and it goes like this: monitoring as a discipline is a way you predefine normal and sort of freeze everything you're doing around that idea of normal. And I think that's really accurate, because if you think about what you're doing when you're monitoring, you're basically taking a bunch of signals about your application, about your system, and you're saying, well, I think this is what's normal and this is what's not normal. I think that if I have CPU utilization over a certain amount for a certain amount of time, that's bad. Or if I have requests that are over or under a certain amount on this API for so long, that's also bad. So I'm gonna set an alert, and if that alert goes off, then oops, I'm out of compliance. I need to go remediate something. But if you think about your software and you think about how your software runs, what you'll find is that there's a lot of things that you wanna know. And all of those things blend together and interact in strange ways that maybe you haven't thought of when you were sitting there building your dashboards. The data that you were looking for is hidden everywhere. It's hidden in your application logs. It's hidden in application metrics, it's in your infrastructure. It's down in Kubernetes or in your virtual machine log directories. It's also things that matter maybe to people outside of engineering. It's in the analytics and the session data about how people are moving through your product and using it. It's also in things like feature flags. As you are creating these feature flags that let you do permutations of your application state, that's also really important information. And as those flags change, and as they turn on or off, that's going to cause unexpected and interesting outcomes in how your software works that you'll need to be aware of.
Observability helps with all of this, not by giving you another tool or another dashboard, but by providing a comprehensive and holistic approach to understanding your system. All of your system, not just individual parts. It helps you focus on the necessary cultural and process changes required in order to implement observability across your entire engineering stack.
Now, I'm gonna talk about a lot of different things in this presentation. But one thing that's gonna keep coming up as I talk about observability is telemetry. So I want to define a couple of things. The first thing I want to define is tracing. Now, you might've heard of this as distributed tracing or distributed request tracing. And if you look at the slides here, you'll see that a trace is some way to model a request from stem to stern, basically, from client to server. Each trace is comprised of multiple spans. Spans are a unit of work. They represent the work being done by a service or part of a service as it contributes to that overall request. Spans can contain attributes and events, or tags and logs, depending on which particular tracing system you're talking about. The tags are used to help you filter and sort through and curate your traces. The logs are more to tell you what happened during that trace. They're like a log in any other sort of logging system. They give you information about what happened. So let's keep this in mind, because I'm gonna keep talking about using traces, using spans, things like that throughout the rest of the presentation. And distributed tracing is really a core part of observability, which is why I bring it up. If you wanna have observability, one thing that you're required to have is telemetry data. And that telemetry data can take the form of traces. It can also take the form of metrics or logs. And all of these different forms of telemetry data are really convertible into one another. You can take a log file and you can make it into traces. You can take traces and make them into metrics.
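To make the vocabulary above concrete, here is a minimal sketch of a trace as a set of spans, each carrying tags for filtering and logs for describing what happened. The type names, field names, and the FilterByTag helper are illustrative assumptions, not the data model of any particular tracing system.

```go
package main

import (
	"fmt"
	"time"
)

// LogEntry is a timestamped event recorded while a span was active.
type LogEntry struct {
	At      time.Time
	Message string
}

// Span is one unit of work done by a service on behalf of a request.
// All field names here are hypothetical.
type Span struct {
	TraceID   string // shared by every span in the same request
	SpanID    string
	ParentID  string // empty for the root span
	Service   string
	Operation string
	Duration  time.Duration
	Tags      map[string]string // used to filter and sort traces
	Logs      []LogEntry        // describe what happened during the span
}

// Trace groups the spans that belong to one request.
type Trace struct {
	ID    string
	Spans []Span
}

// FilterByTag returns the spans whose tag `key` equals `value`.
func (t Trace) FilterByTag(key, value string) []Span {
	var out []Span
	for _, s := range t.Spans {
		if s.Tags[key] == value {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	trace := Trace{
		ID: "trace-1",
		Spans: []Span{
			{TraceID: "trace-1", SpanID: "a", Service: "client", Operation: "GET /checkout",
				Duration: 120 * time.Millisecond,
				Tags:     map[string]string{"region": "us-east"}},
			{TraceID: "trace-1", SpanID: "b", ParentID: "a", Service: "api", Operation: "checkout",
				Duration: 90 * time.Millisecond,
				Tags:     map[string]string{"region": "us-east", "error": "true"},
				Logs:     []LogEntry{{At: time.Now(), Message: "payment provider timed out"}}},
		},
	}
	// Tags narrow the search; logs explain what happened on the matching span.
	for _, s := range trace.FilterByTag("error", "true") {
		fmt.Println(s.Service, s.Operation, s.Duration, s.Logs[0].Message)
	}
}
```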
At LightStep we've built a system that uses distributed tracing to help you understand your system. And we use trace data as a core part of that. We build really powerful analysis features on top of it. And since we're pretty good at reusing what we do, we use LightStep to monitor LightStep. And we also use LightStep to monitor LightStep's usage of feature flags. So we have observability into our feature flags. So let me tell you a little bit about how all that works together. The key thing about feature flags when it relates to observability is, I think, very much summed up in this line from the modern poet Pusha T, which is "If you know, you know." And if you know what complexity does to a software system, and you know what unbounded permutations of complexity do to your understanding of a software system, you'll quickly realize, I think, the value of distributed tracing and observability for understanding your feature flags. When you add feature flags, at first you have one or two. And it's not that bad. You've got A and B, and some people are gonna get A, and some people are gonna get B. But as they get more popular and you start adding more and more and more, you start to get into the situation where now I've got A and B, not C, but yes, D, and then also E, G, H, F, foo, bar, baz, so on and so forth. And there are all these complex permutations of your system state that now become extremely unmanageable. So let's look at this very simple example where I have a flag foo and a flag bar, and I have this adorable dog. Something has happened to a user and that user is being served foo and bar. Why is it happening? Maybe I'm getting paged. I need to figure this out at three in the morning. So the problem could be in some strange change that was made under the foo flag. But it could also be because something was changed under the bar flag.
It could also be some strange thing from the combination of foo and bar that I didn't realize because I was developing foo and bar independently of each other. However, there's more than just foo, bar and our adorable dog to deal with.
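The growth in system states described above can be sketched with a few lines of Go. This assumes every flag is an independent boolean, which is a simplification: n such flags give 2^n possible on/off combinations. The Combinations function is a hypothetical helper written for this illustration.

```go
package main

import "fmt"

// Combinations returns every on/off assignment for the given boolean flags.
// With n flags there are 2^n assignments.
func Combinations(flags []string) []map[string]bool {
	total := 1 << len(flags)
	out := make([]map[string]bool, 0, total)
	for mask := 0; mask < total; mask++ {
		state := make(map[string]bool, len(flags))
		for i, name := range flags {
			// Bit i of mask decides whether flag i is on in this combination.
			state[name] = mask&(1<<i) != 0
		}
		out = append(out, state)
	}
	return out
}

func main() {
	// Two flags already give four states to reason about.
	for _, state := range Combinations([]string{"foo", "bar"}) {
		fmt.Println(state)
	}
	// Adding a third flag doubles that again.
	fmt.Println(len(Combinations([]string{"foo", "bar", "baz"})))
}
```

Under the same assumption, 50 independent boolean flags would allow 2^50 combinations, which is more than 10^15, so enumerating states ahead of time stops being an option very quickly.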
There's also Kubernetes or whatever deployment system I'm using. Whatever underlying infrastructure I'm using. Did something go weird there? I don't know. But that's information that I'm probably gonna need to find out really quickly to understand why I'm having an outage. Beyond even my deployment architecture, the thing that's running my service, I could have the whole ding dang dong cloud to contend with. And not even just the resources that I control, but external APIs, managed services. Things that are completely outside of my purview. All those people are probably on the cloud too. They could be having problems and I could be experiencing that. We like to picture this idea, that there's a division between the things you can control and the things you're responsible for, as a pyramid. And the stuff that you can control is way up here, but you're actually responsible for everything else in that pyramid, because all the things below your service, here at the tip top of the pyramid, are things that can impact you. So you have to be aware of all this. You have to be able to really easily say, where in this pyramid of interconnected services, this deep system, is the problem occurring? Observability helps you answer those questions and more. If I'm a developer, observability is gonna help me answer questions like: I'm making changes to an API and I'm putting it behind a feature flag. I wanna understand what the performance difference is. Am I using more memory or less memory? Did latency go up or down? Did it stay the same? Is it different depending on some other factor, like a particular user or a particular time of day or a particular region that's being accessed? If I'm an SRE, or my job is to maintain reliable systems, then I'm getting paged. I really need to know in under a minute, where is this problem coming from?
How can I cut through all the noise and find the signal that really matters to help me understand why my system has gone out of its intended state? And if I have feature flags, how can I pin it down to just the change that matters, just the flag that matters? One thing that gets lost a lot of times when we talk about observability, though, is that traditionally people will think, oh, it's just dev and ops. Those are the people that care. But there's a third person that cares, or a third group of people that should care, and it's sort of the businessy people.
Observability isn't just for programmers, it's not just for SREs, it's also for your PMs and your CTOs and your people that are planning and trying to understand, how much are we using in terms of cloud resources so we can save money later? Or, hey, we implemented this new feature flag. What's the result? How is this impacting other stuff? Are we getting more conversions? Are we having a better end user experience? Are people happy? Observability maybe can't answer the question of if people are happy, but it can probably help you answer if they're happy using your website. So as an observability company, and as a user of feature flags, I wanna actually show you how we're doing this, and how you can start to do this too in your own software, and it's actually pretty straightforward. So we started using LaunchDarkly just over a year ago. And on average, we have about 50 feature flags in production. We use those in a variety of ways. For features that are in development in staging or pre-production, we will often use feature flags to target just the developers that are working on them. For features that we wanna roll out to our customers, but maybe just to people that are in early access, we'll target a group of customers and we will roll it out to them using feature flags. We've also used it a lot for A/B testing and making sure that things work right when you expect... Like, hey, we rolled out a new tutorial. Let's see if it works better with this language or that language. These are just some of the uses of feature flags. Now, one of the challenges of integrating feature flags into your software is, again, needing to know what's happening.
At LightStep, we were very fortunate. We already have this shared tracing layer for all of our backend and front end components. And I wanna go into this a little bit and explain how it all works before I show you some code. So if you look at my little architecture diagram here, I've got this feature flags client, and this client is just a pretty thin wrapper around the LaunchDarkly Go client. The vast majority of our backend is written in Go. In that wrapper, we also bring in a tracing, tagging and logging library. Then whenever a service, like our histogram service, which is called Live View, or our API layer, which is called Cruton, whenever they wanna evaluate a feature flag, they're actually calling into our client wrapper rather than directly to the LaunchDarkly client. And the reason why is that whenever a flag gets evaluated, we're able to look at the span that is currently happening in Cruton, let's say, and add in appropriate tags and logs to tell us what's happening later on. So this is an example of getting a Boolean flag in our wrapper. I've highlighted some of the details here just to make it fit on a slide, but I wanna point out a couple of things. You should see an instance of trace dot log, trace dot SetTag. So we're able to do this automatically, without the developer having to really know what's going on. They're just programming like normal.
They say, oh, I need to get a feature flag. So I'll get a Boolean flag by this name. Our tracing system is automatically looking and saying, oh, okay, well, I'm gonna log if something goes wrong. If I can't get an instance of the LaunchDarkly client, I'm gonna log that. And then when it actually evaluates the flag, we set two interesting things. We set both a well-known tag called feature flag, and then we add the flag name to that tag as a value.
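The slide with the real wrapper is not part of this transcript, so what follows is only a rough sketch of the pattern just described: a thin wrapper that evaluates a Boolean flag, logs to the current span if the client is missing or evaluation fails, and tags the span with the flag name and its evaluated value. The Span and FlagClient interfaces are hypothetical stand-ins, not the LaunchDarkly Go SDK or any tracing library's API, and the tag keys "feature_flag" and "flag.<name>" are assumed spellings.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// Span is a hypothetical stand-in for the currently active tracing span.
type Span interface {
	SetTag(key, value string)
	Log(message string)
}

// FlagClient is a hypothetical stand-in for a feature flag SDK client.
type FlagClient interface {
	BoolVariation(flag, userID string, fallback bool) (bool, error)
}

// Flags is the thin wrapper that services call instead of the SDK client.
type Flags struct {
	Client FlagClient
}

// GetBool evaluates a flag and records the evaluation on the span.
func (f *Flags) GetBool(span Span, flag, userID string, fallback bool) bool {
	value := fallback
	if f == nil || f.Client == nil {
		span.Log("feature flag client unavailable, using fallback for " + flag)
	} else if v, err := f.Client.BoolVariation(flag, userID, fallback); err != nil {
		span.Log("feature flag evaluation failed for " + flag + ": " + err.Error())
	} else {
		value = v
	}
	// Assumed tag keys: one well-known tag naming the flag, one with its value.
	span.SetTag("feature_flag", flag)
	span.SetTag("flag."+flag, strconv.FormatBool(value))
	return value
}

// recordingSpan and staticClient are in-memory fakes for this example only.
type recordingSpan struct {
	tags map[string]string
	logs []string
}

func newRecordingSpan() *recordingSpan            { return &recordingSpan{tags: map[string]string{}} }
func (s *recordingSpan) SetTag(key, value string) { s.tags[key] = value }
func (s *recordingSpan) Log(message string)       { s.logs = append(s.logs, message) }

type staticClient map[string]bool

func (c staticClient) BoolVariation(flag, userID string, fallback bool) (bool, error) {
	v, ok := c[flag]
	if !ok {
		return fallback, errors.New("unknown flag")
	}
	return v, nil
}

func main() {
	flags := &Flags{Client: staticClient{"new-background": false}}
	span := newRecordingSpan()
	enabled := flags.GetBool(span, "new-background", "user-123", true)
	fmt.Println(enabled, span.tags, span.logs)
}
```

One design note on this sketch: a single "feature_flag" tag is overwritten if several flags are evaluated on the same span, which is why the per-flag "flag.<name>" tag is also written.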
Well, we've got all this wonderful information. We can use LightStep to track it down and we can use it to evaluate what our flags are doing. We can discover interesting insights and discover why a feature might be performing well or poorly for a given user. So in this screenshot here, I'm actually analyzing the tail latency of a service called historian under a flag named Enabled lightweight string creation. So this is what powers the key operations feature on LightStep and lets us create basically time series graphs of traffic through your APIs or endpoints. Now I can look for cases where this is always true. And I can then use other information, other tags that are on those spans, in order to come up with interesting and potentially useful information for me, if I'm trying to debug an issue or I'm trying to respond to an incident. So in this case, I can look at our correlations and see that one particular user ID is heavily correlated, or one particular project ID is heavily correlated, with people that have this flag enabled or accounts that are under this flag. And you can see it in the analysis view there where, yeah, there's some pretty... 28, 17 seconds. There's some people that are taking a while for that endpoint to come back. So I can not only say, hey, this one particular user with this flag is having a really bad time sometimes, it also lets me say all these other people probably aren't, and that's also useful information. If I'm already investigating something, if I'm already looking at a specific trace, then it's easy for me to see what's evaluated in real time by looking at the details of my span. So here you can see what I was talking about earlier with the flag evaluation, where not only do I have a feature flag tag, I also have the value of that tag, I'm sorry, the value of that flag, as it was evaluated at the time the span was created. Now, you might be saying to yourself, wow, that's all really cool stuff.
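The kind of question asked in that screenshot (for spans where a flag is on, which project has the worst tail latency?) can be approximated with a simple group-and-percentile pass over span data. This is a stand-in sketch, not LightStep's correlation analysis: the Span shape, the tag keys, and the nearest-rank percentile are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Span is a hypothetical, flattened view of a finished span.
type Span struct {
	DurationMs int64
	Tags       map[string]string
}

// Percentile returns the nearest-rank p-th percentile (0 < p <= 100).
func Percentile(values []int64, p float64) int64 {
	if len(values) == 0 {
		return 0
	}
	sorted := append([]int64(nil), values...) // copy so the input is not reordered
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// TailLatencyByGroup keeps spans where flagKey == flagValue, groups them by
// the value of groupKey, and returns the p-th percentile latency per group.
func TailLatencyByGroup(spans []Span, flagKey, flagValue, groupKey string, p float64) map[string]int64 {
	byGroup := map[string][]int64{}
	for _, s := range spans {
		if s.Tags[flagKey] != flagValue {
			continue
		}
		group := s.Tags[groupKey]
		byGroup[group] = append(byGroup[group], s.DurationMs)
	}
	out := make(map[string]int64, len(byGroup))
	for group, durations := range byGroup {
		out[group] = Percentile(durations, p)
	}
	return out
}

func main() {
	spans := []Span{
		{DurationMs: 120, Tags: map[string]string{"flag.example": "true", "project_id": "a"}},
		{DurationMs: 17000, Tags: map[string]string{"flag.example": "true", "project_id": "a"}},
		{DurationMs: 95, Tags: map[string]string{"flag.example": "true", "project_id": "b"}},
		{DurationMs: 80, Tags: map[string]string{"flag.example": "false", "project_id": "c"}},
	}
	// One project standing out here would be the cue to look at its traces.
	fmt.Println(TailLatencyByGroup(spans, "flag.example", "true", "project_id", 99))
}
```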
How can I do that? Well, the good news is it's actually pretty straightforward.
But I wanna show you what it looks like in LightStep. So in this case, we can see our set BG color span, and then we can see flag dot new background was set in LightStep to false, because at this point I had turned it to false. And you can do that for every single request. You'll see that for every single request that comes through. You would see, what was the flag evaluated to? What was the name of the flag? And the options are really endless from there. This is a very trivial example, obviously. The difference between this and something maybe more complex is maybe more complex than I would like it to be. But a lot of that comes down to feature flags: they are what you do with them.
If you have some really cool, complicated feature flags set, then it's going to take some more work maybe to instrument that for observability. So what I wanna leave you with is a couple of closing thoughts and ideas on how you can use observability, what you can do with it, and where you should go from here. So if you do go down this path and you start adding observability to your feature flags, then you should definitely remember to add in information about your users to your spans. So either the user ID, the group ID, probably both. You wanna be consistent across all of your telemetry. So a user should be a user, should be a user, or you should have a way to clearly demarcate that this user ID is their LaunchDarkly user ID. And the reason you wanna do that is so that you can really easily correlate these between LaunchDarkly and whatever trace viewer you're using. In a perfect world, this is all the same thing, and you don't have a different type of user ID for different types of users. But it's very handy to be able to go from, let's say, a span and see like user ID foo, bar, baz, and then go into some other tool and look up foo, bar, baz and it's like, ah, this is the same foo, bar, baz in both places. The second big takeaway is don't silo your tools. Don't think of feature flags and observability as something that's just for your devs or just for your SREs or whatever. Really sit down, if you're planning this out, sit down and think, how can I build tools, how can I get tools that are maximally useful for most people? How is this something that I can... I shouldn't have to have this really complex insider knowledge. I shouldn't need to know all of these intricate details of my application, of how feature flags work. I should be able to communicate this to people that maybe aren't in the code every day and have them be able to go in and also get insights that are useful for their job.
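For the first takeaway above, consistent user identity across telemetry, one possible approach is to define the tag keys once and build every span's user tags through a single helper. The key names ("user.id", "group.id", "launchdarkly.user_id") and the User type are hypothetical choices for this sketch, not a convention from the talk.

```go
package main

import "fmt"

// Tag keys are defined once so every service spells them the same way.
// These names are assumptions for illustration.
const (
	TagUserID     = "user.id"
	TagGroupID    = "group.id"
	TagFlagUserID = "launchdarkly.user_id"
)

// User carries the identifiers worth attaching to telemetry.
type User struct {
	ID         string
	GroupID    string
	FlagUserID string // set only if the flag service knows this user by another ID
}

// UserTags returns the tags to attach to any span for this user.
func UserTags(u User) map[string]string {
	tags := map[string]string{
		TagUserID:  u.ID,
		TagGroupID: u.GroupID,
	}
	// Demarcate the flag service's ID only when it differs from the main one.
	if u.FlagUserID != "" && u.FlagUserID != u.ID {
		tags[TagFlagUserID] = u.FlagUserID
	}
	return tags
}

func main() {
	fmt.Println(UserTags(User{ID: "foo", GroupID: "team-1"}))
	fmt.Println(UserTags(User{ID: "foo", GroupID: "team-1", FlagUserID: "ld-foo"}))
}
```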
Finally, flags can really help you to narrow down user experience and really understand how people are using your software by letting you segment and profile people. But observability gives you that last mile of understanding how your changes are really impacting them. And that I think is super critical, because if you're not developing software and you're not running software with your end user in mind, whoever they are, not just the person that's maybe sitting there at their computer using your site or sitting on their phone using your tools, but the people that are relying on your software to work, your colleagues, other people at your company, who knows. There's a lot of people in the world that use all of our software. And at the end of the day, those are the people that we need to serve with observability. So in conclusion, feature flags plus observability plus you equals happy users. Thank you very much. It's been great talking to you all. (upbeat music)