Vatasha is a software engineer at LaunchDarkly that focuses primarily on the backend. She attended Smith College and double majored in Computer Science and Engineering. Outside of work, she can be found stomping some arrows while playing Dance Dance Revolution.
(upbeat music) - Hi, I'm Vatasha. I'm a Software Engineer at LaunchDarkly. I'm super excited to share my talk about LaunchDarkly's architecture and the various components that have helped us scale. I'm gonna be talking to you about some of our metrics that show our scale. In 2013, we had 203 billion flag evaluations daily and now we have nearly 6 trillion daily flag evaluations. We've seen almost a 30X increase in daily flag evaluations. Last year, around this time, we were at 8.9 million client connections, which is a 35% increase year over year. Server connections have remained roughly the same and we're ingesting 19 terabytes of data daily. I'm gonna talk about four components of our architecture that have allowed us to scale. First, let's start with our streaming architecture. So why do we even do this? Well, we need to serve flag values to hundreds of millions of devices, where each device only sees the final flag value for that user to protect our customers' internal logic. In this diagram, you can see client-side. SDKs receiving updates from the streaming API, when a flag value changes. In the beginning, we didn't use a streaming architecture. We use polling instead. Polling was a viable solution, as our network traffic was much lower. So we didn't worry about overwhelming our evaluation API. In this diagram, you can see what polling looks like. The client-side SDK periodically calls the evaluation API for flag updates. The issue with this is that you're only receiving updates as often as you poll. So for flag value was changed five seconds ago and you poll for updates every minute, you're getting the new flag value at least 55 seconds late. So in order to receive updates, as often as they happen, we introduce pings. Which is outlined in the diagram. Pings tell the SDK to check for updates from the same API on demand. This was problematic because this told all clients to make a request for the updated information. If a customer had a million clients then all we're trying to poll to get the updated information at the same time. This is where we shot ourselves in the foot and overwhelm the evaluation API. This is real life footage of our evaluation service, when a million clients asked for updates at once. So to not overwhelm the evaluation API and to deliver updates on demand, we have a streaming service that does the evaluation and sends the new flag value back to the SDK. The stream service called streamer provides isolated entry points for client-side SDKs and server-side SDKs. When a flag is updated, the web application sends a message to a message broker. The streaming service called streamer transforms the message, the message broker receives. The message is then sent to client-side SDKs for whatever client is subscribed to listen to that publisher. As a result of the transformation, we do not send raw flag values to the client-side SDKs. As a result of our updated architecture, we're able to accomplish the following. Handle millions of connected clients at scale. Protect our customers' business logic. And propagate flat changes in 200 milliseconds. Next, I'm gonna be talking about summary flag events. So how do we capture the exact count of evaluations to each value for each individual flag? We use summary events. Here's a sample payload of feature events before summary events. Here's a sample payload of summary events. We're able to convey the same information using half the bites. Summary events are an aggregated count of flag evaluations. So let's take an estimate and say that each feature event has a size of 150 bytes. And there's an estimate of 250 bites per summary event payload. Let's imagine we have a thousand feature events. A thousand feature events would have a size of 150 kilobytes. Which is gonna be 600X bigger than a single summary event to represent the same information. Early on we talked about how we're evaluating 6 trillion flags daily. 6 trillion flags daily is about 900 terabytes of data. If we need to ingest 900 terabytes of data in a day, that would require us to ingest 10.4 gigabytes per second. Which is a lot. This is real-life footage of Kinesis trying to process 10.4 gigabytes per second. Next, I'm gonna be talking to you about analytics at scale and some of the things we do. So how do we capture the exact count of evaluations to each value for each individual flag? Events. Analytic events allow us to power many features of the application, such as the debugger, insights, flag statuses, customer metric, autocomplete of rules, experimentation and data export. The diagram above shows how use events to power the insights. The SDK sends events to the event API, which is called a event-recorder. Event-recorder sends the events to a Kinesis data stream. We then have several Lambdas that consume that stream. In Kinesis each record-event in our cases ordered, so we have the ability to replay data if we need to. Some common use cases for a replay would be if we shipped a bug that messed up how we store data or we could use the data to help debug a customer issue. With that being said, ER, event-recorder ingest up to 40 trillion events daily, which is a ton. That's why summary events are so critical for us. Next I'm gonna talk about streamers around the world. So how do we decrease stream init times in Europe and Asia? We expect stream init times to take less than 500 milliseconds and they were taking longer in these regions. We deployed streamer and our two caches in two new regions to decrease stream init times. In this diagram, you can see what that new deployment looks like. The end result of this initiative resulted in 90% decrease in stream init times for Asia and 50% decrease in stream init times for Europe. This diagram shows stream init times in our five stores regions before and after the project. You can see a substantial decrease in most regions around starting around like October, when we turn the flag on. The power really comes from the combination of all four of these components. We moved from polling architecture to a streaming architecture and this allows us to propagate flag updates in 200 milliseconds. We're batching and summarizing events and that allows us to ingest 23X of events daily. Events are a critical piece of our service that allow us to support data export and experimentation amongst others and deploying in two additional regions had hug performance wins for stream initialization times. So what's next? LaunchDarkly goes multi-region to move our offering closer to customers and provide additional resiliency. One of the core pieces of this is moving to a distributed database. Moving to a distributed database, moves the data closer to customers and we won't need to go across the globe to fetch flat configurations. This will also decrease stream initialization times and overall increase performance of the application. We will continue to keep our caches warm, but we will be less reliant on them. We're also gonna be adding support for large segments to make them more performant and easier to use. Segments are used to help you target a specific group of users that meet a certain criteria. For example, one could use a segment to target all users and a specific location. Today, segments have a limit on the number of targets and we're going to make this essentially limitless. This unlocks other capabilities, such as letting users customize their experience on the application and make things like dark mode possible. Thank you for your time. (upbeat music)