LaunchDarkly's Evolution from Polling to Streaming
When building software, teams have to make architectural and design decisions, but the decisions you make initially aren't etched in stone. As your company grows, those decisions may need to change. At LaunchDarkly, whether to build a polling or streaming architecture for flag evaluations is one of those decisions, and it has evolved over our six years.
Today, we evaluate over 3 trillion flags a day. When we started, it was slightly lower. As we have grown, how we provide updated flag information has evolved. In this article, we share how we moved from a polling architecture to a streaming architecture and how we addressed the build vs. buy question.
But first, some definitions. Polling consists of clients asking the server for new information at a fixed delay. With streaming, the server pushes information to connected clients when content has changed. Web applications were originally strictly polling. The client would request an item. The server would send it. As applications became more real-time, the need for pushing content to clients arose.
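To make the distinction concrete, here is a minimal in-memory sketch in Python. The `FlagServer` class and its method names are hypothetical and exist only for illustration; they are not LaunchDarkly's API, and no real network is involved.

```python
from typing import Callable, Dict, List

Flags = Dict[str, bool]


class FlagServer:
    """Toy in-memory stand-in for a flag service (hypothetical)."""

    def __init__(self) -> None:
        self.flags: Flags = {}
        self.requests_served = 0
        self._subscribers: List[Callable[[Flags], None]] = []

    # Polling: the client asks and the server answers, changed or not.
    def get_flags(self) -> Flags:
        self.requests_served += 1
        return dict(self.flags)

    # Streaming: the client registers once and the server pushes on change.
    def subscribe(self, on_update: Callable[[Flags], None]) -> None:
        self._subscribers.append(on_update)
        on_update(dict(self.flags))  # initial snapshot

    def set_flag(self, key: str, value: bool) -> None:
        self.flags[key] = value
        for notify in self._subscribers:
            notify(dict(self.flags))


def poll(server: FlagServer, times: int) -> List[Flags]:
    """Simulate `times` ticks of a fixed-interval polling loop."""
    return [server.get_flags() for _ in range(times)]


server = FlagServer()
poll(server, 5)                      # five requests, even though nothing changed
received: List[Flags] = []
server.subscribe(received.append)    # one registration
server.set_flag("new-checkout", True)
print(server.requests_served, len(received))
```

In the polling half, the server does work on every tick whether or not anything changed. In the streaming half, the client hears from the server only on connect and on change.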
HTTP/2 introduced server push, and more recently there has been a proposal for DNS to move to a subscription-based method of getting updates as opposed to polling. The reasons for these changes include delivering more timely updates with reduced CPU usage and lower network traffic. These are some of the very reasons we moved from polling to streaming, but I'm getting ahead of myself here.
In the beginning, there was polling
Polling was the method of sending updates we initially deployed. Our daily flag evaluations weren't high, and there wasn't much worry about overwhelming our servers with network traffic due to polling (that would come later). Another benefit of polling is that if a client misses an update, it gets it the next time it polls. Polling was the most straightforward approach that worked, and it enabled us to begin offering our feature management platform.
However, polling comes with some drawbacks, and as we grew, we needed to resolve them. When you poll, you need to decide how frequently to poll and weigh the trade-offs between polling too frequently and not often enough. Each poll requires additional round trips: TCP connection setup, the SSL/TLS handshake, and the data transfer itself. Set to run too frequently, these polls can drain the batteries of mobile devices and increase bandwidth consumption. Set to run less frequently, they run the risk of leaving users with stale data.
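A quick back-of-the-envelope calculation shows how sharp this trade-off is. The client count and intervals below are illustrative assumptions, not LaunchDarkly figures, and `polls_per_day` is a hypothetical helper.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def polls_per_day(clients: int, interval_seconds: int) -> int:
    """Total requests per day if every client polls on a fixed interval."""
    return clients * SECONDS_PER_DAY // interval_seconds


# Assumed fleet of one million clients at three candidate intervals.
for interval in (5, 30, 300):
    total = polls_per_day(1_000_000, interval)
    print(f"poll every {interval:>3}s: {total:>14,} requests/day, "
          f"updates can be up to {interval}s stale")
```

Every step toward fresher data multiplies the request volume, and almost all of those requests return nothing new.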
At LaunchDarkly, we use a CDN. As we grew, the costs associated with frequent polling increased. It was no longer feasible for us to have clients poll for updates – we had to find an alternative.
Streaming: build vs. buy
Knowing we had to switch to a streaming architecture was step one. The next step was deciding whether to build or buy. At this stage in our growth, we opted to buy and partnered with a third-party provider. The main benefit of using a third party was not having to divert engineering staff to build this functionality. At the time, the engineering team was four people, with no one dedicated to infrastructure/DevOps. We also didn't need to invest in infrastructure changes and upgrades to ensure every component supported long-lived connections. We were able to move to a streaming architecture and focus on building other core features of LaunchDarkly.
But as we continued to grow in size, issues began to arise, and we realized we had once again outgrown the architecture. At this point, we decided to build our own streaming architecture.
Initially, it was a hybrid model. We needed to treat our server-side SDKs differently from our client-side SDKs, because there were concerns about sending proprietary information like targeting rules and logic to devices not in the control of our customers. For our server-side SDKs, we built a flag-update stream, which is still in use today. For client-side SDKs (mobile, desktop, browser applications), we initially built ping streams. We would send a message to the stream anytime a flag or the environment changed. This told all listening clients to make a poll request for the updated information. LaunchDarkly would then evaluate the flags for each individual client, ensuring we weren't sending proprietary information.
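The ping-stream flow can be sketched as follows. This is a simplified in-memory model under stated assumptions: the class names, the method names, and the "set of targeted user keys" rule format are all hypothetical, chosen only to show that the ping carries no flag data and that the rules stay on the server.

```python
from typing import Callable, Dict, Iterable, List, Set


class FlagService:
    """Toy service: holds targeting rules, pings clients on change (hypothetical)."""

    def __init__(self) -> None:
        # flag key -> user keys the flag is on for (a stand-in for real rules)
        self._rules: Dict[str, Set[str]] = {}
        self._listeners: List[Callable[[], None]] = []
        self.poll_requests = 0

    def connect_ping_stream(self, on_ping: Callable[[], None]) -> None:
        self._listeners.append(on_ping)

    def update_rule(self, flag_key: str, targeted_users: Iterable[str]) -> None:
        self._rules[flag_key] = set(targeted_users)
        for on_ping in list(self._listeners):
            on_ping()  # the ping is empty: "something changed, come and ask"

    def evaluate_for(self, user_key: str) -> Dict[str, bool]:
        """Server-side evaluation: only results leave the service, never rules."""
        self.poll_requests += 1
        return {flag: user_key in users for flag, users in self._rules.items()}


class ClientSDK:
    """Toy client-side SDK that polls once per ping (hypothetical)."""

    def __init__(self, service: FlagService, user_key: str) -> None:
        self.service = service
        self.user_key = user_key
        self.flags: Dict[str, bool] = {}
        service.connect_ping_stream(self._on_ping)

    def _on_ping(self) -> None:
        self.flags = self.service.evaluate_for(self.user_key)


service = FlagService()
alice = ClientSDK(service, "alice")
bob = ClientSDK(service, "bob")
service.update_rule("new-checkout", ["alice"])
print(alice.flags, bob.flags, service.poll_requests)
```

Note that a single rule change triggers one poll request per connected client.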
The ping-stream model significantly reduced the chattiness of polling. However, it wasn't immune from challenges. When customers had large numbers of clients connected, the clients would all get the ping simultaneously and, in turn, all request updates at once. If one million connected clients all attempted to poll for an update at once, our servers would be overwhelmed. We had created the ability to potentially DDoS ourselves. Oops.
We introduced a random jitter to delay when clients would send a request for an update. While this solved one problem, it introduced another. It was taking longer to deliver updates to all users. In an ideal world, it would be useful to send already evaluated flags to clients instead of using a ping-stream model.
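The effect of jitter, and its cost, can be seen in a small simulation. This is a sketch under assumptions: the 10,000 clients, the 10-second window, and the uniform distribution are illustrative choices, not the values LaunchDarkly used.

```python
import random
from collections import Counter
from typing import List


def jittered_delays(n_clients: int, max_jitter_seconds: float, seed: int = 0) -> List[float]:
    """Each client waits a uniformly random delay before polling."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, max_jitter_seconds) for _ in range(n_clients)]


def peak_requests_per_second(delays: List[float]) -> int:
    """Bucket requests by whole second and return the busiest bucket."""
    buckets = Counter(int(delay) for delay in delays)
    return max(buckets.values())


no_jitter = [0.0] * 10_000
with_jitter = jittered_delays(10_000, max_jitter_seconds=10.0)

print(peak_requests_per_second(no_jitter))    # 10000: everyone polls in the same second
print(peak_requests_per_second(with_jitter))  # roughly a tenth of that
print(max(with_jitter))                       # the slowest client waits almost the full window
```

Widening the jitter window lowers the peak load but raises the worst-case delivery time by the same factor, which is exactly the trade-off described above.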
The notion of streaming evaluated flags to clients was scary as we have millions of client connections. To achieve this, we needed to keep the users' context in memory for the life of the stream. We categorized this idea under “It'll never work, but it would be great if it could.”
But then one of our engineers decided to investigate the size, memory constraints, and CPU. The analysis revealed it was realistically possible. This led to a conceptual plug-in and a lot of load testing to ensure we could handle the overhead. Today we have personalized flag streams, specific to individual users connecting to our service. This methodology allows us to send flag updates more efficiently and faster.
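Conceptually, a personalized flag stream means the server holds each user's context for the lifetime of the connection and pushes already-evaluated values when a rule changes. The sketch below models that idea in memory. The `PersonalizedStreamServer` class, the predicate-based rule format, and the context fields are hypothetical and are not a description of LaunchDarkly's implementation.

```python
from typing import Callable, Dict, List, Tuple

Context = Dict[str, str]
Rule = Callable[[Context], bool]
Send = Callable[[Dict[str, bool]], None]


class PersonalizedStreamServer:
    """Toy server that streams evaluated flags per connection (hypothetical)."""

    def __init__(self) -> None:
        self._rules: Dict[str, Rule] = {}
        # One entry per open stream; the context stays in memory while it is open.
        self._streams: List[Tuple[Context, Send]] = []

    def _evaluate(self, context: Context) -> Dict[str, bool]:
        return {key: rule(context) for key, rule in self._rules.items()}

    def open_stream(self, context: Context, send: Send) -> None:
        self._streams.append((context, send))
        send(self._evaluate(context))  # initial snapshot of evaluated flags

    def update_rule(self, flag_key: str, rule: Rule) -> None:
        self._rules[flag_key] = rule
        # Push the evaluated result to every stream; no follow-up poll is needed.
        for context, send in self._streams:
            send({flag_key: rule(context)})


server = PersonalizedStreamServer()
pro_messages: List[Dict[str, bool]] = []
free_messages: List[Dict[str, bool]] = []
server.open_stream({"key": "u1", "plan": "pro"}, pro_messages.append)
server.open_stream({"key": "u2", "plan": "free"}, free_messages.append)
server.update_rule("beta-dashboard", lambda ctx: ctx["plan"] == "pro")
print(pro_messages, free_messages)
```

Compared with the ping-stream model, the update arrives in a single push, so there is no burst of follow-up requests to spread out with jitter. The cost is the memory and CPU needed to keep and evaluate one context per open stream.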
Additional innovations came from building our own streaming solution, including the creation of our relay proxy. For customers running lots of SDKs, streaming meant a large number of open streams, and having so many open connections was problematic for the customer's NAT. With the relay proxy, instead of each SDK establishing a connection with our services directly, the SDKs connect internally to the relay proxy, and the relay proxy maintains a single connection to our services.
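The fan-out pattern behind a relay proxy can be sketched in a few lines. The `Upstream` and `RelayProxy` classes here are hypothetical in-memory stand-ins that only illustrate the connection topology; they do not reflect the actual relay proxy's code or configuration.

```python
from typing import Callable, Dict, List

Update = Dict[str, bool]
Handler = Callable[[Update], None]


class Upstream:
    """Toy stand-in for the remote streaming service (hypothetical)."""

    def __init__(self) -> None:
        self.connections: List[Handler] = []

    def connect(self, on_update: Handler) -> None:
        self.connections.append(on_update)

    def publish(self, update: Update) -> None:
        for handler in self.connections:
            handler(update)


class RelayProxy:
    """Holds one upstream connection and fans updates out to local SDKs."""

    def __init__(self, upstream: Upstream) -> None:
        self._local: List[Handler] = []
        self._latest: Update = {}
        upstream.connect(self._on_update)  # the only outbound connection

    def connect(self, on_update: Handler) -> None:
        self._local.append(on_update)
        on_update(dict(self._latest))  # bring a new SDK up to date

    def _on_update(self, update: Update) -> None:
        self._latest.update(update)
        for handler in self._local:
            handler(dict(update))


upstream = Upstream()
relay = RelayProxy(upstream)
inboxes: List[List[Update]] = [[] for _ in range(50)]
for inbox in inboxes:
    relay.connect(inbox.append)  # 50 SDKs connect inside the customer's network
upstream.publish({"new-checkout": True})
print(len(upstream.connections), inboxes[0][-1])
```

However many SDKs connect locally, only one connection crosses the customer's network boundary.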
As we traveled down the road from polling to streaming, we thought about the impact it would have on our customers. Customers today see rapid updates when flags change, with significantly less bandwidth consumption. The benefits of this architecture don't stop with our customers; they continue down to their customers. Battery life on mobile devices is extended, as the devices no longer have to spend power on polling. With LaunchDarkly's streaming architecture, you can be confident your customers receive the most up-to-date flag configurations as quickly as possible.