Service Protection at Scale with LaunchDarkly

Maxwell Gerber, MuleSoft

Service uptime is critical for maintaining customer trust. Protection mechanisms such as rate limiting help in maintaining uptime. LaunchDarkly can be used as an operational dashboard, with powerful tools to prevent downtime before it happens and resolve incidents faster when it does.

Maxwell Gerber

Software Engineer, Identity and Access Management @ MuleSoft. Maxwell works on OAuth / OpenID and the policy engine for the Anypoint platform. NJ native, California transplant, UC Berkeley alum, kombucha fermenter, long distance runner, amateur chef.

(upbeat music) - Hi, I'm Max Gerber. I'm a Software Engineer at MuleSoft, where I spend most of my time on Identity and Access Management. Today, I'm here to talk about service protection, and how LaunchDarkly can be used to help you achieve your service protection goals.

First off, what is service protection? Service protection is a huge field. I'm referring to techniques that a team can use to ensure that the web services they run see as little downtime as possible. Common service protection techniques at the infrastructure level might include running read replicas or putting caching in front of your API. Today, I'm gonna talk about service protection techniques at the application level, things that you write custom code to implement. Specifically, I'm going to be talking about techniques that come in handy as your customer base grows. Growth happens along two axes: you'll have new customers using your service, and existing customers will store more data and interact with your service more frequently. Both of these have the potential to cause performance issues and downtime. This is sometimes referred to as the noisy neighbor problem. If you have one customer hammering your API, your other customers might notice the performance degradation, or you might end up in a full-blown incident.

Here's an example of service protection done poorly. Do you remember the crypto craze back in the winter of 2017? CryptoKitties is a blockchain game that lets users create and trade digital cats on the Ethereum blockchain. The Ethereum network had a lot of trouble handling the increased load caused by the popularity of the game when it was first released. Existing users saw a massive slowdown in transactions and an increase in the costs they were paying per transaction. Now, this example isn't perfect. Ethereum is a distributed application, so effective service protection is a lot harder. Most of us are running traditional centralized applications, so we can do much better.

Now, there are two techniques I'm going to describe. Implementing one or both of these techniques gives you an upper limit on how much a given customer can stress your system. The first and most common technique is rate limiting. Rate limiting is a broad category of measures to prevent denial of service due to excess calls on the system. A simple rate limiting policy might say that a client is only allowed to call an endpoint 20 times a minute. Rate limiting is important because each additional request eats up a little bit of your service. If a user decides to log in once, that's completely fine. If a user decides to send 10,000 login requests at the same time, that's probably gonna cause issues.
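To make the "20 times a minute" policy concrete, here is a minimal sketch of a fixed-window rate limiter. It is only an illustration under stated assumptions: the class name, the in-memory storage, and the fixed-window strategy are all hypothetical choices, not the library used in the talk's demo.

```python
import time


class FixedWindowRateLimiter:
    """Allow at most `limit` calls per client in each `window_seconds` window.

    Hypothetical, in-memory illustration; a real service would usually keep
    these counters in a shared store so every instance sees the same counts.
    """

    def __init__(self, limit=20, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so behavior can be checked without sleeping
        self._windows = {}  # client_id -> (window_start, calls_in_window)

    def allow(self, client_id):
        """Return True if this call is within the client's limit, else False."""
        now = self.clock()
        start, count = self._windows.get(client_id, (now, 0))
        if now - start >= self.window_seconds:
            # The previous window has expired, so start a fresh one.
            start, count = now, 0
        if count >= self.limit:
            self._windows[client_id] = (start, count)
            return False
        self._windows[client_id] = (start, count + 1)
        return True
```

Each client gets its own counter, so one customer exhausting their budget has no effect on anyone else's, which is the noisy neighbor protection described above.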

The second technique isn't as well known, but is equally important. Collection limiting is controlling the maximum size of customer data to prevent resource contention. One example of this in the wild is that your Gmail account can only have up to 30 aliases configured. Is there a technical limitation behind this? Not really, but by capping each user to a known fixed amount, the people running Gmail know that nobody is going to create a user with 10,000 aliases, which might cause other parts of the system to struggle. If you have strict collection limits, you can do some cool things like calculate the maximum cost of a particular database join. Both of these techniques allow you to improve the availability and reliability of your system by preventing noisy neighbor effects. You know exactly how much a single customer can stress out your system.

Of course, there are always issues with any new feature, and service protection is no different. Service protection is almost always tied into billing. The more a customer pays, the more resources they get. Plenty of SaaS companies charge per user, or have specific tiers. This means that setting your limits is often a group effort requiring input from engineering, sales and marketing departments. For engineers and Ops, you really want to be thinking about the operability of your system. Is it easy to tell when customers are hitting a limit? What do you do when this happens? Is upgrading the customer to the next tier a simple process? How do we audit the changes we make to customer accounts?

When you end up running a rate limiting or collection limiting system, you'll find that it works great up until it doesn't. The last thing you want is your largest customer hopping on the phone to complain to your boss that they're hitting their limits. Or sometimes you set rate limits too high, and you need to reduce them to get out of an incident. Being able to update limits on the fly is a huge benefit.
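Going back to collection limiting, the alias cap mentioned above can be sketched as a check at write time. This is a hypothetical illustration only: `AliasStore`, `CollectionLimitExceeded`, and the in-memory dictionary are invented names and do not describe how Gmail or the talk's demo implement it.

```python
class CollectionLimitExceeded(Exception):
    """Raised when adding an item would push a collection past its cap."""


class AliasStore:
    """Hypothetical store that caps how many aliases each user may hold."""

    def __init__(self, max_aliases=30):
        self.max_aliases = max_aliases
        self._aliases = {}  # user_id -> set of alias strings

    def add_alias(self, user_id, alias):
        aliases = self._aliases.setdefault(user_id, set())
        if alias in aliases:
            return  # already stored; it does not count against the cap twice
        if len(aliases) >= self.max_aliases:
            raise CollectionLimitExceeded(
                f"user {user_id!r} already has {self.max_aliases} aliases"
            )
        aliases.add(alias)

    def count(self, user_id):
        return len(self._aliases.get(user_id, ()))
```

Because the cap is enforced on every write, any code that later reads a user's aliases can assume it will never see more than `max_aliases` items, which is what makes worst-case cost estimates possible.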

Now, some companies might think about how to automate all of this upfront. Adaptive rate limiting is a super interesting field of study, and you can find tons of white papers pretty easily. But automating fuzzy problems like this is really hard. You need to understand the business implications of changes, as well as the technical implications. It's much easier to have a human in the loop making the decisions. I hope you can see the thesis that I'm building towards. LaunchDarkly is a fantastic data store for configuring limits per customer or per tier. LaunchDarkly's standard pitch is that they're a feature management system, but they're also a fantastic system for what they call operational flags, things that control key system behaviors that will never really go away. Permanent flags, like circuit breakers. There are some great posts on the LaunchDarkly blog, so I won't go too deep into operational flags. The basic idea is this: write code to tie your business logic to limits, and use LaunchDarkly to target limits for individual customers.

Now, I'm gonna show a quick demo. The final functional project is available on my GitHub under MaxwellGerber slash Launchdarkly dash ratelimit dash demo. This is all the code we're gonna look at.

Here, we're defining a rate limiting policy, or middleware, that can be attached to any API route. We're using a plug-and-play third-party library to handle the actual rate limiting logic. The only part we need to care about is determining the bucket size: how many calls a user is allowed per minute. We do this by basically just calling LaunchDarkly. I want to point out our tenancy model. We have users, each with their own unique user ID. Those users are grouped into customer accounts. We pass the customer account ID into LaunchDarkly as a custom attribute, which allows us to do targeting based on that as well. You could easily do targeting based on any other customer feature. If you attached a tier to them, like free, basic, or premium, you can do targeting based on the tier.

Now, here are the variations set up in LaunchDarkly. We have four: low, high, unlimited and banned. Now here are targets for individual users. This is my service, and as an admin, I want to call it as much as I want. So, I've targeted my user ID for Max to be served the unlimited variation. Alice was one of our users, but she violated our terms of service, so we banned her for a week. Finally, we have a very important customer account. They pay us a lot of money, so we bumped up their limits. Any user attached to this account is going to be served this higher tier.
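The targeting described above can be sketched as a lookup: individual user targets are checked first, then account-level rules, then a default. The sketch below is a stand-in that runs without any SDK. The dictionaries, function names, account ID, and the numeric values for low and high are all assumptions made for illustration; in the demo this decision is made by evaluating a LaunchDarkly flag, not by local tables.

```python
# Hypothetical variation values. Only the four variation names come from the
# talk; the numbers for "low" and "high", and the use of None to mean
# "no limit", are assumptions.
VARIATIONS = {"low": 20, "high": 200, "unlimited": None, "banned": 0}

# Hypothetical stand-ins for the flag's targeting configuration.
USER_TARGETS = {"max": "unlimited", "alice": "banned"}
ACCOUNT_TARGETS = {"vip-account": "high"}  # account ID is invented
DEFAULT_VARIATION = "low"


def resolve_variation(user_id, account_id):
    """Pick a variation: user targets first, then account rules, then default."""
    if user_id in USER_TARGETS:
        return USER_TARGETS[user_id]
    if account_id in ACCOUNT_TARGETS:
        return ACCOUNT_TARGETS[account_id]
    return DEFAULT_VARIATION


def calls_per_minute(user_id, account_id):
    """Return the bucket size for this user, or None for unlimited."""
    return VARIATIONS[resolve_variation(user_id, account_id)]


def is_allowed(calls_so_far, limit):
    """Check a call count against a limit, treating None as unlimited."""
    return limit is None or calls_so_far < limit
```

With this shape, the rate limiting library only ever sees a number (or "no limit"), and all decisions about who gets which number live in one place that a human can edit.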

Now, here's where a lot of the value of the LaunchDarkly platform starts to shine. Their built-in audit log is wonderful, showing when changes were made, what the changes were, who made them and why. This is great, because when you're messing around with customer data, you need to keep a record of what happened. And when everything is on fire and you need to get out of an incident quickly, it's better to have tools that do the tracking for you. Remembering to write stuff down doesn't always work. Looking over my audit log, I see that six days ago, I updated the value of the banned variation from zero to one. Does this make sense? Probably not. Having an audit log is the best way to improve discoverability of issues like this.

Well, that's all I've got. To recap, service protection techniques help preserve uptime by limiting the amount of resources any one customer can consume. Service protection rules are hard to set correctly and may need to be adjusted often. Using LaunchDarkly as a data source and admin UI works great. They have a ton of built-in features that play nicely with this use case. Thank you. (upbeat music)