Testing in Production without Compromising Availability
In May we brought our Test in Production Meetup to Paris, and our friends at Algolia were kind enough to host us. Algolia’s own Senior Engineer, Xavier Grand, spoke about how his team at Algolia tests when they have 4500 customers, in 16 regions, searching 14 billion queries per month.
“Testing in production is something you can do at any scale because you can start with a blue green deployment with only two instances on a load balancer in front. When you start by doing that at the beginning you will design your software in a way that it’s resilient and you will take technical choices that you wouldn’t without.”
Watch Xavier’s talk to learn more about why his team tests in production and how they do that without compromising availability. If you’re interested in joining us at a future Meetup, you can sign up here.
So, hello everyone, I'm Xavier Grand, a software engineer at Algolia, and I will speak about testing in production, especially how to do it without compromising availability.
And now it should be good. Not more, so let's do without that. So, first, what is Algolia? So Algolia is a search API. We are available in 16 regions and in those 16 regions, we have 55 data centers to be able to provide the best availability possible.
This infrastructure serves more than 13 billion queries per month, and we now have more than 4500 customers.
So the agenda for today is pretty simple. First we will speak about why you want to do that, then how to do that properly, and finally how we do that at Algolia.
So let's start with why you want to do that, and why you don't want to test everything up front, which would seem simpler. The first reason is just the complexity. At Algolia we have more than 20 search parameters to configure the ranking. If they were only booleans, that is already a million tests to write to cover all the possible combinations. It would be really hard to implement all of them, so your implementation gets slower just to write the tests. Then, once you're happy that you wrote them, just running them is again a pain, so your implementation is again slower just to validate your changes. And the code, it's tested … code so you have to maintain it. If you want to refactor something, you will also end up refactoring the tests. And finally, something that is nearly impossible to do is to test performance, because you may optimize one request but it may put more load on your database, and the average response time will be higher just because of that. So you need to test on a real production environment, and to do that outside production you would need to duplicate your infrastructure.
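The "million tests" figure can be checked with a few lines of arithmetic. This is only a sketch: it assumes exactly 20 independent boolean parameters, whereas the talk says "more than 20" and real search parameters are not all booleans, so the true number would be larger. The function name is hypothetical.

```python
from itertools import product


def count_combinations(num_boolean_params: int) -> int:
    """Number of distinct settings of n independent boolean parameters."""
    return 2 ** num_boolean_params


# Enumerating a small case confirms the formula before trusting it at n = 20.
small_case = list(product([False, True], repeat=3))
assert len(small_case) == count_combinations(3)

print(count_combinations(20))  # 1048576
```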
Then why don't you do everything in production? Because you want to keep your customers happy, so you can't. The good compromise is to do both at the same time. The coded tests are there to avoid making the same mistake twice, because during the implementation phase your customers will be happy to help you improve your product, but once a feature is in production they don't expect those issues to come up again. The tests are also a very good way to validate a new implementation or a change, so you still need them. And when you are a new contributor on a big project, it's always good to have tests, because as the main contributor it's easy to remember, "Okay, this function does that. Here are the preconditions for this function. I know how it works." For a newcomer, the only source of truth is that the tests are passing.
And on the other side, you will leverage testing in production to discover the edge cases, because okay, you don't test the one million possibilities, but you will still discover them in production. So that's why you want that. And for sure that's where you will really be able to validate your performance optimizations.
So once we know why we want to do that, now how to do that properly, because testing in production is risky and you want to keep your service available. The first thing is the software design, and the first rule is: don't trust anything, because everything can break. If, for example, you rely on other services, maybe they are also testing in production, so they can return corrupted data. If you expect the … it will maybe end up broken and so if your application doesn't handle it, it can pop up to the end user, and that's not what you want. Then inside the code, if a procedure has preconditions, for example a function expects an integer greater than zero, you should test that inside the function, because maybe somebody else will call your function with the wrong parameters. So it's always good to have those preconditions tested, and if they fail, the best is to also produce a detailed error, so to log something, because the worst thing in production is to have something broken and not know why it's broken.
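A precondition check of the kind described here could look like the following. This is a minimal sketch, not Algolia's code: the function `page_count`, its parameters, and the logger name are all hypothetical, chosen only to illustrate "an integer greater than zero" being checked inside the function and logged with detail when it fails.

```python
import logging

logger = logging.getLogger("search")  # hypothetical logger name


def page_count(total_hits: int, hits_per_page: int) -> int:
    """Return how many pages are needed to show total_hits results.

    Precondition: hits_per_page is an integer greater than zero.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    valid = (
        isinstance(hits_per_page, int)
        and not isinstance(hits_per_page, bool)
        and hits_per_page > 0
    )
    if not valid:
        # Log the offending value so a production failure explains itself.
        logger.error(
            "page_count: hits_per_page must be an integer > 0, got %r",
            hits_per_page,
        )
        raise ValueError(f"hits_per_page must be an integer > 0, got {hits_per_page!r}")
    if total_hits < 0:
        logger.error("page_count: total_hits must be >= 0, got %r", total_hits)
        raise ValueError(f"total_hits must be >= 0, got {total_hits!r}")
    # Ceiling division without floating point.
    return -(-total_hits // hits_per_page)
```

A caller that passes `hits_per_page=0` gets a `ValueError` and a log line naming the bad value, instead of a `ZeroDivisionError` surfacing somewhere further along.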
And finally, to test performance and usage, it's good to let your software produce metrics and so on, so that you will be able to exploit them afterwards, because you can guess that a change will improve performance, but if you don't measure it, it's not really a good thing.
Then once you have the software ready, you need to have the infra to be able to test in production. The first thing to do is to have at least two instances: one to run the stable code and the other one to test the new code that may break. And even if you have two instances, it's still good to have back-ups, because, for example, if you have two front-end instances and they rely on the same database, and the new code corrupts your database or deletes it, okay, the other one will still be impacted. So back-ups are also something to consider when testing in production.
So once you have the software and the infrastructure ready, the last thing to do is to think about how you want to deploy. If, for example, you have multiple environments, say one infrastructure for internal development, one for staging, and one for production, you can limit the impact on your production by first deploying on testing and letting your internal developers try it and break it. Then if it's fine, move it to staging and let somebody else try to break it. So an issue should be caught on the first or second step, and finally, when it's in production, you are pretty confident that it won't break. And with that you have progressive deployment, because usually you will have a few testing machines, a bit more staging, and a lot of production machines.
The second thing to consider is to always deploy something that you can roll back, because if, for example, you deploy something and you need seven hours, a day or two to fix it, then it means that your service will be broken during this period of time. Whereas if you can roll back, you will just roll back to the previous version, take enough time to fix it properly, and then push it to production. Because a thing that you may do when you can't roll back is to push a quick fix that is even worse, and so just push the issue further. And as I explained before for the logging, if you need to roll back it's also good to know why you need to roll back, because you will have to fix your issue. So if you roll back and you have to deploy again to see, "Ah, the issue is actually there so now I can fix it." Okay, you lost. So the best is to have logs that tell you why it failed, so you can debug quickly.
Then if your application is producing metrics and your monitoring is testing your API, you have multiple sources of information. So if the monitoring is broken, the metrics produced will be a second source of information, and it's better to have several of them to be able to assess how your software behaves. And finally, if you can leverage that to roll back automatically, it will be faster than any human, because the software will be able to roll back a few seconds after the health check, while just typing the command takes more than a few seconds. So that's how to do that. Now how we do that here.
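The automatic rollback idea described above can be sketched in a few lines. This is an illustration under assumptions, not Algolia's tooling: `deploy` stands for whatever callable installs a version, the health checks are plain functions returning a boolean, and every name here is hypothetical.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("deploy")  # hypothetical logger name


def check_passes(check: Callable[[], bool]) -> bool:
    """Run one health check; a check that raises counts as a failure."""
    try:
        return bool(check())
    except Exception:
        logger.exception("health check %s raised", getattr(check, "__name__", check))
        return False


def deploy_with_auto_rollback(
    deploy: Callable[[str], None],
    health_checks: Iterable[Callable[[], bool]],
    new_version: str,
    previous_version: str,
) -> str:
    """Deploy new_version; roll back to previous_version if any check fails.

    Returns the version that is left running.
    """
    deploy(new_version)
    failed = [
        getattr(check, "__name__", repr(check))
        for check in health_checks
        if not check_passes(check)
    ]
    if failed:
        # Record why we roll back, so the bug can be investigated
        # without redeploying the broken version.
        logger.error("rolling back %s, failed checks: %s", new_version, failed)
        deploy(previous_version)
        return previous_version
    return new_version
```

In a real setup the checks would be the ping and indexing probes, and `deploy` would call the actual deployment tooling. The point of the sketch is only that the rollback decision needs no human in the loop.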
So as I said, we are a search API, and every time you subscribe to Algolia you have access to a cluster of three nodes. Each node of the cluster can answer read operations independently, so even if two nodes are broken, the third one will still answer your requests. Then we decided to move the retry strategy to the client level, because since our clusters are spread across multiple data centers, we didn't want to add response latency when you retry on another machine. For our database, since we are a search engine, we have indices, and those indices are in a custom binary format. When, for example, a customer updates a product, we don't spread the processed data inside the cluster. We spread the diff, so that if one node corrupts its index, the corruption won't spread to the others. And then inside the code we have multiple checks triggering alerts, so that when there is an issue, we know it directly after the deployment. And in the worst case, where the program crashes, we'll have a core dump with the complete stack of where it crashed, and we'll be able to dig inside.
Then for the monitoring, we have three sources of information. The first one is the ping probes: their goal is to check that a machine is available and that you are able to perform a read operation on it. The second one is the indexing probe: its role is to check that you are able to send new data and replicate it inside the cluster. And the third one is time series metrics: search capacity, number of incoming requests, incoming indexing operations. Based on all this information, we trigger an alert if a value goes above a certain limit. So when you deploy, within a few minutes you are aware if something is broken.
And so the last thing is the deployment. To be able to track all the bug fixes and so on, we only have two versions running at the same time. We have more than 1,500 servers, so it is really a good thing to have only two versions, because you can quickly end up with five of them, and then when you have an error, remembering which version fixed which issue starts to be a pain. So that's why we have only two versions: the previously deployed one and the one currently in deployment. And we release on those servers every week to keep the changes small, because the bigger the changes are, the harder it is to find which one triggered the bug.
Then for the rollback, we always make the changes backward compatible, so that we don't have to go through migration processes. With a migration, if you stop in the middle, you are screwed, because you can neither go back to the previous version nor move to the next one. And since we have a binary format, we also need to handle changes inside this binary format. For that, we always go with a two-step deployment, where first we deploy the capability to read the new format and then we deploy the capability to write it. So, in that case, we can roll back at any step and we are still able to read the indices.
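The two-step idea (ship the reader first, the writer second) can be illustrated with a toy format. Everything below is a made-up example, not Algolia's index format: version 1 is a version byte followed by UTF-8 text, and a hypothetical version 2 adds a 4-byte length field.

```python
import struct


def encode(text: str, write_version: int) -> bytes:
    """Serialize text in the given (hypothetical) format version."""
    payload = text.encode("utf-8")
    if write_version == 1:
        return bytes([1]) + payload
    if write_version == 2:
        # v2 adds a big-endian 4-byte payload length after the version byte.
        return bytes([2]) + struct.pack(">I", len(payload)) + payload
    raise ValueError(f"cannot write format version {write_version}")


def decode(blob: bytes) -> str:
    """Read either format version; this reader ships in step one."""
    if not blob:
        raise ValueError("empty blob")
    version = blob[0]
    if version == 1:
        return blob[1:].decode("utf-8")
    if version == 2:
        (length,) = struct.unpack(">I", blob[1:5])
        return blob[5:5 + length].decode("utf-8")
    raise ValueError(f"unknown format version {version}")


# Release N:   decode() understands v1 and v2, writers still emit v1.
# Release N+1: writers switch to v2.
# Rolling back from N+1 to N is safe: release N can already read v2 data.
STEP_ONE_WRITE_VERSION = 1
STEP_TWO_WRITE_VERSION = 2
```

The constraint to respect is ordering: the second release only goes out once every server runs the first one, so that no node ever meets a format it cannot read.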
And finally, for the procedure, we have an automatic rollback. A few seconds after the deployment, we perform all the health checks to see that the API is running, that it's able to index, and so on. And then for the procedure, I'll explain it in a bit more detail. First, we divided our servers into 10 different groups. We have first all the testing machines. Then we have staging, self-serve, and enterprise, and we divided those last three groups into three subgroups each, where each subgroup is one of the nodes inside a cluster. So then, how do we deploy on those 10 groups? First we deploy on testing. On testing we don't have any critical data. If we corrupt data, if we lose data, it's not an issue. It's only used internally with temporary indices, so we don't care about losing data there.
Then we deploy on staging, but only on one machine, because here on staging we have data that we want to keep. It's still used internally, but we don't want to lose the data. So we first deploy on the first machine, then we deploy on self-serve. Why we deploy on self-serve before enterprise is not because they pay less than enterprise. It's because on one self-serve cluster we have more than a thousand customers, so deploying on one cluster of our self-serve will test the same as deploying on one thousand servers of enterprise. That's why we do that, and also because if you deploy on one machine, you can easily fix this machine if you have an issue, whereas if you have one thousand machines to fix manually, you won't be able to make it in less than a couple of days. So that's why we deploy on self-serve and then enterprise, to finally test all the use cases. And once the deployment is done on the three first machines, we do the same on the second nodes, and then on the third nodes. With that, if you have an issue that only shows up after a few hours or days, you will catch it on the first nodes before the last ones are actually deployed.
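The rollout order just described (testing first, then one node per cluster across staging, self-serve, and enterprise, repeated for the second and third nodes) can be written down as a short sketch. The group labels and the function name are hypothetical. Only the ordering comes from the talk.

```python
from typing import List

TIERS = ("staging", "self-serve", "enterprise")
NODES_PER_CLUSTER = 3


def rollout_order() -> List[str]:
    """Return the 10 deployment groups in the order they receive a release."""
    order = ["testing"]  # no critical data here, so it goes first
    for node in range(1, NODES_PER_CLUSTER + 1):
        # Within a node wave: internal data first, then self-serve, then enterprise.
        for tier in TIERS:
            order.append(f"{tier}/node-{node}")
    return order


for step, group in enumerate(rollout_order(), start=1):
    print(step, group)
```

Because a wave touches only one node per cluster, the two other nodes of every cluster keep serving reads on the previous version while the new one is being observed.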
So to sum up about that, testing in production is something you can do at any scale because you can start with a blue green deployment with only two instances on a load balancer in front. When you start by doing that at the beginning you will design your software in a way that it's resilient and you will take technical choices that you wouldn't without. And it's exactly the same for your infra. Since we started to do that at the beginning, our infra is pretty resilient, and the only place where we failed was mainly on how to deploy progressively. And the good thing is that it makes the implementation phase faster, because you write the tests to cover 90% of the cases, then you test all the features as they are used in production. So you are able to iterate quickly with that, because you are never afraid to break something: the infrastructure is resilient enough to handle it for you. And so that's it.