Transitioning Production Tests into a Synthetic Monitoring Solution

Jonathan Hare Winton speaking at Test in Production

In June the Test in Production Meetup went to London, and our friends at Intercom hosted us in their Shoreditch offices. Jonathan Hare-Winton spoke about how and why his team at The Guardian used synthetic monitoring.

“Whereas like, nowadays the teams that I’m working on everything was a micro-service, we were deploying probably 30-40 times a day across these various services. Prod is just like this moving target, it’s like whack-a-mole, so even if you run tests on one of your newly deployed micro-services, you run tests on it and it’s great then, but then another service gets deployed and suddenly yours starts behaving differently.” – Jonathan Hare-Winton, SDET at Spotify

Watch Jonathan’s talk to learn more about what synthetic monitoring is, why you’d want to do it, and how you can take tests you have running in production and turn them into synthetic monitoring. If you’re interested in joining us at a future Meetup, you can sign up here.

TRANSCRIPT

Jonathan: Hi everyone, yeah my name is Jonathan Hare-Winton. I’m a software engineer, I work in one of the quality teams at Spotify. I’m off brand today, this is not a Spotify talk, I’m not talking about anything I’ve done at Spotify.

This is going to be mostly about something I used at my previous job, which was at The Guardian, which is for those of you who don’t know, a big UK newspaper. And yeah we are going to be talking about synthetic monitoring, which is something you can do if you’ve got tests running in production like bot tests and things like that. Taking them a little bit further and doing a bit more with them.

So I’ve already covered who I am, so yeah production tests to synthetic monitoring. So we are going to talk about what actually synthetic monitoring is. Why you’d wanna do it in the first place, how you’d then go from taking tests that you’ve been running in production and turning them into a synthetic monitoring solution and then the challenges that go with that.

I’m going to have to talk very, very quickly to do this I think because I’ve got about 15 minutes. So this is a very long wordy definition of synthetic monitoring, it’s also known as active monitoring or proactive monitoring. It’s a monitoring technique that is done by using emulation or scripted recordings of transactions.

So essentially what this is saying is it’s rather than, if you think of most forms of monitoring our systems will look at things like error rates, number of 500 errors, all that kind of thing. These are very passive monitoring, you’re reliant on things happening on your system for it to then tell you things about those. Whereas synthetic monitoring is the idea that you create these actions yourself for the purposes of monitoring, to work out what’s going on on your production systems.
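To make the active-versus-passive distinction concrete, here is a minimal Python sketch of a single active probe: instead of waiting for user traffic to produce errors, the monitor makes the request itself and records what happened. The names `probe`, `ProbeResult` and `http_status` are hypothetical, and treating any 2xx response as healthy is an assumption, not something stated in the talk.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    url: str
    ok: bool
    status: Optional[int]  # None when no HTTP response was received at all
    seconds: float


def http_status(url: str, timeout: float) -> int:
    """Return the HTTP status code for a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the status code is still useful.
        return err.code


def probe(
    url: str,
    timeout: float = 5.0,
    fetch: Callable[[str, float], int] = http_status,
) -> ProbeResult:
    """Actively exercise an endpoint and report status and latency."""
    started = time.monotonic()
    try:
        status: Optional[int] = fetch(url, timeout)
    except OSError:
        # Connection failures and timeouts are OSError subclasses.
        status = None
    seconds = time.monotonic() - started
    ok = status is not None and 200 <= status < 300
    return ProbeResult(url=url, ok=ok, status=status, seconds=seconds)
```

The `fetch` parameter is only there so the probe can be exercised without a network; in a real monitor the default `http_status` would be used against an endpoint of your own.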

But why would you want to do this? Because we have tests, we test things. So, most of my teams that I work on at the moment you know we will test things before we merge code in, then we will test some of the build before it gets deployed and after it gets deployed we will probably have some smoke tests in production as well to make sure that everything is fine. And that’s all well and good and then you rely on monitoring afterwards, so if suddenly you see a spike in errors you know that something is going wrong.

But what if you are in a scenario where these things kind of don’t work? This is what happened to me when I was working at the Guardian. I worked on our editorial tools team, and we built all the tooling in house that our journalists use to create content. Whereas the people who worked on our website, that team, they could rely completely on passive monitoring to know what was going on in production because they were getting millions of hits per day. They could tell something was wrong very, very quickly because their passive monitoring could tell them.

Whereas our editorial tool probably had 12 concurrent users at any one point. You know, you’ve got an editorial team of about a thousand, but with not very many people writing things or particularly publishing content at the same time, it’s very difficult.

Classic monitoring isn’t really going to tell you everything about that, so what we tended to have was we would deploy something, we’d run our kind of smoke tests in production and be like, “Oh yeah, cool everything’s fine.” and then a journalist would come over, you know anything from two days to an hour later, and be like, “Everything’s wrong.” And we’d look at our dashboards and “No, no, no, it’s fine, it’s fine” and then everything would start going red. So our monitoring was slower than our users coming over and saying this doesn’t work.

So we had this kind of perfect storm of, as our product manager put it, having no idea of what’s going on in production, because we’d got all this great monitoring set up, we were testing in production as well, and it didn’t tell us anything. And this kind of scenario, you hear about it happening to loads of people these days, because if you think back in the day when people used to deploy like, once every two weeks or whatever, production was static. Production was super, super static, so if you tested it, it was basically going to stay the same. You’d have your traffic fluctuations and stuff like that, but you’re in a kind of a safe space.

Whereas like, nowadays the teams that I’m working on everything was a micro-service, we were deploying probably 30-40 times a day across these various services. Prod is just like this moving target, it’s like whack-a-mole, so even if you run tests on one of your newly deployed micro-services, you run tests on it and it’s great then, but then another service gets deployed and suddenly yours starts behaving differently. It’s like it’s this moving target that you’re trying to hit and know what’s going on.

So we found this gap: we needed to know much, much quicker what is actually going on in production. So what we started to do was build out a system. We kind of took the production tests that were already running, which published a piece of content and then took it down again, and we started running them continually. So we would be publishing about a thousand pieces of hidden content per day and then seeing what happened with them.

So when we first started doing this we hadn’t heard of the term synthetic monitoring, we were just like, let’s just run our tests like constantly, get them to tell us things, see what happens. We thought we were great, we’d invented something new and then we found out it’s actually called synthetic monitoring and people have been doing it for ages, so we felt considerably less clever then.

We started this lo-fi: we ran some tests, I had a Jenkins box and I would just be running them, and then it starts to scale up and you start thinking about what you can do. So I’m kinda being quite generic, but I’ll use examples from some of the systems we built to show what you’d actually need if you wanted to start running tests in production as synthetic monitoring.

You immediately start thinking about testability, so using the example of the piece of Guardian content, we have the ability to hide content and you would only be able to view it from within our building, from our network for ages. So we could publish a piece of content no one externally would be able to find, but one of the other things we wanted to monitor was taking content down because that’s like the scariest thing for a newspaper.

If you can’t publish that’s embarrassing, but if you think you published something that turns out to be incorrect you could get sued for libel, all that kind of thing. So being able to take stuff down is super, super urgent, that was actually the scariest thing. So we were building this system really to know that, because take down doesn’t happen very often, it’s something you always wanna know works all the time.

So we started finding things like, we built in testability so that we can hide content. But when we started doing this constantly, taking down a piece of content basically every 30 seconds, it turned out our legal team would get a report automatically every time. So we scared the crap out of our legal team for a few hours, they were like, “What is going on?” So we had to start thinking about this: you know, you are testing in production, you’ve got to be super, super safe. So you need to be at the point where your users aren’t gonna see anything, so you build testability into your system to allow it to accomplish that.

So yeah, staying hidden, clean data and no user impact, you don’t want the user to ever notice this. So what also came up, because we’re going to be publishing an actual piece of something to the website, it was incredibly hard to find, but just in case anything went wrong with that, filtering or anything like that, we were like, “Well, this piece of content, what’s it actually going to be?” And someone said “Oh you could just do a fake news article.” We then spent a bit of time with our legal people and cleared that we would have a piece that is completely clear about what it is, it says “I am test content.” and it’s kind of a small blog post explaining why it exists. I mean don’t go on theguardian.com and try and find this, you won’t find it, it’s basically impossible, but just in case anything ever goes wrong with it we’re covered.

So I simply started off and started to scale this up into a service on its own, and in other places where I’ve done this subsequently, this is kind of the really tricky part. How often do you want to be running tests? If you’re doing something super, super simple, you’re just hitting certain API endpoints or something like that, you could be doing this constantly, incredibly, incredibly fast. It can then mean you’re producing an awful lot of data to parse through, things like that.

So it’s kinda about getting the scheduling right for what you’re doing. For instance, with this Guardian piece of content, it was a full browser automation based test, because we wanted to use it to simulate actual user actions. That actually took quite a long time to run, so we could, as soon as one finishes, start the next one and keep going, because it would take about 30 seconds to one minute to run. But the scheduling is something which is quite tricky to get right, and most test frameworks, if you are using an out of the box test framework, normally don’t have something for, “Run this continually forever.” So you have to kind of build that up yourself.
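As a rough illustration of the “run this continually forever” loop described above, here is a minimal Python sketch. It is not the Guardian’s implementation: `run_forever`, `check`, `on_result` and `min_interval` are hypothetical names, and `max_runs` exists only so the example can terminate.

```python
import time
from typing import Callable, Optional


def run_forever(
    check: Callable[[], bool],
    on_result: Callable[[bool], None],
    min_interval: float = 0.0,
    max_runs: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `check` back to back, starting the next run when one finishes.

    `min_interval` puts a floor on the time between run starts, which matters
    for cheap checks that would otherwise hammer the system. With
    `max_runs=None` the loop never returns, as a real monitor would not.
    Returns the number of runs completed.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        started = time.monotonic()
        try:
            passed = bool(check())
        except Exception:
            # A crashing check is reported as a failed run, not a dead monitor.
            passed = False
        on_result(passed)
        runs += 1
        elapsed = time.monotonic() - started
        if elapsed < min_interval:
            sleep(min_interval - elapsed)
    return runs
```

For a slow browser-driven check, `min_interval` can stay at zero so that runs follow each other immediately; for a fast endpoint check, a non-zero interval keeps the volume of requests and result data under control.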

Then you get into alerting and reporting, so if you think, almost every test framework is geared around pre-production testing. So when you run a test, whether it’s red or green, that triggers an action, so if tests are green, all good, merge your code, do your deploy, whatever. If they go red, all right, investigate: is there something wrong with the test, have you broken something, fix it. Whereas if you’re running tests continuously in production, if they go green, great, you don’t care, everything’s fine, keep them going. If it goes red you really want to know, and you don’t just wanna look at a dashboard or maybe look at your Jenkins job and suddenly it’s shown red, you want to know really quickly because it’s production. So you are there hooking this, by this point very augmented, test framework up to all your alerting, so things like PagerDuty, that kind of stuff. You want to get paged in the middle of the night when this sort of stuff happens.
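The green-is-silent, red-is-urgent behaviour described above could be sketched roughly as follows in Python. `RedAlerter` and the `page` callback are hypothetical; `page` stands in for whatever paging integration you use. Paging only once per run of consecutive failures is an assumption added here to avoid repeat pages, not something stated in the talk.

```python
from typing import Callable


class RedAlerter:
    """Stay quiet on green runs; page once when runs turn red."""

    def __init__(self, page: Callable[[str], None]) -> None:
        self._page = page  # hypothetical hook into a paging system
        self._is_red = False

    def record(self, check_name: str, passed: bool) -> None:
        if passed:
            # Green: nothing to report. Reset so the next failure pages again.
            self._is_red = False
            return
        if not self._is_red:
            self._page(f"synthetic check failed in production: {check_name}")
        self._is_red = True
```

Calling `record` after every run means a passing check produces no output at all, while the first failure after a pass pages immediately.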

And similarly how do you clean up after yourself? You’re creating test data, what’s going to happen to it? Depending on what kind of system it is, if you’re just hitting any of the API endpoints that’s fine, but when we were creating a piece of content, if it was successful: get rid of it, clean it, make it go away. If something breaks and you find an issue, you wanna keep that data there for as long as possible because you want to be able to investigate from there. So having rules about how you’re going to clean up after yourself, how you’re maintaining your data, is really, really important.
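A cleanup rule of the kind described above, delete data from passing runs straight away and keep data from failing runs for investigation, might look like this minimal Python sketch. `SyntheticArtifact` and `should_delete` are hypothetical names, and the seven-day retention default is purely an assumption; the talk only says to keep failure data as long as possible.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SyntheticArtifact:
    """A piece of test data created by one synthetic monitoring run."""
    artifact_id: str
    created_at: datetime
    run_passed: bool


def should_delete(
    artifact: SyntheticArtifact,
    now: datetime,
    keep_failures_for: timedelta = timedelta(days=7),  # assumed retention period
) -> bool:
    """Decide whether a test artifact can be removed."""
    if artifact.run_passed:
        # Nothing to investigate, so clean up immediately.
        return True
    # Failed runs keep their data until the retention period has elapsed.
    return now - artifact.created_at > keep_failures_for
```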

So that was basically how we did it. Super, super quick, this is actually a 45 minute conference talk that I’ve condensed into 15 minutes. So there’s no screenshots or massive amounts of GIFs or anything like that. This is just kinda like an idea of how you can go from basically a standard test framework to the things that you need to build around it to be able to start doing synthetic monitoring. But it does come with a lot of challenges, the first one being that your tests need to be incredibly robust. So again, as I was saying before, each test creates an action. If you think about it, quite often when you’re in those sorts of scenarios with a regular testing framework, say your tests probably get run like fifteen times a day, something like that, and test accuracy is a thing people are super concerned about in the testing and quality industry.

But most tests usually, and this is purely anecdotal from talking to lots of people, there is nothing scientific behind this, most people’s tests middle out at about 95% accuracy, probably. So that means if you’re running your tests 15 times a day you’re gonna see one incorrect failure. And that’s what I mean by accuracy: when the test is green, everything’s good, if the test is red, everything’s bad, it’s not a problem with the test. If you’re getting 95% accuracy that’s absolutely fine if your tests are being run 15 times a day because you’re only gonna see them probably come up once a week. That’s probably not ideal, but that’s kind of proportional to the effort to go any further.

But then if you’re running your tests several thousand times a day and they’ve got 5% flakiness, that’s an awful lot of alerts, particularly if you’ve hooked this up to your paging system and you’re gonna get woken up in the middle of the night, you’re gonna really, really piss people off. So your tests have to be so, so robust. If you’re just using an HTTP library and pinging endpoints, that’s pretty easy to do. If you’re doing something like simulating user actions, even doing browser interactions, that’s really hard, it’s a massive amount of effort. So with the system that we built to publish the content, there was all the effort to build out a product that could do this, but probably 85% of the time we spent on it was spent on getting the tests like minutely perfect.
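As a back-of-the-envelope check on why run frequency changes the picture, the expected number of spurious failures per day is simply runs per day multiplied by the false-failure rate. The Python sketch below assumes every run fails spuriously with the same independent probability, which is a simplification; `expected_false_alarms` is a hypothetical helper, and 1,000 runs a day is used as a round figure.

```python
def expected_false_alarms(runs_per_day: float, accuracy: float) -> float:
    """Expected spurious failures per day for a given test accuracy (0 to 1)."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return runs_per_day * (1.0 - accuracy)


# 95% is the anecdotal figure from the talk; 99.92% is the accuracy the
# speaker quotes later for the Guardian system.
for runs, accuracy in [(15, 0.95), (1000, 0.95), (1000, 0.9992)]:
    alarms = expected_false_alarms(runs, accuracy)
    print(f"{runs} runs/day at {accuracy:.2%} accuracy: {alarms:.2f} false alarms/day")
```

Under these assumptions the same 95% accuracy gives under one false alarm a day at 15 runs but around 50 a day at 1,000 runs, while 99.92% brings 1,000 runs back down to roughly one a day.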

Which also, kinda touching on that, is keeping the noise down. You are gonna have test flakiness, it’s gonna happen. We got to, we worked it out as about 99.92% accuracy. You’re still gonna get some false positives, weird things happen, the internet happens, stuff slows down, all that kind of stuff. So you need to have tolerance, you can’t alert on everything. Someone I worked with who was doing this, his team was like a regular back end API, they used test retries.

A lot of test frameworks have this, you can annotate the test and say, “Retry”, if it fails just do it again, and that for them was kinda fine. For us, we started doing that and then realized we missed all our intermittent issues. Because if you’re running a thing loads and then you say, “Oh if it fails just run it again, and if that’s ok, cool, don’t alert anyone”, you never hear about them. So we built a tolerance into our system, so that each test run was aware of the last 20 test runs. If you got one failure it kind of went to a warning state, then because it was quick enough, if it failed again any time in the next 20 then you alert, because you’ve probably got intermittent issues, something’s going wrong. So you want these kind of tolerance levels in your system when you’re gonna be running these tests over and over again, just to deal with these slight inaccuracies that you’re gonna get.
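The tolerance rule described above (one failure in the recent window is a warning, a second failure within that window is an alert) can be sketched in Python with a fixed-size window. This is one interpretation of the description, not the original system: `ToleranceWindow` and `State` are hypothetical names, and counting failures within the last 20 results is an assumption about how “aware of the last 20 test runs” worked.

```python
from collections import deque
from enum import Enum
from typing import Deque


class State(Enum):
    OK = "ok"
    WARNING = "warning"
    ALERT = "alert"


class ToleranceWindow:
    """Classify each run using the outcomes of the most recent runs."""

    def __init__(self, window_size: int = 20, alert_threshold: int = 2) -> None:
        # deque(maxlen=...) silently drops the oldest result when full.
        self._runs: Deque[bool] = deque(maxlen=window_size)
        self._alert_threshold = alert_threshold

    def record(self, passed: bool) -> State:
        self._runs.append(passed)
        failures = sum(1 for ok in self._runs if not ok)
        if failures >= self._alert_threshold:
            return State.ALERT    # repeated failure: likely an intermittent issue
        if failures >= 1:
            return State.WARNING  # a single blip: note it, but do not page
        return State.OK
```

Unlike a plain retry, a single failure is never discarded: it stays in the window and turns the next failure within 20 runs into an alert, which is what surfaces intermittent problems.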

I wanna touch on trust, that is the absolute biggest thing. If you think about it, if you’ve got a test framework and your tests go red quite often and they aren’t reliable, you know exactly what happens: everyone starts ignoring them, nobody pays attention. It’s like, “Oh the tests are red again, who cares, it happens.” With this sort of stuff, if you’re gonna be actively creating events on your production systems and then alerting people with them, you’ve only got one or two chances to do that. If you wake someone up in the middle of the night and it’s total bollocks, everything’s fine, they’re not gonna trust your system and they’re not gonna trust you either, so they are going to ignore your next good idea as well.

So building up trust with this, and I mean we had great success with this: I only let it alert myself for the first three months, which did lead to getting woken up a few times, but you know, if it’s depriving you of sleep you’re super focused the next day, if a little crap at code because you’ve been woken up all night. But we got to the point where I was happy with it and rolled it out to a few other people on the team. I haven’t worked there for a year, but I met up with the product manager from this team a few weeks ago and he said, “It’s still going.” They still trust it. They did a full database migration, replaced MongoDB and switched to Postgres, and looking into that they were like, “Oh, everything’s fine”. Then this went off and they’re like, “Oh yeah cool, still going, still working.” And it was finding stuff, all things that they wouldn’t have found through regular monitoring or would have found very, very slowly.

So yeah that’s basically it, I realize I’ve spoken very quickly, but thank you very much does anyone have any questions?

So the question is: did we see any effect of this on performance? No, not really, not in this one. Not in this particular example. Other times I’ve done synthetic monitoring I definitely have, but not on the primary systems, it was kind of knock on effects in other places. Suddenly we had, you know Sentry, the thing for logging errors and stuff like that, they actually have a rate limit that’s really hard to hit, but it is possible to hit that if you’re doing this. So little things like that around it kinda get in the way, rate limiting in places, that can happen. But again it’s that testability thing, you’ve got to make sure there’s no effect on the user. Even if you are spamming Sentry, you’re annoying yourself, you’re not annoying the user, and the user is the priority.

Okay, so the question just to summarize it was, How do you prioritize what you’re going to do this with? Do you go after every edge case or do you focus on certain things? Is that a good way of summarizing?

So the best way to think of this is to not think of testing. I kept saying tests, it’s really monitoring events, but you’re using your test framework to do it. The main thing is think of it as monitoring. You want to perform the actions that are going to feed your passive monitoring; you’re going to trigger that sort of stuff yourself. Even just like how you’ll check to go to pages and stuff like that. Depending on what you are doing, let’s say you were just, the other example I used, just like hitting an API over and over again. You know, that is cheap. That is easy to do. You could hit all of your end-points with various different kinds of parameters and do that really easily and really cheaply. If you are doing something much more black box, user-based, you know user interaction based testing, that is expensive. That takes time.

In those instances you would be like, “Well, what are you trying to achieve?” Well, you are trying to achieve monitoring, so that would be the focus. Not to get blanket test coverage and stuff like that. That is what pre-production testing is for.

Kim is a writer and editor. She’s pretty excited about the technology that LaunchDarkly is building and sharing that story with the community. Before LaunchDarkly, Kim did marketing for other SaaS companies, like Intuit and Pentaho. She earned her BA in sociology from UC Berkeley.