Testing in production plays a critical role

Effie Mouzeli, a Site Reliability Engineer (SRE) at the Wikimedia Foundation (the nonprofit that hosts Wikipedia), spoke at the November Test in Production Meetup in Berlin, Germany. Effie recounted the steps the organization took to migrate Wikipedia from HHVM (HipHop Virtual Machine) to PHP7 while serving 100,000 requests per second and 17 billion page views per month.

"...no matter how good you are at testing your changes before rolling them out into production, you will never cover all the crazy ways your users find to break [things] in production..."

Watch Effie's full talk to learn how cross-team collaboration, the Wikimedia community, and testing in production made the project a success. If you're interested in joining us at a future Test in Production Meetup, you can sign up here.


Effie Mouzeli: Hello. I'm Effie. I've been a systems admin and [inaudible 00:15:05] engineer for quite a few years, and I used to be completely against testing in production until I started working for the Wikimedia Foundation. But we live, we learn, and we break in production. I'm going to tell you the short version of the story about how we migrated Wikipedia from HHVM to PHP7 without most of you noticing. Some of you did notice and reported issues, and we thank you very much for it. I would also like to make a tiny disclaimer: I was not on this project from the beginning, but rather joined about halfway through. This was led by SRE, but of course it was an effort of both SRE and engineers from different teams across the organization and cross-organization. My colleague Giuseppe, who's not here today, put a lot of time and dedication into making this happen, so it was only fair to put his name up there.

So, I'm an SRE with the Wikimedia Foundation, the nonprofit that hosts and supports Wikipedia and its sister projects. Our systems run open-source software, all the content is managed by volunteers, and we host some really bizarre pages, like lists of people who died on the toilet, or a 10,000-word talk page about aluminium versus aluminum. This is true, I'm not joking. We are running one of the top 10 websites on the internet, which comes with many challenges. That brings us to the question: how do we simulate this volume and kind of traffic in a controlled environment in order to catch issues early on?

This is not cheap, in terms of both hardware and human capital. And being a small nonprofit, with our resources, it doesn't look like this is an option. And even if it were one, we would still have to test in production, because no matter how good you are at testing your changes before rolling them out into production, you will never cover all the crazy ways your users find to break production, or find that something is broken. So, our core software is MediaWiki, which is a 19-year-old Apache, PHP, MySQL application, and around MediaWiki there's a very thick caching layer. In other words, when you're requesting an article, we will check a number of caches to find out if we have already rendered it, and if so, serve it to you. If not, we will have MediaWiki render it for you, then cache it, and serve it back. Moreover, when you're editing an article or uploading media, you're again making requests to MediaWiki. So, it is the most important part of our infrastructure. Of course, we have many microservices supporting it, but this is special to us.
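
The lookup flow described here (check the caches, render on a miss, store the result, serve it) can be sketched in a few lines of Python. This is only a minimal illustration: the function name, the plain dictionaries standing in for cache layers, and the `render` callback are all hypothetical and are not MediaWiki or Wikimedia code.

```python
from typing import Callable, Dict, List


def serve_article(
    title: str,
    caches: List[Dict[str, str]],
    render: Callable[[str], str],
) -> str:
    """Return the page for `title`, checking each cache layer before rendering."""
    # Check every cache layer in order; the first hit wins.
    for cache in caches:
        if title in cache:
            return cache[title]

    # Nothing cached: ask the application (MediaWiki, in the talk) to render.
    html = render(title)

    # Store the freshly rendered page in every layer for later requests.
    for cache in caches:
        cache[title] = html
    return html
```

With two empty dictionaries as cache layers, the first call for a title invokes `render` and every later call for the same title is answered from the first layer.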

We have three types of clusters that run MediaWiki: the app servers, which serve web requests; the API servers, which serve API requests; and the Jobrunners/Videoscalers, which run async jobs like video encoding. Around 2014, we started using HHVM instead of mod_php on Apache. HHVM, which is the HipHop Virtual Machine, is developed by Facebook, and it is to PHP what the JVM is to Java. It converts PHP code to bytecode, and that bytecode is in turn translated into machine code and executed. The reason we switched to HHVM is simply that it performed better back then. Over time we learned to manage HHVM's weaknesses, we were happy with its strengths, and life was good, until September 18, 2017, when Facebook announced that future versions of HHVM would drop PHP and would never be PHP7 compatible. We were running PHP5 at the time, but we would eventually migrate to PHP7.

Now if you're wondering what happened to PHP6, it's the same thing that happened to TCP version 5: no one knows. Lucky for us, early performance tests that we ran back then showed that the improvements in PHP7 itself and the use of PHP-FPM would provide us with equal, or even better, performance for our workload. Now, PHP-FPM is the PHP FastCGI Process Manager. It enables PHP scripts to make better use of server resources without the additional overhead of running inside the web server. PHP-FPM can reuse worker processes instead of having to create and terminate them for every single PHP request. The largest performance improvement comes from enabling opcode caching. With opcode caching, the opcode that results from compiling PHP scripts gets cached in memory.
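
To make the idea of opcode caching concrete, here is a small Python analogy rather than PHP's actual OPcache: Python's built-in `compile()` turns source into a code object, and the sketch keeps that object in memory so a script is compiled only once. The class name and the `compile_count` counter are hypothetical, and a real opcode cache also has to notice when a file changes on disk, which this sketch ignores.

```python
from types import CodeType
from typing import Any, Dict


class OpcodeCache:
    """Keep compiled code objects in memory, keyed by script name."""

    def __init__(self) -> None:
        self._compiled: Dict[str, CodeType] = {}
        self.compile_count = 0  # how many times we actually compiled

    def run(self, name: str, source: str) -> Dict[str, Any]:
        code = self._compiled.get(name)
        if code is None:
            # Cache miss: pay the compilation cost once and remember the result.
            code = compile(source, name, "exec")
            self._compiled[name] = code
            self.compile_count += 1
        # Execute the (possibly cached) code object in a fresh namespace.
        namespace: Dict[str, Any] = {}
        exec(code, namespace)
        return namespace
```

Running the same script name twice executes it twice but compiles it once, which is the saving the talk attributes to opcode caching.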

So, we had a really long and winding road ahead of us to migrate both from PHP5 to PHP7 and from HHVM to PHP-FPM, while serving 100,000 requests per second and 17 billion pageviews per month, so what could go wrong? To be fair, PHP5 code can run on PHP7. There were, of course, some backward-incompatible changes, but since we're SREs, we didn't deal with that. Rather, we rolled out PHP7 to production.

So, our clusters run Apache and PHP. Our first step was to install PHP-FPM on all servers, where HHVM would continue to listen on local [inaudible 00:21:06] 9000, and PHP-FPM would be listening on a Unix socket. So, in order to be able to route traffic to PHP7, we surprisingly used a cookie. If the PHP7 cookie was present, Apache would route the request to PHP7. Obviously, we wanted to push live traffic to PHP7, but keep the blast radius to a minimum at every step of the way. When we are making groundbreaking changes in our production like this, we can count on two things. The first is the WikimediaDebug browser extension for Chrome and Firefox, which is our number one help when testing in production. By using this extension, we are able to route requests to specific debug servers from our browser. But we can also add experimental features, like injecting the PHP7 cookie.
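
The cookie-based routing decision can be sketched as follows. In the talk this logic lived in the Apache configuration; the Python below only illustrates the decision itself. The cookie name `use_php7` and the backend labels are assumptions made for the example, since the talk does not give the real names or addresses.

```python
from http.cookies import SimpleCookie

# Hypothetical names: the talk does not state the cookie name or backend addresses.
PHP7_COOKIE = "use_php7"
HHVM_BACKEND = "hhvm"
PHP7_BACKEND = "php7-fpm"


def choose_backend(cookie_header: str) -> str:
    """Pick a backend based on whether the opt-in cookie is present."""
    cookies = SimpleCookie()
    cookies.load(cookie_header)  # parses "name=value; other=value"
    # Requests carrying the cookie go to PHP7; everything else stays on HHVM.
    return PHP7_BACKEND if PHP7_COOKIE in cookies else HHVM_BACKEND
```

A request with the header `session=abc; use_php7=1` is sent to the PHP7 backend, while a request without that cookie keeps going to HHVM, so the default path is untouched.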

The second thing we can count on is our beta users. They're logged-in users who have enabled all beta features. We are grateful to them because they usually catch issues and report them, and that help is priceless. On the other hand, they are guinea pigs. So our initial testing would get feedback from two sources: people who are using the extension, and beta users. There are a couple more things that we took care of in order to make this experiment work. We configured CI to test everything against PHP7 as well, and we set up observability, which means metrics, graphs, and centralized logging, along with some transitional graphs like how much traffic we're pushing to PHP7.

Now, there was a problem with this approach, though. As I said before, there is a very thick caching layer in front of MediaWiki, and since we are caching URLs, there was a chance that a user with a PHP7 cookie would get a cached page that was rendered by HHVM. So it kind of defeats the purpose. As a result, we needed to tell our caches to cache URLs separately depending on whether the PHP7 cookie was present or not. So PHP7 was serving a few users. Our cache was serving them well. We had observability and everything was in place. We found a few issues, we fixed a few bugs, and we moved to the next phase.
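
One way to picture "caching separately" is to make the presence of the cookie part of the cache key, so that each URL has one cached variant per engine. The sketch below is a minimal Python illustration with hypothetical names; it does not reflect how Wikimedia's caches were actually configured.

```python
from typing import Callable, Dict, Tuple

# The key includes the URL and whether the request carried the PHP7 cookie.
CacheKey = Tuple[str, bool]


def cached_fetch(
    cache: Dict[CacheKey, str],
    url: str,
    has_php7_cookie: bool,
    render: Callable[[str, bool], str],
) -> str:
    """Serve from cache, keeping PHP7-rendered and HHVM-rendered pages apart."""
    key = (url, has_php7_cookie)
    if key not in cache:
        # Each variant is rendered by its own engine and stored under its own key.
        cache[key] = render(url, has_php7_cookie)
    return cache[key]
```

With this key, a user holding the cookie never receives the HHVM-rendered copy of a page, at the cost of storing up to two entries per URL during the transition.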

Now, we needed to start sending anonymous users to PHP7 this time. We added an application variable with which we could control the probability of a user getting the PHP7 cookie without knowing it; the cookie would expire after a week. With that setting we could claim that we were pushing 1%, or 0.1%, of anonymous traffic to PHP7. That was not entirely true, because not all our users, or not all devices, support or accept cookies. As you can imagine, the more traffic we pushed, the more issues would surface, which sometimes would even affect our morale, like PHP Opcache corruption. The thing that was making PHP7 perform better was being corrupted from time to time. Fighting our way around those, we finally made it to about 20% of traffic.
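
The probabilistic opt-in can be sketched like this. The cookie name, the function name, and the exact header format are assumptions for illustration; only the idea of a configurable probability and a one-week expiry comes from the talk.

```python
import random
from typing import Optional

ONE_WEEK_SECONDS = 7 * 24 * 60 * 60  # the cookie expires after a week


def maybe_set_php7_cookie(
    already_has_cookie: bool,
    probability: float,
    rng: random.Random,
) -> Optional[str]:
    """Return a Set-Cookie value for a randomly chosen share of users."""
    if already_has_cookie:
        return None  # the user is already in the experiment
    # rng.random() is uniform in [0, 1), so this is true with the given probability.
    if rng.random() < probability:
        # "use_php7" is a hypothetical cookie name.
        return f"use_php7=1; Max-Age={ONE_WEEK_SECONDS}; Path=/"
    return None
```

Setting `probability` to 0.01 hands the cookie to roughly 1% of cookie-less requests. As noted in the talk, the share of traffic that actually reaches PHP7 will be lower, because some clients never send the cookie back.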

At this point, our goal was to push traffic to the app servers, but also to get ready for the next step, the API servers. The API servers and the application servers share the same Apache configuration. Just as some app server clients don't accept cookies, there were a few API clients that did accept them. So, those clients had the same probability of getting the PHP7 cookie. In order to push more API traffic to PHP7, we would have to start converting API servers to run only PHP7. We are using Puppet to provision our servers, so we introduced the php7_only flag. If a server had this flag enabled, it would get a different Apache configuration, and it would only [inaudible 00:25:35] from using PHP7. Each API server serves about 2% of API traffic, so we had a rough estimate of the amount of traffic we were pushing there. We migrated a few servers, things actually went well here, and we moved on to the Jobrunners.
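
The rough estimate mentioned here is simple arithmetic: the number of converted servers times the share each one serves. The sketch below uses the roughly 2% per server quoted in the talk as a default; the function name is hypothetical and any server count passed in is illustrative rather than a figure from the talk.

```python
def estimated_php7_share(converted_servers: int, share_per_server: float = 0.02) -> float:
    """Estimate the fraction of API traffic served by php7_only servers."""
    if converted_servers < 0:
        raise ValueError("converted_servers must be non-negative")
    # Each converted server contributes its share; the total cannot exceed 100%.
    return min(1.0, converted_servers * share_per_server)
```

Under this assumption, converting five servers would put roughly 10% of API traffic on PHP7.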

Now, we have a few dozen [inaudible 00:25:58] jobs that we send to the Jobrunners, like sending notification emails or updating caches for specific URLs. After switching all jobs in our staging cluster, we moved to production. We had to switch the higher traffic jobs first, the jobs that run all the time, one by one and carefully. After that we continued with the lower traffic ones, and once we were at about 50%, we switched the rest of them in one go. Switching jobs also went surprisingly well. We were stunned as well. At this point, we enabled the php7_only flag on all Jobrunners.
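
The rollout order described above can be sketched as a small planning function: busiest jobs first, one per step, and once a threshold is reached the remainder goes in a single batch. The talk does not say what the "about 50%" was measured against, so this sketch assumes it is the share of job executions; the function name, job names, and rates are all hypothetical.

```python
from typing import Dict, List


def plan_job_rollout(job_rates: Dict[str, float], batch_threshold: float = 0.5) -> List[List[str]]:
    """Return rollout steps: single jobs first, then one final batch."""
    total = sum(job_rates.values())
    # Busiest jobs first, since those were switched first and most carefully.
    ordered = sorted(job_rates, key=lambda name: job_rates[name], reverse=True)

    steps: List[List[str]] = []
    migrated = 0.0
    index = 0
    # Switch jobs one by one until the migrated share reaches the threshold.
    while index < len(ordered) and total > 0 and migrated / total < batch_threshold:
        name = ordered[index]
        steps.append([name])
        migrated += job_rates[name]
        index += 1

    # Everything that is left is switched in one go.
    rest = ordered[index:]
    if rest:
        steps.append(rest)
    return steps
```

For four jobs with rates 40, 30, 20, and 10, the plan is two single-job steps followed by one batch containing the two quietest jobs.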

Now, finishing up. When we had all Jobrunners converted to PHP7, with about 30% of API traffic and 30% of our web traffic being served by PHP7, we were confident enough to continue converting servers faster. And that's what actually happened. Within two days, we took the API servers from 30% to 100%, and did the same for the app servers. And that was it. In September, we got a 15-month project to the finish line within two days, and no servers were hurt in the meantime. Right now, we are of course reverting all the transitional changes that we have made all this time, like the different caches, Puppet features, and HHVM code in Puppet, but it is easier to remove stuff when it is no longer in use anyway.

I omitted getting into every single detail and every single issue we ran into, or the shameful things we had to do in order to proceed with this project, but I tell you, each time we thought we were getting closer, we'd find another roadblock, and another one, and another one after it. It was a really long migration that wouldn't have been possible without the hard cross-team work, without testing in production, and of course without the help of the community. And a friendly message from our sponsors. Thank you very much.