So You Took Down Production - Now What?
A previous version of this article ran on Hacker Noon.
We've all been there. You're pushing to main, something you've done hundreds of times before, and by the time you refresh your production environment, you notice something's wrong. When you take down prod, the moment you notice is already too late. Your name is on the deployment, the issue is live, and the clock is ticking.
So what happens now?
The first thing to do is understand where you've gone wrong. Use git diff to highlight the code that's changed rather than relying on your own eyes to spot it. We can become so accustomed to the way we write our functions that a breaking change is practically invisible. Remember that recent additions like git switch and git restore are there to make your life easier and help you triage while looking for a fix.
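To see why a diff beats eyeballing, here's a minimal sketch. It uses Python's standard difflib module rather than git itself, and the is_eligible function and its two versions are made up for illustration. The point is only that a one-character change, easy to read straight past, shows up as an explicit removed/added pair.

```python
import difflib

# Two hypothetical versions of the same function. The only change is
# ">=" becoming ">", which is easy to miss when skimming.
before = [
    "def is_eligible(user):",
    "    return user.age >= 18 and user.active",
]
after = [
    "def is_eligible(user):",
    "    return user.age > 18 and user.active",
]


def changed_lines(old, new):
    """Return only the removed ('-') and added ('+') lines of a unified diff."""
    diff = list(
        difflib.unified_diff(old, new, fromfile="before", tofile="after", lineterm="")
    )
    # The first two entries are the '--- before' / '+++ after' file headers.
    body = diff[2:]
    return [line for line in body if line.startswith(("+", "-"))]


for line in changed_lines(before, after):
    print(line)
```

In a real repository, git diff does the same job against your actual commits; the sketch just shows what you're asking it to surface.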
Next, you need to accept that it might not just be the code. It might be your code, your environment, your infrastructure, or the combination of all three being stress-tested by particularly unusual behavior. Don't assume that you can handle this on your own. Simply rolling back your change may not work. Contact DevOps.
It's always better to alert the people who need to know about a worst-case scenario if there's even a remote chance it's unfolding.
The most important thing to remember is that this happens to nearly every engineer. No matter your status or tenure, no one is immune to taking down production. Whether you're joining a new engineering team or embarking on a new stage of a product's life cycle, breaking production is rarely a case of if. More often than not, it's a matter of when.
Case in point: earlier this year, HBO Max surprised its 44 million-strong user base with an integration test email that went to everyone. While this is one of the more achingly public examples of a production build going awry, enough honest discussions and code reviews would undoubtedly uncover countless similar incidents.
One of the most reassuring principles of software development is that you can usually count on the task you're trying to accomplish having been performed by someone else. This goes for both the good and the bad. Whatever the dilemma, you can almost guarantee someone out there is experiencing a scar-tingling Harry Potter-esque flashback on your behalf. Here's how not to do it again.
Set the stage for rehearsals
When you're running a race, you want the conditions to be right: not too hot, a manageable incline, and most importantly, an easy-to-follow route. The same goes for your training runs. You ideally want to be stress-testing yourself under conditions that prepare you for the final event. This analogy couldn't ring truer for your development environments. If your change is going to break in production, it needs to break in staging first.
"Staging is where you validate the known-unknowns of your systems."
It's where you get the chance to stress-test your build and see how your dependencies hold up under a realistic amount of pressure. Try to resist the temptation to make things easy for yourself in staging, and ignore the idea that all staging environments need to be identical.
Josh Maletz, who recently served as Director of Engineering at Nordstrom, described how his team maintains a myriad of environments: "We have staging environments for 20-year-old monolithic applications with hard-wired dependencies to vendor test environments. We also have serverless applications with infrastructure-as-code deployed to the cloud where we employ robust service virtualization techniques to exercise our system against simulated erratic behavior." The critical factor is that your staging environment is fit for purpose rather than fitting a uniform specification.
Gather an audience
Access management gets overlooked as a vital component of the process here. Senior developer Victor Guzman noted in his DEV Community series on developer fears that one of his unintentional pushes to production didn't break anything per se.
Still, it did highlight that there was no policy around access management. That gap wouldn't have surfaced without someone initiating the difficult conversation around what went wrong.
To improve, we need to be open to having an honest, blame-free post-mortem. Ultimately, one person shouldn't be able to break production. Sometimes the question isn't whether you should make that change, but whether you should be able to make it.
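As a rough illustration of "one person shouldn't be able to break production," here's a minimal sketch of a deploy gate. Everything in it is an assumption made for the example: the Deployment record, the allowed_deployers set, and the rule that at least one approver must be someone other than the author. Real teams usually enforce this through their source control or CI/CD platform rather than hand-rolled code.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Deployment:
    """Hypothetical record of who wrote a change and who signed off on it."""
    author: str
    approvers: Set[str] = field(default_factory=set)


def can_deploy_to_production(deploy: Deployment, allowed_deployers: Set[str]) -> bool:
    """Allow a deploy only if the author has access AND someone else approved."""
    # Access policy: the author must be on the explicit allow-list.
    if deploy.author not in allowed_deployers:
        return False
    # Two-person rule: self-approval doesn't count.
    independent_approvers = deploy.approvers - {deploy.author}
    return len(independent_approvers) >= 1
```

With this gate, an author approving their own change is rejected, while the same change approved by a teammate goes through. The check is trivial by design; what matters is that the policy exists and is written down.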
It can be far too tempting to hide your work, especially when things go wrong. The immediate urge is often to bury your mistakes and pretend that they never happened. This is great for the ego but terrible for personal growth. As mortifying as it is, we've got to learn to wear our errors like badges of honor. We earn our stripes through understanding what not to do and the most valuable lessons to learn come with a chaser of pain.
But outside involvement isn't only beneficial from a learning perspective; it also enhances the efficacy of your testing procedures. After all, the people writing the tests often want them to pass. As my excellent colleague Heidi Waterhouse notes, your test suite should include the following types of tests in a pre-production environment before you release new features:
- Unit tests to check the smallest unit of code that can be logically isolated.
- Integration tests to test software dependencies and the combination of various software modules.
- Performance tests, including load tests and stress tests, to identify how a system or software behaves when there are spikes in production traffic.
- Regression tests to confirm previous functionality continues to operate as expected.
- Functional tests to confirm the feature performs or functions as expected.
- Lint tests to check your source code for programmatic or stylistic errors.
- Usability tests to validate the ease of use with end-users.
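To make two of the categories above concrete, here's a small sketch using Python's built-in unittest module. The apply_discount function and the "past incident" it guards against are both hypothetical. The unit test checks the smallest isolated piece of logic, while the regression test pins behavior that was once broken so it can't quietly break again.

```python
import unittest


def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical function under test: discount a price given in cents."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents - (price_cents * percent) // 100


class UnitTests(unittest.TestCase):
    def test_ten_percent_off(self):
        # Smallest logically isolated behavior: one input, one expected output.
        self.assertEqual(apply_discount(1000, 10), 900)


class RegressionTests(unittest.TestCase):
    def test_out_of_range_percent_is_rejected(self):
        # Guards against a (hypothetical) earlier bug where a 150% discount
        # produced a negative price instead of an error.
        with self.assertRaises(ValueError):
            apply_discount(1000, 150)
```

Saved to a file, these run with python -m unittest. The other categories (integration, performance, usability, and so on) need real dependencies, traffic, or people, which is exactly why they belong in a pre-production environment.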
While we can strive to create the best possible environments for staging, no traffic synthesizer will truly be able to replicate the stresses of production.
Testing in production is necessary to make sure what you're building will actually hold up when in the wild. Wrapping your change in a feature flag is one of the safest ways to ensure that you've got overall control over what you deploy in production. Once you're confident that your changes won't break and your tests are passing, you can roll out your feature more widely, safe in the knowledge that you'll always have an off-switch.
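Here's a minimal sketch of what that off-switch can look like. It's an in-memory toy, not any particular feature flag product's API: the FeatureFlag class, its fields, and the checkout function are all assumptions made for illustration. A real system would fetch flag state from a flag service so it can change without a redeploy.

```python
import hashlib


class FeatureFlag:
    """Hypothetical in-memory flag: a kill switch plus a percentage rollout."""

    def __init__(self, key: str, enabled: bool = False, rollout_percent: int = 0):
        self.key = key
        self.enabled = enabled                  # the off-switch
        self.rollout_percent = rollout_percent  # 0-100, share of users who see it

    def is_on_for(self, user_id: str) -> bool:
        if not self.enabled:
            return False
        # Hash the flag key and user ID so each user lands in a stable bucket
        # from 0 to 99 and sees the same variation on every request.
        digest = hashlib.sha256(f"{self.key}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < self.rollout_percent


def checkout(user_id: str, flag: FeatureFlag) -> str:
    # Both code paths ship to production; the flag decides which one runs.
    if flag.is_on_for(user_id):
        return "new checkout flow"
    return "old checkout flow"
```

You might start with rollout_percent at a small number, watch your metrics, and raise it gradually. If something goes wrong, setting enabled to False sends every user back to the old path without a rollback or a new deploy.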
If you're interested in stories of testing in production or the challenges that impact your DevOps and SRE teams, then check out our third annual conference, Trajectory, taking place November 9-10, 2021.