10 Days of Errors
10 Days of Errors is brought to you by LaunchDarkly and our partners at Rollbar. 🦇
It's finally the time of year we've all been eagerly dreading: spooky season. Where howling bugs lurk in the darkest corners of your code, and tales of fascinating exploits will shake even the toughest Security team.
Put away your pumpkin spice lattés and plastic skeletons, because we're here to talk about what sends chills down your spine and goes “ERROR: primary database disk full” in the night. Ask not for whom the pager beeps: it beeps for thee.
To start, we've collected ten terrible tales over at Rollbar's blog. But that's just the beginning. We have more blood-curdling bug stories here, and we'll be adding them to this post day by day. If you've suffered some horrifying misadventures of your own, let us know on Twitter: use the hashtag #ErrorHorrorStory. Depending on how spooktacular they are, we'll include them here. 👻
Join us as we spend the next 10 days sharing chilling tales of errors that even Poe himself couldn't bear to write....
Day One: The “Howling at the Moon” error
Let's start with an old but marvelous tale from England. Originally written by someone known only as “Taz”, then forwarded from list to list back in 2002:
The location was a server room, somewhere up on the 4th or 5th floor of an office in Portsmouth (I think it was) by the docks. One day the main Unix DB server fell over. It was rebooted but it happily fell over again and again, so they called out the support company. So, the support gadgie – Mark I think his name was, not that it matters – gets there a few hours later. Leeds to Portsmouth, you see… it’s a long way. He switches on the server and everything works without error. Typical bloody support really, client gets upset over naff all. He goes through the log files and finds pretty much nothing that would make the box fall over. Mark then gets back on the train and heads back up to Leeds after a pointless waste of a day. Later that evening, the server falls over. Exactly the same story… it won’t come back up. Mark goes through all the usual remote support stuff but the client can’t get the server to run. The pattern continues for a few days. Server working, then after about ten hours it falls over and won’t run for another two hours. They checked the cooling, they checked for memory leaks, they checked everything but came up with nowt. Then it all stopped. Then, a week without problems. Everybody was happy… until it started again. The same pattern: 10 hours on, 2-3 hours off. And then somebody (I seem to remember he said that the person had nothing to do with IT) said: “It’s high tide!” Which was met with blank looks and probably a wavering hand over the intercom to Security. “It stops working at high tide.” This, it would seem, is a fairly alien concept to IT support staff, who are not likely to be found studying the tide almanac during the coffee breaks. So they explained that it couldn’t be anything to do with the tide, because it was working for a week. “Last week was Neaps, this week it’s Springs.” Here’s a bit of jargon busting for those of you who don’t have any RYA qualifications. 
The tides run on a lunar cycle, and you get high tide every 12.5 hours as the Earth turns. But as the moon’s orbit changes, so does the difference in high and low tide. When the moon is between us and the sun or on the opposite side of the planet, we get “Springs”. These are the highest highs, and the lowest lows. When there’s a half moon, we get “Neaps”. The difference between high and low is greatly reduced. The lunar cycle is 28 days, so Springs – Neaps – Springs – Neaps. Sure enough, they were right. Two weeks previously, a Navy destroyer or something had moored up nearby. Every time the tide got to a certain height, the crow’s nest was in direct line with the floor that the servers were sitting on. It seems that the radar (or radar jamming, or whatever the military have on their toy dinghies) was playing havoc with the computers.
It may sound like it was easily resolved, but ask yourself: how likely would you or any of your colleagues have been to make the connection? The thought makes the Brits among us want to hide under the duvet with a hot mug of Horlicks. And we'll never look at the moon in quite the same way again.
Day Two: the “Your Bug Possessed My Printer” error
We're used to software crashing, but when a bug makes an incursion into physical reality, it's the next level of scary. Imagine this: one day you're being a Good Computer User and installing updates -- you know, like you're supposed to. In the middle of it all, the device sitting next to you suddenly starts spewing gibberish and malfunctioning as if it was possessed.
The original version of this story comes from Mark Ferlatte, who worked with Yoz Grahame, co-author of this post, at Linden Lab - the company that runs the Second Life virtual world. There are enough bizarre bug stories from Second Life to fill a book, but this one doesn't involve avatars, digital pets or other mind-boggling virtual distortions. It's from the earliest days of the service, back in 2003, and it's a miniature tour through the history of personal computing that's as wide as it is brief, where a set of old and new technologies worked together in exactly the wrong way.
The Second Life desktop app for Windows had a built-in updater that worked in the simplest way possible: it would download an updater program from a pre-set URL on the Second Life website, save it to the user’s hard drive as updater.exe, and run it. One day, for reasons lost in time, requests to that URL returned a 404 Not Found page. Perhaps the server was misconfigured, or the file was deleted… who knows? These things happen to every website. In most situations like this, the user trying to access that page is a human who gets annoyed and does something else, and that’s the end of the story. But when the user is a program, the outcome is less predictable. In this case, the desktop app didn’t check what it was getting. It just downloaded the 404 page, saved it as a program updater.exe and ran it. Or rather, it used a system call to tell Windows to run it. Windows looked at updater.exe and immediately saw that it wasn’t in the standard format for .exe files. But instead of just quitting with an error, this system call decided to try harder. Windows is known for its dedication to legacy support: you can still run programs from back in the MS-DOS days, ones that had the suffix .COM. Unfortunately, this system call didn’t check what it was getting, and ran updater.exe as if it was a .COM program. Binary data is binary data, and can be interpreted in infinite ways. It seems that – for entirely accidental reasons – when the Second Life website’s 404 page is interpreted as a .COM file, it can be read as a set of instructions to open the computer’s LPT device and yell gibberish into it. Which is what it did. Those of you old enough to have used MS-DOS may remember that LPT is the route to the printer. Windows, once again, went the extra mile to help legacy DOS software work. It received the gibberish spewed by the 404-page-run-as-a-program, didn’t check what it was getting (you may be sensing a theme here), and helpfully sent it along to whatever printer it knew about. 
Fortunately, plenty of printers do check what they’re getting! Unfortunately, that still leaves plenty that don’t, especially if they’re cheap inkjet printers from the 90s. And those printers went wild. In Mark’s own words: “… they would freak out, spew paper, and in one case, physically break.”
There are lessons here for us all. If you write software, make sure that it validates its data before attempting to use it. (No, really.) And remember this story the next time that you witness unexplainable behavior in your technology — who knows what ancient spells may have been invoked? 🔮
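To make that lesson concrete, here is a minimal sketch of the kind of check the updater could have run before executing anything. It is not how the Second Life client worked; the function name, the published digest, and the sample bytes are all assumptions for illustration. The one real-world fact it relies on is that DOS and Windows executables begin with the two bytes "MZ".

```python
import hashlib

# "MZ" is the two-byte signature at the start of DOS/Windows executables.
MZ_MAGIC = b"MZ"


def looks_like_updater(status_code: int, body: bytes, expected_sha256: str) -> bool:
    """Return True only if the download passes every check (hypothetical helper)."""
    if status_code != 200:
        # A 404 page still has a response body; never treat it as the file.
        return False
    if not body.startswith(MZ_MAGIC):
        # An HTML error page starts with "<", not the executable signature.
        return False
    # Compare against a digest published out-of-band (assumed to exist).
    return hashlib.sha256(body).hexdigest() == expected_sha256


# Stand-in data: a fake "executable" and the digest we pretend was published.
fake_exe = MZ_MAGIC + b"\x90" * 64
good_digest = hashlib.sha256(fake_exe).hexdigest()

print(looks_like_updater(200, fake_exe, good_digest))
print(looks_like_updater(404, b"<html>Not Found</html>", good_digest))
```

Any one of the three checks would have stopped the 404 page from being run. Stacking them is cheap, and the digest comparison also catches a truncated or tampered download.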
Day Three: the “Like It's Going Out Of Style” error
Etsy is a magical place for home-grown style. Each seller may only have a few products, but with over 3 million of them, selling to 60 million buyers, that's a lot of traffic. Imagine how excited you might be, as a young engineer, to work on such a popular site! Imagine how eagerly you'd make a change that goes live in your very first week - even if it's just a tiny change, like removing some redundant style sheets. Now imagine how you'd feel if that tiny change brought the whole site down…
It was a dark and stormy night, and all was still in the Etsy office… save for an eager developer trying to ship a change on their first week at the company. They were deleting a harmless, unused style sheet — originally needed to support the ancient IE6 — and all the files that included it. Tested out on staging, everything looks good! Tests pass! Time to ship it. But lo and behold, there was a bug! A race condition on deploy meant that some of the updated pages weren’t deployed properly, and the old versions tried to include the now non-existent CSS file, which threw an error. This wouldn’t have been so bad, except the error page also included the aforementioned, not-so-harmless CSS file. So errors begat errors begat errors, resulting in a hard loop on every Apache process on every box. The looping Apache processes locked up the servers so hard that they couldn’t be restarted remotely. Some poor Etsy employee had to go to the datacenter and reboot the servers manually.
Fortunately for the new developer, things turned out fine. Etsy's engineering department has a positive attitude to learning from mistakes. They even hand out an annual award for the most surprising error: a three-armed sweater, knitted (of course) by an Etsy seller. I wonder if that poor developer will wear it for Halloween?
Day Four: the “Payments Not Found” error 🤑
For Throwback Thursday, this bug comes to you from about 20 years ago, told by someone who wishes to remain anonymous. This particular engineer was new at coding, and didn't have a lot of formal training. “Tests are for suckas, my code compiles, so everything works” was the common joke amongst this engineering team. The young engineer followed suit, but still worried about things that might go wrong, especially during testing. Unfortunately, while they were thoughtful about some problems, they forgot to worry about others…
The task was to set up an e-commerce store that sold industrial parts for a mine site: think Grainger catalog, but for civic corporations. The development team came together, split up the tasks and went to work building. Pretty straightforward project: a catalogue of items, authentication and a checkout with credit card. The team finished the project and felt pretty good about it. They shipped it and were proud that the project went so smoothly. Fast forward 18 months after shipping the website, when a bug was discovered. The developer was worried about real charges being incurred while the credit card payments were being tested. So their code checked if the environment was set to dev, in which case, regardless of the amount calculated at checkout, no actual transactions would be sent to the credit card processor. When the project was shipped, that environment flag was never changed to prod, so even though everything looked good on the customer end for adding items to their cart, checking out and receiving the invoice for the correct amount, the credit card was never charged for the said amount. Instead, it consistently charged the customer a whole amount of ZERO dollars. By the time this was discovered, they estimated it was something like 2 million dollars of goods that were shipped and sold, with no payments collected.
Fortunately for the developer, the company website was primarily selling to other internal companies, so the missing money ended up being mostly an accounting issue. However, you can be sure that they not only learned the value of tests, but of testing the production environment too!
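One way to make this class of mistake louder is to refuse to guess the environment at all. The sketch below is our own illustration, not the original team's code: the `APP_ENV` variable name, the function names, and the return strings are all hypothetical, and a real `charge` would call the payment processor.

```python
import os

VALID_ENVIRONMENTS = {"dev", "prod"}


def resolve_environment(environ=os.environ) -> str:
    """Read the hypothetical APP_ENV variable, refusing to guess when it is missing."""
    value = environ.get("APP_ENV")
    if value not in VALID_ENVIRONMENTS:
        raise RuntimeError(
            f"APP_ENV must be one of {sorted(VALID_ENVIRONMENTS)}, got {value!r}"
        )
    return value


def charge(amount_cents: int, environment: str) -> str:
    """Describe what would happen; a real version would call the processor in prod."""
    if environment == "prod":
        return f"charged {amount_cents} cents"
    return f"simulated charge of {amount_cents} cents (no money moved)"


# Explicit dictionaries stand in for the process environment here.
print(charge(1999, resolve_environment({"APP_ENV": "prod"})))
print(charge(1999, resolve_environment({"APP_ENV": "dev"})))
```

Failing on a missing value only covers half the story, because in this tale the flag was explicitly left on dev. A post-deploy check that asserts the production configuration resolves to "prod", plus one real low-value test transaction, would cover the other half.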
Day Five: the “Black Hole Sun” error
This comes from another LaunchDarkly Developer Advocate, Dawn Parzych. She wears many hats, one of them being a solution architect. Let's dive in to see some of the problems we encounter when working with hardware and software!
10 years ago, I was working for F5 Networks as a Solution Architect. Our hardware had been deployed at an ISP to improve web performance for mobile users. Every day at approximately the same time the internet stopped working for about 5 minutes, and system processes were restarting. I was sent on-site for the day to troubleshoot. As the day wore on I realized the quick day trip was turning into an overnight. All diagnostics and troubleshooting showed no problems, but the problem occurred at 3am, so I had to be on-site to see what was going on. A last minute hotel was booked, and I went shopping for a change of clothes. I got a few hours' sleep so I could be on-line at the witching hour to see what was wrong. Sure enough, as expected, the system failed. Within 5 minutes everything was working again. Everything was fine with the hardware. I went sleuthing on the internet, armed with the timestamps of when the failures were occurring. None of the standard searches were resulting in a solution. Finally, I had an epiphany! Eureka! With some more in-depth research, I realized the problem. The outages were aligned with reported solar flares in the region. The flares were knocking out satellite connectivity, taking our systems offline. When the flare passed, the systems came back on-line.
We started the 10 Days of Errors with lunar interference, and now we have solar chaos too. Who needs alien invasions when the celestial bodies themselves have it in for us? We hereby request that the universe just calm down a bit. Meanwhile, we'll be in our bunker. 👽
Day Six: the “Bad Case of the Mondays” error
Friends, we made it through the first five Days of Errors! We hope you had a pleasant weekend, and that you were able to keep the terrifying fragility and unpredictability of software bugs away from your attention for long enough to achieve something resembling relaxation. Or maybe you lost Sunday to helping your kid install twenty-three wildly-conflicting Minecraft mods, followed by a night of fitful dreams filled with Java tracebacks.
Either way, give it up: It's Monday again, when we head back to our usual 9-to-5 battles with enterprise technology. This story from Adam Kalsey seemed appropriate to share. It's set a decade ago, back when he was building a worldwide instant messaging service using the XMPP standard. (If you don't know what XMPP is, congratulations.)
As IMified grew, most Monday mornings we’d have an unexplained crash of our core XMPP server. We could see there was lots of network traffic around 9am PST, but it was encrypted and commercial XMPP servers didn’t give tools to analyze at any level of detail. Solution? We built our own XMPP server to instrument and analyze the traffic. And then we saw the source of the traffic. Presence packets. 9am PST is a magical time: West Coast workers sign on, East Coasters leave for lunch, Europeans leave work for the day, and finally, it’s bedtime in India. For an instant messaging platform, this is the time when the most people change state around the world. So the question remained: why Monday? We got just as many packets other weekdays, but spread over a larger amount of time. Our theory was that people seemed a bit more likely to be on time for work on Mondays. So to help combat this traffic spike, we made our XMPP server fork presence packets to another server that just served a static bit of xml for the XMPP packet. Voila! We were online and available. Always.
Oh, is he done? Sorry, we went into a fugue state at the first mention of timezones. (A feeling familiar to anyone who's had to debug calendaring code.) Mix that up with traffic spikes happening first thing on a Monday morning, and we're feeling a dose of terror that beats a triple-espresso. No way are we falling asleep any time soon.
Thank you for the story, Adam. May your Mondays be calmer. ☕️
Day Seven: the “Bucket of Bugs” error
Software bugs can be scary, but they’re often funny too. If you want to be thoroughly tickled by logic going haywire, go look at video games. As games have become bigger and more complex, that complexity has produced more and sillier bugs. And when most of that complexity comes from trying to model the real world, the breakage can be hilarious.
As an example, let’s look at one of the biggest and best known games of the past decade, Bethesda Softworks’ Skyrim. It’s a giant world, packed with all the usual fantasy nonsense like spells and orcs and dragons and buckets. Yes, fantastical buckets. The terrifying complexity of Skyrim’s logic is so wide-ranging that even objects as basic as a bucket find themselves accidentally imbued with multiple unintended effects.
The most well known bug involving Skyrim’s buckets isn’t so much a bug in the bucket itself as a side effect of Skyrim’s world modeling. It’s also a valuable lesson in the ways that enhanced realism can cause problems when it’s not implemented evenly.
Computer-controlled characters in video games have behavioral logic so they can react appropriately to player actions. But this reactiveness is deliberately limited so that characters don’t react to everything; they should only react to the kinds of events that real people would react to in real life. Imagine a shopkeeper character: it shouldn’t care about the player moving objects around in the house next door, but it should act when the player takes something from the shop without paying. In most games, characters only react to events that occur in their immediate vicinity. More advanced games limit this reactiveness to events that happen in front of the character; this is important in stealth-based games so that players can, for example, sneak quietly behind a guard. For Skyrim, Bethesda took this even further: characters react to events which they can see with their eyes. It means that characters can turn their heads to notice more events. Unfortunately, the Skyrim developers didn’t implement the ability for characters to realise when their head is covered with a bucket, put there by a player who knows it’s the easiest way to steal things from shopkeepers:
The other flaw is less famous, but far more suited to a magical world: a specific kind of heavy bucket can make you fly.
To enjoy this bug: find a heavy bucket. Stand on it, in exactly the right place. Then grab the bucket. Now you’re up, up and away! We couldn’t find a definitive explanation. One speedrunner suggested that it’s caused by the game’s faulty implementation of Newton’s third law of motion. We find that unlikely, but the outcome of this bug is so bizarre that we can’t rule it out entirely.
If you want to enjoy these bugs yourself, we’re glad to tell you that they and many others still work even in the latest versions of Skyrim (though probably not in the Alexa version). To experience more hilarious video game bugs, we recommend almost any major video game, especially those with development budgets above $50 million. If you want a different kind of challenge, try reading the patch notes for The Sims out loud while keeping a straight face.
Day Eight: the “Save the Skating Horses” error
Now that we’ve wandered into the many worlds of video games and their hilarious bugs, Yoz insists that we return to Second Life, which we touched on last week. (Yes, it’s hard to locate “last week” in 2020, that’s why we put the link there.)
Unlike most video games, even the big multiplayer ones, Second Life is a giant open virtual world where almost all of the content is created, programmed and sold by its users. Much of that content is surreal enough when it’s functioning correctly. Throw a few bugs into the mix and you get utterly bizarre situations like this one, which happened while Yoz worked at Linden Lab, the company that runs Second Life:
One of the most valuable product categories in the Second Life (SL) economy is virtual pets. There’s enough profit in selling digital animals and their supplies to sustain several small real-world businesses. Take, for example, the tragic story of Ozimals, a company selling virtual rabbits which met with a messy legal end. Among the other kinds of virtual pets in SL were Arabian horses. Just like the bunnies, you had to buy them food. (The food is where most of the revenue lives in the virtual pets business. It’s just the Xerox model made weirder.) You’d put the food out, they’d find it and eat it, and some time later they’d want more. The “find it” part of the previous sentence is key to this story. For the horse to get to the food, they couldn’t just move in a straight line as there might be obstacles in the way. They had algorithms for finding a route around those obstacles and propelling themselves to the food. In video game terminology, this kind of algorithm is known as pathfinding. The SL developer platform now has built-in pathfinding functions which any developer can use, but they were added after this story took place. Back then, the horses used custom pathfinding code written by the horse creator. And, like most code written in SL’s haphazardly-designed scripting language, it wasn’t very good. As mentioned earlier, almost all the content in Second Life is created by its users. This includes tens of thousands, maybe hundreds of thousands of unique scripts, embedded in hundreds of millions of virtual objects. These scripts are executed by the underlying platform code, maintained by Linden Lab, and updated constantly. When there are new releases of the SL platform, it isn’t fully tested with all the scripts that content creators write. There are far, far too many to do that. One release had a tiny tweak to the physics engine, related to friction of objects moving on the ground. You may guess what’s coming next. 
The horse pathfinding logic was using the old friction rules. As soon as the SL region code updated, horses started skating past their food. In some cases, horses living on high-altitude platforms started falling off them. (I imagine them whinnying as they pirouetted into space.) This all seems really comical until you realise how many people owned these horses, and how much they’d spent on them. Within a couple of hours of the code going live, staff realised that user possessions worth tens of thousands of US dollars were being destroyed by a bug. (Yes, US dollars. Not Linden dollars. Just in virtual horses. You have no idea how much the SL economy is still worth.) Once the size of the problem was realised, the SL platform release was rolled back. But that still took several hours over the thousands of servers. (No feature flags, see.) So, during the rollback, several dedicated QA engineers stayed up much of the night, saving virtual horses from starving to death.
“This is why virtual world bugs fascinate me,” continued Yoz. “Some of them are just amazing.” Then he went on to tell us about a Linden Lab developer who accidentally raised SL’s sea level by 200 metres, which makes our current climate change nightmares look tame. That’s enough for now, thanks. We’ll stick with the horses.
Day Nine: the “Who Put That Code There?” error
We have only a couple of days left, and the last two stories have been more silly than scary. It’s time to change the tone to something more suitable to Halloween, and see if we can raise the hairs on the back of your neck.
For the past eight days, we’ve only shared true stories. Well, we’re fairly certain that they’re true. Admittedly, some of them took several hops before they reached us. None of us knows “Taz”, who provided the story we shared on day one. There’s a chain of trust involved, just like there is with software.
That brings us to today’s story, Coding Machines, which comes from Lawrence Kesteloot. It’s far from the longest story ever written about a software defect (take Ellen Ullman’s magnificent The Bug, which we highly recommend) but it’s too long to include here. This evening or weekend, set aside half an hour for this deliciously spooky tale and give that spine a good chilling. You can read the whole thing online here, or buy it as an Amazon Kindle book.
And as for its veracity… we think it’s probably fiction. Probably.
Day Ten: the “$400 million down the drain” error
We want to end with a bug that’s huge. It’s gotta be scary. We want to destroy what’s left of your sleeping habits with wide-eyed terror. How better to do that than with a tale that checks those boxes so well that it’s known in software history as Knightmare? It’s a relatively well known story in financial circles, but given that feature flags play a starring role, we want to provide our own perspective. Because feature flags are what we do. Plus, we get to revisit concepts seen in earlier stories: “developer mode”, insufficient testing, partial failed deploys, and somehow, virtual environments.
The story concerns Knight Capital, the company that pioneered algorithmic stock trading. They were big and rich and dirty with a machine that printed money. Then, in 2012, that machine turned on its creators and ate them.
In under 30 minutes, the “Knightmare” bug guzzled over $400 million. It didn’t even chew.
That’s the punchline. The story’s better.
Back in the mid-90s, two middle-aged finance executives, Kenneth Pasternak and Walter Raquet, saw NASDAQ going digital and said: what if the traders were digital too? What if it was just computers? They could do thousands of trades a minute. It was 1995 when they founded Knight Trading Group (later renamed to Knight Capital Group) with a bunch of smart engineers and immediately started making ridiculous amounts of money. Within three years they had gone public, their IPO (Initial Public Offering) raising $145 million on top of a market capitalization of $725 million. It took only 18 months for their market capitalization to grow over ten times as big. However, those who remember the period know what came next: the dot-com bubble burst, and the Knight Capital rollercoaster ride went screaming downwards. On top of the losses, Knight was caught breaking many stock market regulations and doing naughty things like intercepting customer trades so that the firm could profit off them first. As Knight floundered, the founders who oversaw these problems were replaced. Thomas Joyce, an experienced industry leader, took over as CEO and returned the firm to financial success. By 2011 they had a net income of $115 million and over 100 software developers working on Knight’s future: an overhauled, optimised version of its trading systems. The new technology was known as Smart Market Access Routing System (SMARS). When launched, it accelerated Knight Capital’s transactions to thousands of trades per second. In 2012, the New York Stock Exchange (NYSE) announced the Retail Liquidity Program (RLP), a new private market for stock trading. The RLP was highly controversial because it would allow big market players to trade for fractions of pennies above or below the displayed prices. For big players, those fractions of pennies would rapidly add up to millions of dollars in profits. Knight Capital, being a huge player, had to be in the RLP. 
But first, they had to tweak SMARS to work with the new market. Thanks to the controversy around it, the SEC had taken a while to approve the RLP. When the approval was announced in June 2012, the NYSE followed up with a rapidly approaching launch date: August 1st, 2012. The trading firms had just over a month to prepare. Knight’s engineers scrambled to make SMARS ready for the RLP. It needed a set of new code to handle the RLP’s rules, which were different from the rest of the NYSE markets. For SMARS to know whether to use the standard NYSE rules or the new RLP rules, they repurposed an existing feature flag that was previously used for some old testing routines. If this flag was enabled, SMARS would know that it was trading on the RLP.
Okay, hold up a moment. It’s always dangerous to take an existing feature flag and change its purpose. Excuse us if this sounds like advertising, but this is an important lesson in software safety: if it’s easier to repurpose an existing flag than to make a new one, then you need a better feature management system. (If you’re looking, we have a really good one right here. You can make as many new flags as you like without adding any infrastructure or configuration. Okay, advertisement over.)
So, these “old testing routines”… what did they do, exactly?
Back when Knight Capital was creating its initial trading algorithms – the ones that gave it such a massive lead in the market – it wanted to test them in a kind of virtual stock exchange; like a video game version of the stock market. But Knight’s own trading algorithms wouldn’t be enough to simulate market activity, because they also wanted to test how the algorithms would fare when the market behaves weirdly. Their virtual market needed other algorithms that would do irrational things. For example, they needed a market player that would buy high and sell low, which is exactly the opposite of what a sensible player would do. They created this exact algorithm and called it Power Peg.
At this point, those familiar with both software engineering and horror stories might guess the next question: was this repurposed feature flag – the one now triggering the RLP rules – previously used to trigger Power Peg?
So… when the RLP code was linked to the feature flag, did the engineers perform the vital safety step of deleting the Power Peg code?
They did. However, this deletion happened at the same time as the RLP code was added. Power Peg hadn’t been used since 2003, yet had lingered in the code ever since.
This is why LaunchDarkly provides the Code References feature: by telling you exactly where a given flag is used in your codebase, it makes it much easier to eliminate unused, potentially dangerous code. Did they at least test what would happen if Power Peg was accidentally enabled?
But in this case, the Power Peg code was eliminated in the new version of SMARS. So as long as the previous version was completely removed from the production servers, there would be no danger of accidentally triggering Power Peg. The new SMARS code was completed a week before RLP’s launch. Knight deployed the new version of SMARS to its servers. However, Knight’s infrastructure team was still doing deployments manually. The engineer responsible for the deployment had to install the new code across eight servers. Unfortunately, this engineer missed one of the servers, and it still contained the old code. Knight’s infrastructure team had no standard process for verifying that code deployments were correct, so nobody noticed.
Yep. At 8:00am on August 1st, SMARS switched on to handle pre-market orders, and immediately started sending automated warning emails to Knight’s engineers. These emails, containing the error message “Power Peg disabled”, should have been enough to warn staff that something was going wrong. But few engineers noticed the emails, and none of them realised the danger.
Email’s a terrible place for urgent alerts to go. (This is why Rollbar provides notification tools such as integrations with alert systems, special alerts for high occurrence rates, etc.)
At 9:30am, the new RLP market opened for business. Power Peg, still present on one production server, flew into action and started spending the firm’s money in the stupidest way possible, as fast as possible. Within a couple of minutes it had performed more trades on certain stocks than they would usually see in a month, and the total volume of trading was 12% higher than usual. At 9:34am, NYSE engineers investigating the trading spike traced it back to Knight Capital. Knight’s own engineers hadn’t noticed the extent of what was happening; they could see the higher trading volume, but didn’t realise that SMARS was now shoving Knight Capital’s market position into serious debt. Somehow, Power Peg’s activity evaded all their internal alarms other than the aforementioned warning emails, which had still gone unnoticed. By 9:45am, Knight’s engineers had looked at the problem, conferred, and leaped into action.
They flipped the SMARS kill switch, right?
Nope. SMARS, for some reason, didn’t have a kill switch.
But… kill switches are vital! And feature flags make them so easy! So what did they do instead?
The engineers believed that the dangerous trading was due to the new RLP code. They decided to immediately roll SMARS back to the previous version.
Oh no. No no no nooo.
Yep. Now, all eight servers had Power Peg enabled, churning away, trading as dangerously as possible. It wasn’t until 9:58am that Knight’s engineers finally traced the problem to Power Peg and shut it down. But in that time, it had purchased $7 billion worth of shares. By the end of the day, Knight had frantically reduced it to $4.6 billion, but it was still far, far more than Knight were allowed to hold, because they didn’t have the cash to cover it. In the end, Knight Capital made a loss of about $440 million on Power Peg’s trades. The company immediately lost most of its biggest clients, its reputation destroyed. Knight was acquired a few months later by a rival trader for less than a third of its previous share price.
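For anyone wondering what the missing kill switch might have looked like, here is a deliberately tiny sketch. It is purely illustrative and has nothing to do with how SMARS was built: the `KillSwitch` class, the `send_orders` function, and the order names are all made up. The idea is simply that every order checks one shared flag before it goes out.

```python
import threading


class KillSwitch:
    """A thread-safe on/off flag that order-sending code checks before every order."""

    def __init__(self) -> None:
        self._tripped = threading.Event()

    def trip(self) -> None:
        self._tripped.set()

    def is_tripped(self) -> bool:
        return self._tripped.is_set()


def send_orders(orders, kill_switch, sent):
    """Append orders to `sent` until the switch is tripped; return how many were skipped."""
    skipped = 0
    for order in orders:
        if kill_switch.is_tripped():
            skipped += 1
            continue
        sent.append(order)
    return skipped


switch = KillSwitch()
sent = []
skipped_before = send_orders(["order-1", "order-2"], switch, sent)
switch.trip()  # an operator flips the switch
skipped_after = send_orders(["order-3"], switch, sent)
print(sent, skipped_before, skipped_after)
```

In a real system the flag would live somewhere every server can read it, such as a feature flag service, so that flipping it once stops trading everywhere without a redeploy or a rollback.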
This story has so many different lessons for engineering teams, it’s hard to count them all. We’ve mainly focused on the ones to do with feature flags and notifications, but there were many others, as listed in the SEC report on the incident.
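One of those other lessons is verifying that a deploy actually reached every server. As a rough sketch (the server names, version strings, and the `find_stale_servers` helper are hypothetical, and how each server reports its version is left out), the check can be as small as comparing what each machine says it is running against what was meant to ship:

```python
def find_stale_servers(expected_version: str, reported: dict) -> list:
    """Return the servers whose reported version differs from the expected one."""
    return sorted(
        server for server, version in reported.items() if version != expected_version
    )


# Hypothetical fleet: eight servers, one missed during a manual deploy.
reported_versions = {f"server-{n}": "2.0" for n in range(1, 9)}
reported_versions["server-8"] = "1.9"

stale = find_stale_servers("2.0", reported_versions)
if stale:
    print("Deploy incomplete, stale servers:", stale)
```

Run automatically after every deploy, and set to block the release when the list is not empty, a check like this turns "nobody noticed" into a failed pipeline step.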
We hope you’ve enjoyed our 10 Days of Errors. More importantly, we hope we’ve instilled in you the appropriate amount of fear of software disasters. LaunchDarkly and Rollbar won’t prevent you from bringing bugs into this world; the choice to open that demonic portal known as a “code editor” is up to you. However, we can give you far better tools to battle whatever fiends emerge. You can use alerts to prepare for their arrival, send them back to the darkness with the flick of a kill switch, and be all cleaned up well before bedtime. Sweet dreams!
This post was co-authored by Ramon Niebla, Software Engineer at Rollbar.