10 Days of Errors
10 Days of Errors is brought to you by LaunchDarkly and our partners at Rollbar.
It's finally the time of year we've all been eagerly dreading: spooky season. Where howling bugs lurk in the darkest corners of your code, and tales of fascinating exploits will shake even the toughest Security team.
Put away your pumpkin spice lattés and plastic skeletons, because we're here to talk about what sends chills down your spine and goes “ERROR: primary database disk full” in the night. Ask not for whom the pager beeps: it beeps for thee.
To start, we've collected ten terrible tales over at Rollbar's blog. But that's just the beginning. We have more blood-curdling bug stories here, and we'll be adding them to this post day by day. If you've suffered some horrifying misadventures of your own, let us know on Twitter: use the hashtag #ErrorHorrorStory. Depending on how spooktacular they are, we'll include them here. 👻
Join us as we spend the next 10 days sharing chilling tales of errors that even Poe himself couldn't bear to write....
Day One: the “Howling at the Moon” error
Let's start with an old but marvelous tale from England. Originally written by someone known only as “Taz”, then forwarded from list to list back in 2002:
The location was a server room, somewhere up on the 4th or 5th floor of an office in Portsmouth (I think it was) by the docks. One day the main Unix DB server fell over. It was rebooted but it happily fell over again and again, so they called out the support company.
So, the support gadgie - Mark I think his name was, not that it matters - gets there a few hours later. Leeds to Portsmouth, you see... it's a long way. He switches on the server and everything works without error. Typical bloody support really, client gets upset over naff all. He goes through the log files and finds pretty much nothing that would make the box fall over. Mark then gets back on the train and heads back up to Leeds after a pointless waste of a day.
Later that evening, the server falls over. Exactly the same story... it won't come back up. Mark goes through all the usual remote support stuff but the client can't get the server to run.
The pattern continues for a few days. Server working, then after about ten hours it falls over and won't run for another two hours. They checked the cooling, they checked for memory leaks, they checked everything but came up with nowt. Then it all stopped.
Then, a week without problems. Everybody was happy... until it started again. The same pattern: 10 hours on, 2-3 hours off.
And then somebody (I seem to remember he said that the person had nothing to do with IT) said:
"It's high tide!"
Which was met with blank looks and probably a wavering hand over the intercom to Security.
"It stops working at high tide."
This, it would seem, is a fairly alien concept to IT support staff, who are not likely to be found studying the tide almanac during the coffee breaks. So they explained that it couldn't be anything to do with the tide, because it was working for a week.
"Last week was Neaps, this week it's Springs."
Here's a bit of jargon busting for those of you who don't have any RYA qualifications. The tides run on a lunar cycle, and you get high tide every 12.5 hours as the Earth turns. But as the moon's orbit changes, so does the difference in high and low tide. When the moon is between us and the sun or on the opposite side of the planet, we get “Springs”. These are the highest highs, and the lowest lows. When there's a half moon, we get “Neaps”. The difference between high and low is greatly reduced. The lunar cycle is 28 days, so Springs - Neaps - Springs - Neaps.
Sure enough, they were right. Two weeks previously, a Navy destroyer or something had moored up nearby. Every time the tide got to a certain height, the crow's nest was in direct line with the floor that the servers were sitting on. It seems that the radar (or radar jamming, or whatever the military have on their toy dinghies) was playing havoc with the computers.
It may sound like it was easily resolved, but ask yourself: how likely would you or any of your colleagues have been to make the connection? The thought makes the Brits among us want to hide under the duvet with a hot mug of Horlicks. And we'll never look at the moon in quite the same way again.
Day Two: the “Your Bug Possessed My Printer” error
We're used to software crashing, but when a bug makes an incursion into physical reality, it's the next level of scary. Imagine this: one day you're being a Good Computer User and installing updates -- you know, like you're supposed to. In the middle of it all, the device sitting next to you suddenly starts spewing gibberish and malfunctioning as if it was possessed.
The original version of this story comes from Mark Ferlatte, who worked with Yoz Grahame, co-author of this post, at Linden Lab - the company that runs the Second Life virtual world. There are enough bizarre bug stories from Second Life to fill a book, but this one doesn't involve avatars, digital pets or other mind-boggling virtual distortions. It's from the earliest days of the service, back in 2003, and it's a miniature tour through the history of personal computing that's as wide as it is brief, where a set of old and new technologies worked together in exactly the wrong way.
The Second Life desktop app for Windows had a built-in updater that worked in the simplest way possible: it would download an updater program from a pre-set URL on the Second Life website, save it to the user's hard drive as updater.exe, and run it.
One day, for reasons lost in time, requests to that URL returned a 404 Not Found page. Perhaps the server was misconfigured, or the file was deleted… who knows? These things happen to every website. In most situations like this, the user trying to access that page is a human who gets annoyed and does something else, and that's the end of the story. But when the user is a program, the outcome is less predictable.
In this case, the desktop app didn't check what it was getting. It just downloaded the 404 page, saved it as a program updater.exe and ran it. Or rather, it used a system call to tell Windows to run it.
Windows looked at updater.exe and immediately saw that it wasn't in the standard format for .exe files. But instead of just quitting with an error, this system call decided to try harder. Windows is known for its dedication to legacy support: you can still run programs from back in the MS-DOS days, ones that had the suffix .COM. Unfortunately, this system call didn't check what it was getting, and ran updater.exe as if it was a .COM program.
Binary data is binary data, and can be interpreted in infinite ways. It seems that - for entirely accidental reasons - when the Second Life website's 404 page is interpreted as a .COM file, it can be read as a set of instructions to open the computer's LPT device and yell gibberish into it. Which is what it did.
Those of you old enough to have used MS-DOS may remember that LPT is the route to the printer. Windows, once again, went the extra mile to help legacy DOS software work. It received the gibberish spewed by the 404-page-run-as-a-program, didn't check what it was getting (you may be sensing a theme here), and helpfully sent it along to whatever printer it knew about.
Fortunately, plenty of printers do check what they're getting! Unfortunately, that still leaves plenty that don't, especially if they're cheap inkjet printers from the 90s. And those printers went wild. In Mark's own words: “... they would freak out, spew paper, and in one case, physically break.”
There are lessons here for us all. If you write software, make sure that it validates its data before attempting to use it. (No, really.) And remember this story the next time that you witness unexplainable behavior in your technology — who knows what ancient spells may have been invoked? 🔮
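For the curious, here is a rough Python sketch of the kind of check that would have spared those printers. It is an illustration only, not Second Life's actual updater: the function name `rejection_reason` is made up for this post. The one fact it leans on is that Windows executables start with the two-byte "MZ" signature, which an HTML error page does not.

```python
from typing import Optional

# Windows (and old DOS) .exe files begin with the two ASCII bytes "MZ".
MZ_MAGIC = b"MZ"


def rejection_reason(status: int, body: bytes) -> Optional[str]:
    """Return why a downloaded updater should be rejected, or None if it passes.

    Hypothetical helper: `status` is the HTTP status code of the download
    and `body` is the raw response body.
    """
    if status != 200:
        return f"unexpected HTTP status {status}"
    if not body.startswith(MZ_MAGIC):
        return "body does not start with the MZ executable signature"
    return None


# A 404 page is rejected before it ever touches the disk.
print(rejection_reason(404, b"<html>Not Found</html>"))
```

Passing this check only proves the download is not obviously the wrong kind of file. A real updater would also want to verify a checksum or signature before running anything.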
Day Three: the “Like It's Going Out Of Style” error
Etsy is a magical place for home-grown style. Each seller may only have a few products, but with over 3 million of them, selling to 60 million buyers, that's a lot of traffic. Imagine how excited you might be, as a young engineer, to work on such a popular site! Imagine how eagerly you'd make a change that goes live in your very first week - even if it's just a tiny change, like removing some redundant style sheets. Now imagine how you'd feel if that tiny change brought the whole site down…
It was a dark and stormy night, and all was still in the Etsy office… save for an eager developer trying to ship a change on their first week at the company. They were deleting a harmless, unused style sheet (originally needed to support the ancient IE6) and all the files that included it. Tested out on staging, everything looks good! Tests pass! Time to ship it.
But lo and behold, there was a bug! A race condition on deploy meant that some of the updated pages weren't deployed properly, and the old versions tried to include the now non-existent CSS file, which threw an error. This wouldn't have been so bad, except the error page also included the aforementioned, not-so-harmless CSS file. So errors begat errors begat errors, resulting in a hard loop on every Apache process on every box.
The looping Apache processes locked up the servers so hard that they couldn't be restarted remotely. Some poor Etsy employee had to go to the datacenter and reboot the servers manually.
Fortunately for the new developer, things turned out fine. Etsy's engineering department has a positive attitude to learning from mistakes. They even hand out an annual award for the most surprising error: a three-armed sweater, knitted (of course) by an Etsy seller. I wonder if that poor developer will wear it for Halloween?
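If you'd like to see the shape of that loop, here is a toy Python sketch. It is emphatically not Etsy's code: every name in it (`ASSETS`, `include`, `render_error_page`, `STATIC_FALLBACK`) is invented for illustration. The point is that an error page sharing a dependency with the page that just failed will fail too, so the last line of defense should depend on nothing at all.

```python
class MissingAsset(Exception):
    """Raised when a page includes a file that no longer exists."""


# Hypothetical asset store: the old IE6 style sheet has already been deleted.
ASSETS = {"site.css": "body { margin: 0; }"}

# Last-resort response that includes no assets, so it cannot fail the same way.
STATIC_FALLBACK = "<h1>Something went wrong</h1>"


def include(name: str) -> str:
    if name not in ASSETS:
        raise MissingAsset(name)
    return ASSETS[name]


def render_stale_page() -> str:
    # An old page version that still includes the deleted style sheet.
    return include("ie6.css") + "<h1>Shop</h1>"


def render_error_page() -> str:
    # The error page includes the same deleted style sheet.
    return include("ie6.css") + "<h1>Error</h1>"


def handle_request(render) -> str:
    try:
        return render()
    except MissingAsset:
        try:
            return render_error_page()
        except MissingAsset:
            # Stop here instead of trying to render yet another error page.
            return STATIC_FALLBACK


print(handle_request(render_stale_page))
```

Without the inner `except`, the failure while rendering the error page would simply produce another error to handle, which is the begetting described above.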
Day Four: the “Payments Not Found” error 🤑
For Throwback Thursday, this bug comes to you from about 20 years ago, told by someone who wishes to remain anonymous. This particular engineer was new at coding, and didn't have a lot of formal training. “Tests are for suckas, my code compiles, so everything works” was the common joke amongst this engineering team. The young engineer followed suit, but still worried about things that might go wrong, especially during testing. Unfortunately, while they were thoughtful about some problems, they forgot to worry about others…
The task was to set up an e-commerce store that sold industrial parts for a mine site: think Grainger catalog, but for civic corporations. The development team came together, split up the tasks and went to work building. It was a pretty straightforward project: a catalogue of items, authentication and a checkout with credit card. The team finished the project and felt pretty good about it. They shipped it and were proud that the project went so smoothly.
Fast forward 18 months after shipping the website, when a bug was discovered.
The developer was worried about real charges being incurred while the credit card payments were being tested. So their code checked whether an environment flag was set to its test value, in which case, regardless of the amount calculated at checkout, no actual transactions would be sent to the credit card processor. When the project was shipped, that environment flag was never changed to its production value. So even though everything looked good on the customer end (adding items to the cart, checking out and receiving an invoice for the correct amount), the credit card was never charged that amount. Instead, it consistently charged the customer a whole amount of ZERO dollars.
By the time this was discovered, they estimated it was something like 2 million dollars of goods that were shipped and sold, with no payments collected.
Fortunately for the developer, the company website was primarily selling to other internal companies, so the missing money ended up being mostly an accounting issue. However, you can be sure that they not only learned the value of tests, but of testing the production environment too!
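As a hedged illustration of how a flag like this can be made harder to forget, here is a small Python sketch. The original flag's name and values aren't recorded in the story, so `"production"`, `"test"`, `should_charge_card` and `amount_to_charge` are all assumptions made up for this example. The idea is that the code refuses to guess: any value it doesn't recognize is an error rather than a silent free checkout.

```python
class PaymentConfigError(Exception):
    """Raised when the payment environment is missing or unrecognized."""


def should_charge_card(environment: str) -> bool:
    """Decide whether real charges are sent to the card processor.

    The values "production" and "test" are hypothetical, not the ones
    used in the original system.
    """
    if environment == "production":
        return True
    if environment == "test":
        return False
    # Fail loudly rather than quietly skipping the charge.
    raise PaymentConfigError(f"unknown payment environment: {environment!r}")


def amount_to_charge(environment: str, total_cents: int) -> int:
    """Return the amount that would actually be sent to the processor."""
    return total_cents if should_charge_card(environment) else 0


print(amount_to_charge("production", 1999))
print(amount_to_charge("test", 1999))
```

This wouldn't have caught a flag that was deliberately left on its test value, which is why a check against the live site, such as confirming that a real order produces a real charge, still matters.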
Day Five: the “Black Hole Sun” error
This comes from another LaunchDarkly Developer Advocate, Dawn Parzych. She wears many hats, one of them being a solution architect. Let's dive in to see some of the problems we encounter when working with hardware and software!
10 years ago, I was working for F5 Networks as a Solution Architect. Our hardware had been deployed at an ISP to improve web performance for mobile users. Every day at approximately the same time the internet stopped working for about 5 minutes, and system processes were restarting. I was sent on-site for the day to troubleshoot. As the day wore on I realized the quick day trip was turning into an overnight. All diagnostic and troubleshooting was showing no problems, the problem occurred at 3am. I had to be on-site to see what was going on.
A last minute hotel was booked, and I went shopping for a change of clothes. I got a few hours sleep so I could be on-line at the witching hour to see what was wrong. Sure enough as expected, the system failed. Within 5 minutes everything was working again. Everything was fine with the hardware. I went sleuthing on the internet, armed with the timestamps of when the failures were occurring. None of the standard searches were resulting in a solution.
Finally, I had an epiphany! Eureka! With some more in-depth research, I realized the problem. The outages were aligned with reported solar flares in the region. The flares were knocking out satellite connectivity taking our systems offline. When the flare passed the systems came back on-line.
We started the 10 Days of Errors with lunar interference, and now we have solar chaos too. Who needs alien invasions when the celestial bodies themselves have it in for us? We hereby request that the universe just calm down a bit. Meanwhile, we'll be in our bunker. 👽
Day Six: the “Bad Case of the Mondays” error
Friends, we made it through the first five Days of Errors! We hope you had a pleasant weekend, and that you were able to keep the terrifying fragility and unpredictability of software bugs away from your attention for long enough to achieve something resembling relaxation. Or maybe you lost Sunday to helping your kid install twenty-three wildly-conflicting Minecraft mods, followed by a night of fitful dreams filled with Java tracebacks.
Either way, give it up: It's Monday again, when we head back to our usual 9-to-5 battles with enterprise technology. This story from Adam Kalsey seemed appropriate to share. It's set a decade ago, back when he was building a worldwide instant messaging service using the XMPP standard. (If you don't know what XMPP is, congratulations.)
As IMified grew, most Monday mornings we'd have an unexplained crash of our core XMPP server. We could see there was lots of network traffic around 9am PST, but it was encrypted and commercial XMPP servers didn't give tools to analyze at any level of detail.
Solution? We built our own XMPP server to instrument and analyze the traffic. And then we saw the source of the traffic. Presence packets.
9am PST is a magical time: West Coast workers sign on, East Coasters leave for lunch, Europeans leave work for the day, and finally, it's bedtime in India. For an instant messaging platform, this is the time when the most people change state around the world.
So the question remained: why Monday? We got just as many packets other weekdays, but spread over a larger amount of time. Our theory was that people seemed a bit more likely to be on time for work on Mondays. So to help combat this traffic spike, we made our XMPP server fork presence packets to another server that just served a static bit of xml for the XMPP packet. Voila! We were online and available. Always.
Oh, is he done? Sorry, we went into a fugue state at the first mention of timezones. (A feeling familiar to anyone who's had to debug calendaring code.) Mix that up with traffic spikes happening first thing on a Monday morning, and we're feeling a dose of terror that beats a triple-espresso. No way are we falling asleep any time soon.
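For anyone brave enough to look directly at the timezones, here is a small Python sketch of what 9am PST looks like around the world. It assumes fixed standard-time offsets and ignores daylight saving entirely; the region labels, the `local_times` helper and the particular Monday are our own arbitrary choices, not anything from Adam's system.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed standard-time UTC offsets; daylight saving is ignored.
OFFSETS = {
    "US West Coast (PST)": timedelta(hours=-8),
    "US East Coast (EST)": timedelta(hours=-5),
    "Western Europe (CET)": timedelta(hours=1),
    "India (IST)": timedelta(hours=5, minutes=30),
}


def local_times(moment: datetime) -> dict:
    """Map each region label to the local wall-clock time at `moment`."""
    return {
        label: moment.astimezone(timezone(offset)).strftime("%H:%M")
        for label, offset in OFFSETS.items()
    }


# An arbitrary winter Monday at 9am PST.
monday_9am_pst = datetime(2019, 1, 7, 9, 0, tzinfo=timezone(timedelta(hours=-8)))

for label, clock in local_times(monday_9am_pst).items():
    print(f"{label}: {clock}")
```

Under those assumptions the same instant is 09:00 on the West Coast, 12:00 on the East Coast, 18:00 in Western Europe and 22:30 in India: four groups of people changing presence state at once.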
Thank you for the story, Adam. May your Mondays be calmer. ☕️
Join us tomorrow for Day Seven's 'orrible error!
This post was co-authored by Ramon Niebla, Software Engineer at Rollbar.