Galaxy Talk: The Art of Shipping Broken Code

Welcome to our ongoing series spotlighting different talks from our interactive user conference, Galaxy, which was held in April. 

In North America, Black Friday is best known as the kickoff to the holiday shopping season, when retailers of all varieties offer significant discounts on their goods and services.

In years past, Black Friday might have conjured images of people lining up at big box stores, hoping to score some deals whilst also hopefully avoiding any sort of drama over, say, limited quantities of a new Xbox.

With online shopping skyrocketing in recent years, the more current scene is folks in pajamas sitting at home navigating the same sales via their phones and laptops. In 2020, consumers spent $9 billion on Black Friday, which was up 21.6% year over year, according to data from Adobe Analytics.

For developers, the consumer pivot to online shopping has put more pressure on software performance: e-commerce storefronts have to stay up through massive traffic surges on the biggest shopping day of the year.

That's why it was impressive to hear how feature flags gave StockX the confidence to try something new on a recent Black Friday, while the company was in the midst of experiencing traffic rushes like it had never seen before. StockX is a live marketplace for some of the hottest consumer goods available, from exclusive and limited edition sneakers to streetwear, handbags to watches.

On Black Friday, StockX A/B tested a new home page against an old one to see which would perform better. Check out the full story from StockX's senior software engineer, Kyle Schrade.

Transcript:

Hello, everybody! Welcome to LaunchDarkly Galaxy. I'm here to talk you- talk to you about The Art of Shipping Broken Code. My name is Kyle Schrade, and I'm a senior software engineer at StockX in Detroit. I am a gamer, a biker, somebody who likes GraphQL probably just a little too much, um, and a snowboarder. So some of the stuff you see on the right, pictures I've taken either on my travels going snowboarding in places or, uh, the one in the city is actually out of our office in Detroit when we were back at the office.

My Twitter handle is on here, and so is my Twitch. If you guys want, um, live tweet me- live tweet me during this, uh, I will be responding either in chat and on Twitter, so please, let's- let's talk some of this stuff.

Today though, I'm here to talk to you about our home screen story at StockX. Uh, we rewrote our entire home screen into GraphQL. And before we, you know, get into some of this stuff and start tar- talking about feature flags, we should probably touch on what GraphQL is just so when that, you know, those words start getting thrown out, you're not lost. 

GraphQL is really just a replacement for REST APIs, or you know, APIs in general. And we- it switches from you calling that end point and getting everything that end point's gonna give you, if it's over- regardless of what- if it's over fetching or under fetching, to a kind of declarative space where you give it the left side of JSON, where it's just keys, and it'll fill that out and send it back to you in almost the exact same way, unless you do some special stuff, but we don't have to talk about that. This isn't about GraphQL.
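As an editorial aside, the "left side of JSON" idea can be sketched in a few lines. This is a toy field selector, not a real GraphQL executor, and every field name in it (`title`, `market`, `lowestAsk`, and so on) is a made-up assumption rather than StockX's schema. It only shows how asking for keys returns a response of the same shape with nothing extra in it.

```typescript
// A selection is "the left side of JSON": keys only, nested where needed.
type Selection = { [key: string]: true | Selection };
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Toy stand-in for what a GraphQL server does with a query:
// walk the selection and copy over only the requested keys.
function select(data: Json, selection: Selection): Json {
  if (Array.isArray(data)) {
    // Apply the same selection to every element of a list.
    return data.map((item) => select(item, selection));
  }
  if (data === null || typeof data !== "object") {
    // Scalars have no sub-fields, so return them unchanged.
    return data;
  }
  const result: { [key: string]: Json } = {};
  for (const key of Object.keys(selection)) {
    const want = selection[key];
    const value: Json = data[key] ?? null; // missing fields come back as null
    result[key] = want === true ? value : select(value, want);
  }
  return result;
}

// Hypothetical "everything the endpoint gives you" payload.
const fullProduct: Json = {
  id: "p1",
  title: "Example sneaker",
  description: "A long description the home screen never shows",
  market: { lowestAsk: 200, highestBid: 180, salesLast72Hours: 12 },
};

// Roughly the query `{ title market { lowestAsk } }`.
const homeScreenSelection: Selection = { title: true, market: { lowestAsk: true } };

const slim = select(fullProduct, homeScreenSelection);
console.log(JSON.stringify(slim));
```

The response carries only `title` and `market.lowestAsk`, which is the over-fetching fix described above: the client names what it needs and the rest is never sent.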

Um, but, we did this mainly for two different reasons. We did this to slim down our response sizes because they were way too large, and for the upkeep. We upkeep our graphs, we make sure our graph is right, we don't have to worry about each individual endpoint being right. We have one source of truth, and it makes that really, really nice when upkeeping things.

Um, so, again, we did an A/B test on our home screen, and we used a bunch of feature flags with that. And just some- some notes on that before, our old home screen was a bunch of cron jobs that would fill up some tables, and then our- we had an AWS Lambda that would read those tables, and then send it back to the clients. The p95 for this was about 700 milliseconds, in that ballpark, and our new home screen was powered by GraphQL, and it resolved all the same data, and its p95 was about 250 milliseconds or less. And we did a ton of requests. So, we were at that point doing 470, or 47,000 requests a minute, we're well over that now.

But I know people are gonna yell at me, "Hey, you know, you can't just shave off time like that, you can't just lose 450 milliseconds." You're right. GraphQL is not a magic bullet, it does not solve every problem. What it did, though, for us at least, is it gave us way better caching, and as well as we're just transferring you less data. So before, if I tried to paste that home screen response into this slide deck, it would crash my Google Chrome, and you guys wouldn't be able to see any of this fun stuff we're talking about. Now, I can easily paste it in here and it's very easy to do.

So, we use two features here, or two types of feature flags. We had a kill switch, so if something went wrong, we could actually turn that off and just go right back to where we were the day before. And a traffic shifter to just shift the traffic from one side to the other side in whatever percentages we want. So we did 50/50, and by the end of the day we actually ramped it up more.
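As an editorial aside, here is a minimal sketch of how a kill switch and a traffic shifter can combine into one decision. It is not LaunchDarkly's implementation or SDK; the `HomeScreenFlag` shape, the function names, and the hash-based bucketing are all assumptions chosen for illustration.

```typescript
// Hypothetical flag shape: one boolean kill switch plus one percentage.
interface HomeScreenFlag {
  enabled: boolean;       // kill switch: false sends everyone to the old home screen
  rolloutPercent: number; // traffic shifter: 0 to 100
}

// Map a user key to a stable bucket in [0, 99] with an FNV-1a-style hash
// over UTF-16 code units, so the same user always lands in the same bucket.
function bucketFor(userKey: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < userKey.length; i++) {
    hash ^= userKey.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0) % 100;
}

function useNewHomeScreen(flag: HomeScreenFlag, userKey: string): boolean {
  // The kill switch wins over everything else.
  if (!flag.enabled) return false;
  // Users whose bucket falls below the percentage get the new experience.
  return bucketFor(userKey) < flag.rolloutPercent;
}

// 50/50 split to start, ramped up later in the day, and a kill at any time.
const fiftyFifty: HomeScreenFlag = { enabled: true, rolloutPercent: 50 };
const rampedUp: HomeScreenFlag = { enabled: true, rolloutPercent: 75 };
const killed: HomeScreenFlag = { enabled: false, rolloutPercent: 75 };

for (const user of ["user-1", "user-2", "user-3"]) {
  console.log(
    user,
    useNewHomeScreen(fiftyFifty, user),
    useNewHomeScreen(rampedUp, user),
    useNewHomeScreen(killed, user),
  );
}
```

Because buckets are stable, raising the percentage only adds users to the new home screen; nobody who already had it flips back. Setting `enabled` to false is the "click the button and put it to 0%" move.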

Um, and of course, we had, you know, big intentions, we were gonna do this, we had, you know, this new technology, what better day to showcase it than Black Friday? You know, like the biggest day of the year that we hopefully don't go down on and have all of this, you know, great user experiences. Um, but, as you know, everything that we see, you put it out, sometimes it just doesn't- doesn't work out. 

So, we put it out, the first day we put it out, we actually had to turn it off. We had to use that kill switch, we had to, you know, click the button and put it to 0% because there was a back end problem. Um, on the bright side, we were able to fix it, but there was a back end problem, we did have to kill it, we did have to roll that back to, you know, what the- it was the day before that day, a week before Black Friday. But because we love punishment sometimes, we turned it right back on a week later on Black Friday, the biggest day of the year, with the most traffic we have had until that point, which is a very scary thing.

But before we go too far forward, let's talk a little bit about what broken code is. Broken code is really stuff that if you turned it on, or you were using that section of code on that if block, or while block, or whatever it is, it would have crashed your apps in some way, or made it unusable for whatever end user you're looking at. So, that's an infinite spinner, a white screen, you know, like, it- the app just crashes, all of that stuff.

Uh, and we ship a lot of this stuff behind a kill switch. The hardest problem with a lot of our mobile apps is getting people to download it. And if we can get you to download our code that's gonna work in the future, but you know, may not work today, that's a huge win for us. And we can ship this, you know, weeks, months, years, if we ever have something that long, behind these feature flags, and then turn them on when- whenever the other teams are ready in the back.

So, the other thing is, we should probably stop calling them kill switches and more of like safety nets, because a kill switch kind of means like, "Hey, something went really wrong, we need to kill this today." Uh, and that's not really how they're used a lot of times. It can be really easy to detach things from the frontend and backend by putting in a- a kill switch or tr- or a safety net, to just not have that call be made. And that's a very big distinction because you're no longer killing something, you're kind of saving yourself from- from turning on a bad experience to your end users. And you can send that out wee-, again, weeks, months, years, if you really want to, beforehand, until the back end is ready.
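As an editorial aside, the safety-net idea of simply not making the call can be sketched as below. The flag name `new-home-screen`, the types, and the function names are hypothetical, and the fetchers are synchronous only to keep the sketch short; a real client would await a network request.

```typescript
// Hypothetical flag store: a missing flag is treated the same as "off".
type Flags = { [name: string]: boolean | undefined };

interface HomeScreen {
  source: "legacy" | "graphql";
  items: string[];
}

function loadHomeScreen(
  flags: Flags,
  fetchNew: () => string[],
  fetchLegacy: () => string[],
): HomeScreen {
  // Safety net: unless the flag is explicitly on, the new code path is
  // shipped but dormant, and the new backend is never called.
  if (flags["new-home-screen"] !== true) {
    return { source: "legacy", items: fetchLegacy() };
  }
  return { source: "graphql", items: fetchNew() };
}

// The client ships while the backend is still unfinished.
const backendNotReady = (): string[] => {
  throw new Error("new backend is not deployed yet");
};
const legacyItems = (): string[] => ["legacy item"];

// No flag set yet: users get the old experience and nothing throws.
const shipped = loadHomeScreen({}, backendNotReady, legacyItems);
console.log(shipped.source);
```

Defaulting to the old path when the flag is missing is the design choice that makes this a net rather than a switch: the "broken" branch can sit on devices for weeks without ever running, and turning it on later needs no app update.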

And now let's go back to the A/B test, uh, and what we did before Black Friday. Uh, again, we turned it on on Black Friday, which is a crazy thing to think of today. And what ended up happening is our new stuff actually started to outperform our old stuff and cause less problems for everything [inaudible 00:06:12]. Our new caching was overly efficient in the fact that we were just caching things better. We had way more hits than we had misses compared to the old stuff, which was huge. And by the end of the day, we actually turned on more people through that, oh, uh, that new home screen. 

And if you would've asked me the day before, you know, when we had a kill switch, and we had problems, and everything, that's a huge win that you could turn it on from our seat in Detroit and have all of this stuff work and not have things have problems like it did before. And then, also, just to, you know, gloat a little bit, the new home screen won out by a very, very wide margin, based on people actually buying product. So, that new home screen was so nice that people actually bought more shoes and more items based on just a nicer home screen. It was really nice to see.

But a big part of this is what can you take away from this talk? And the big part about that is feature flags should be... everything you ship should have some type of kill switch or safety net. If you ship something on a, especially on a mobile client, it's really hard to get people to update their stuff. So if you ship that out there and it doesn't work, they have to go then download your app again to see, you know, "Did they fix this bug? Can I do this thing? Does all of that work?" 

And that's really hard to do when you can't just turn it off remotely. You mean- you're really hoping for your- your end users to go do something to really help you out. And I'm sure everybody knows, like, end users are not like, "Oh, yeah, this works, let me go download the next thing." It- somet- a lot of times it just doesn't happen. So, shipping with that safety net is a huge, huge, huge win when you can, you know, if you see something bad, you can turn it right off.

Uh, you can also ship things to people early. Again, like, instead of turning it off once they get it, you could have it on their device, so you don't have to have that big app, uh, increase of, "Hey, how many people are downloading it? How many times do we have this known version out there? Is it there?" Uh, you can actually get that on their device early, way early, and then you have this adoption before you even turn it on, which is a huge win. And you can even l-... I- I put a joke in here, you can even launch it darkly, which is- there's a good one.

Um, all of this works. We have done some huge things at StockX like this. Again, we rewrote our home screen on all of our apps saying, "Hey, you know, it's behind this feature flag, it's behind this kill switch." And we did it on the biggest day of the year where we make a, you know, a lot more than we do on a normal day, we have extreme amounts of traffic compared to that. And it worked. It worked great. So, again, it works. It's been proven by, at least by us, so at- take my word for it, it's a huge win when you can do things like this. 

Um, and the biggest thing here that maybe isn't apparent to everybody is you can really split apart your stuff here. So, instead of frontend waiting for backend to be ready to ship something, and then you have this adoption period, and you hope everybody has it working, and all of that stuff, you can actually ship this way early, and you can decouple your frontend from the backend, as long as they have a contract that they're both okay with. 

So if the frontend and the backend agree, frontend can, uh, you know, finish weeks, months, years ahead of whatever the backend is ready. And then once the backend is ready, you can flip that, you know, that safety net, turn it on, and everybody's stuff starts working again, and you don't have to worry about these, you know, interdependencies of how do you poise stuff as much, now it's how do you turn on your feature flags, instead of how do you get it on people's devices, which is a lot simpler and a lot safer, because you can turn those off a lot easier than having a, you know, an end user undownload (laughs) your code or uninstall your app.

So, parting remarks before I- I get out of here for you, StockX is hiring. Uh, if you go to StockX, scroll down to the bottom, there's a job section, click on that, we have a bunch of openings right now, please come- come see us. 

Uh, we also have a tech blog. Tech blog came out pretty recently, you- you'll see a lot of GraphQL stuff. We're getting some LaunchDarkly things, we're getting some more stuff, but we- I wrote some stuff for GraphQL and it's out there right now, it's the easy published stuff, it's also on Apollo's website. 

Uh, if you have time today, I'm also speaking at Apollo Summit. I'm talking more about schema and nullability, and some awesome stuff of how you can live in the future, but if you don't have that today, uh, that extra time today, that's completely fine. 

Uh, again, where you can find me, I'm on Twitter @NotKyleSchrade. I'm on Twitch; I try to stream once a week. And that's all I have for you guys. Thank you so much for listening to my talk. If you have any questions or anything, please stick around for some live Q&A, or just tweet at me. I always have my phone on me for the most part, and I will respond as much as you guys need.

From launching LaunchDarkly at large organizations to supercharging your release pipeline, check out the rest of the talks from Galaxy. 

June 3, 2021