Josh Wills is an engineer on Slack's Search, Learning, and Intelligence Team helping to build the company's search and machine learning infrastructure. He is a recovering manager, having most recently built and led Slack's data engineering team as well as data science and engineering teams at Cloudera and Google. He is also a member of the Apache Software Foundation, the founder of the Apache Crunch project, and a co-author of O'Reilly's Advanced Analytics with Spark.
Heidi Waterhouse: So without any further stalling or housekeeping, our next speaker is Josh Wills, and he's bringing us the insight of how Slack is using feature flags. And it's really exciting to us to talk to people who have built their own feature flagging solution, because it gives us an idea of what people are looking for, how we can go ahead and address their questions, and what would make our product more valuable. So every time I meet someone who is using their own solution, I'm excited, because I'm like, "Tell me how that's working for you. I'm super interested in how you're doing this. How's your architecture, how's your scaling? Tell me about it." So I'm looking forward to hearing Josh talk about how several people are launching at Slack. So thank you.

Josh Wills: She's gone now. Cool. All right, thanks. Hi everybody. How's it going? Everyone can hear me okay? Cool. Let's roll. All right. Practiced pushing the button really hard, and it works. Excellent. A little bit about me. I've been at Slack in one form or another since 2015. Initially, I was the director of data engineering. I was the first director of data engineering, hired the initial team, built a lot of the early infrastructure. Realized after a couple of years that management was not really my jam, basically. So, I'm kind of a recovering manager. There's an old joke about law school, which is that it's a pie-eating contest where first prize is more pie. Management kind of works the same way: the higher you go in the management hierarchy, the more management you have to do. Before that, I worked at a company called Cloudera where I was director of data science. That was pretty fun. Just went around and talked about data science and Hadoop and big data and all kinds of fun stuff like that. And long before that, I was an engineer at Google. I spent my first couple of years there working on the ad auction.
So if you ever did a search on Google and an ad showed up, that was me, figuring out where ads should go, how much people should pay for them, that kind of thing. You're welcome. No one ever thanks me for that. After that, I spent a couple of years working on Google's infrastructure, and most fun for me, I worked on Google's feature flag system, which is called the experiment framework. I wrote the Java version of it. It is, as I understand it, still in use. I actually had a friend check for me the other day, and there are still six TODOs with my name next to them in the Google codebase to this day. So, I'll have to go back and fix those, I guess, at some point pretty soon. And now, anyway, I work at Slack as an engineer, and I work on search infrastructure, I work on kind of machine learning infrastructure, and I work on our feature flag system just because it's super fun. Before I go any further, does anyone here not know what Slack is? Any Googlers in the audience or something like that? I literally once bumped into a friend from Google and told him I worked at Slack, and he thought I was talking about the Stanford linear accelerator. I kid you not. It is a very hermetically sealed environment they keep those engineers in. Anyway, Slack is a group collaboration thing. It's like IRC but prettier, and it works on your phone. It's like Kafka, but for people. I don't know, just analogies like that. That is Slack. Along those lines, anyone not know what Google is? Just wanted to check. Everyone knows what Google is. It's a search engine. You go there to search for stuff, they do searches. So yeah. Anyway. So, way back in the day when I was first hired at Google, I was actually hired to be a statistician. I was not a software engineer. I was not anything remotely close to being a software engineer.
And so I was for many years what I think people would call a data scientist: someone who is better at statistics than any software engineer and better at software engineering than any statistician, and that described me for a long time. And then by accident, I just kind of got a little bit too good at writing software. And for most of my career out here in San Francisco, I have been working, either as a user, or a power user, or an author, on feature flag and experiment driven systems. Basically everything good that has ever happened to me in my life is a result of moving to San Francisco, meeting my wife, and working on feature flags. I kid you not, I love this stuff. It is so much fun to work on. And I, as a sort of generally introverted, not particularly politically savvy person, do not generally have a ton of influence over company culture and direction, except insofar as introducing feature flags and improving experiment driven, data driven decision making is the most impactful way I've ever been able to change the culture of a company or the places I work. I've been very fortunate, I think, to have always worked at places that cared about speed of development. They cared about data driven decision making, and they cared about this stuff just as much as I did. And so, I love the LaunchDarkly folks for going out into the world and preaching this gospel to the heathens and the unconverted, and bringing them over to the way of truth and righteousness. This is yeoman's work, and I really appreciate it, because it means that I don't have to do it at whatever company I go work at next. Virtues of this stuff, virtues of feature flags, continuous deployment. At Slack, I worked with a guy named Paul Hammond.
Who, if you don't know Paul, he and a guy named John Allspaw gave a talk at the Velocity conference in 2009 called 10+ Deploys Per Day, which was how they did essentially continuous-ish deployment at Flickr back in 2009, and feature flags were an integral part of that system. For context, in 2009 I worked at Google, and we deployed the ad server once a week. It was a weekly deploy for the ad server, and that was honestly considered pretty good at the time. I think Gmail managed to get a deploy out in a month in 2009. That was a good month for them. If they could get it done in under four weeks, that was a big deal. Etsy wrote recently about doing, or I guess not that recently, I'm old now, back in 2014, about doing 50-plus deploys per day. And as I understand it, Amazon has deployed to production like six times since I started this sentence, basically. Things have gotten kind of out of control in some ways. Slack has been an interesting experience. We obviously started with a lot of Flickr folks. If you had picked up a Flickr engineer from 2006 and dropped them at Slack in 2016, they pretty much would have been able to be productive in about 15 minutes or so. I mean, you would have to explain to them that it's a new decade, and the horrible election stuff. But aside from that, they would have been able to just kind of crank. It's a very similar system; the folks who built that kind of continuous deployment infrastructure at Flickr built the same continuous deployment infrastructure at Slack. For a little while there, we were deploying Slack upwards of 120 times a day. That turned out to be, in some ways, sort of nuts. We've kind of since then roped that back, and we'll talk a little bit more about that. But that's generally how we operate and how we build. If you don't do continuous deploy, if it's not your jam, what industry are you in, so I can start something that's competing with you, basically?
Because I'm going to beat the ever-living... anyway. Continuous deploy is the one and only way. The other thing I love about doing feature flag stuff is that in simplifying and making power users more productive and more effective in managing rollouts, you simultaneously often manage to democratize development, and democratize the fun, exploratory aspects of software engineering, for a much broader class of people. I was a primary beneficiary of that. The fact that Google had this extensive, powerful experiment driven system allowed me, a data scientist who had just kind of found out that there was this thing called Quicksort, to work on one of, in many ways, the most powerful auction systems in the world, which is Google's ad auction. I could parameterize these spaces, explore different combinations of machine learning models, pricing rules, signals, all this kind of stuff, primarily through configuration files and visualization tools, without really having to write code, while being able to push stuff like every 15 minutes. It was incredibly intoxicating and powerful to be able to work this way. I understand that we even let product managers launch features now sometimes using feature flag systems, which is to me honestly somewhat terrifying, but I understand it makes the product managers happy, and ideally distracts them from Jira for a little bit. So, they deserve a little happiness in their lives too. And then finally, last but not least, when you combine feature flag driven development... the thing I loved at Google, honestly, swear to God, more than anything else: if you wanted to launch a feature, if you wanted to launch an experiment at Google to 0.1% of the world, which is easily a few million people, even back in 2009, all you had to do was convince one other engineer that it was a good idea. That was it.
One other engineer approved this code review, approved this change to our configuration system, deployed it to a million people. That was all it took. And then your experiment, your idea, was evaluated using exactly the same metrics that we evaluated every single other thing we did. And if the idea worked, if it was making Google more money, making better money for advertisers, making users happier, it launched. That was the way it worked. That was how we cranked, and I loved working that way. Anyone ever heard of HiPPOs? Highest paid person's opinion? Are there any HiPPOs in the room right now? There might actually be some in here. So apologies to the HiPPOs. I hate HiPPOs. I hate them. Highest paid people are not generally, in my opinion, much smarter than the 22-year-old who just graduated from college. Metric driven decision making, continuous deployment environments, environments where anyone's idea can ship and become a production feature, this is where I love to work. This is the stuff that makes me happy. So yeah. We'll talk a little bit about the design space of feature flags. And this is informed by working on Slack's feature flagging system, and also working on Google's experiment driven system. I use these two terms sort of interchangeably: experimenting versus launching. Turns out if you write an A/B testing library, an experimentation library, odds are it'll probably be pretty good at launching features. If you have a feature launching system, odds are it'll probably be pretty good for running experiments. I've seen a lot of feature launching systems evolve into experiment libraries. I have seen a lot of experiment libraries evolve into feature launching frameworks. There's this sort of nice core duality at the center of these two systems. And a lot of times, which one comes first sort of depends on which company you're working for and who the primary audience for the system is.
At the Goog, everything is about data. Everything is about data, everything's about machine learning. Everything is about science, almost to a fault. Actually, no, not even almost: to a fault. It's really about science. And the problem was not the power of Google's experiment framework. It was incredibly powerful, but it also in some ways required a PhD in statistics to understand how to launch a feature, which seems a little bit like overkill in some ways. It's incredibly powerful to have that when you're developing machine learning models, ranking algorithms, all this kind of stuff. Having that power at your disposal is incredible. At the same time, Joe Engineer, who just wants to change the color of a button or something, generally speaking should not have to go through all this statistical rigamarole, for lack of a more technical term, in order to turn something on. Simultaneously, with feature flagging systems: if you work at a company where you're building a spreadsheet or a fairly well-defined application that's not fundamentally about machine learning, it really can just be about turning the feature on or off and just kind of going. Focus on engineer productivity. And those systems are good too. And they can be evolved to do more sophisticated things. But often I have found, as I found at Slack, that as they grow, they tend to run into a fairly predictable set of problems, and I'll get into that right now. So, I want to break down the feature flagging systems I've worked on into two general areas. One is the design of the library itself, the thing doing the calculations on any request to decide if a value should be true or false, or what the value of a number or a string should be, that kind of thing, versus the actual deployment infrastructure.
So first and foremost, any kind of experimentation or feature flagging library is pretty much about names and values. That's the core of it. There is a string or some other identifier that maps to some value. Generally speaking, Boolean flags, true or false, are great. But when you're doing experiments, you oftentimes want to do things that are a little bit richer, a little bit more complicated than just true or false. So strings are a good option there. You can have a multivariate treatment. You can have three or four different treatments. If you're doing machine learning, a lot of times you have thresholds and rules and combinations and mixture models for combining different things together. When you're doing that, floating point values come in really handy. And so being able to do experiments and configuration on more complicated entities becomes progressively more useful. At Google, we even went so far as actually having protocol buffers as an experimental value in some sense. You could configure a protocol buffer in a configuration file. That could then be passed as an RPC parameter, or other sorts of things like that. It got a little out of control after a while. But nonetheless, when you're designing these systems, if you find yourself as a developer going with a Boolean oriented system, that is great, that is a phenomenally good place to start. But don't stop there. Don't over-optimize for the Boolean case to the point that you preclude yourself from being able to work with strings, floating points, whatever, if and when data scientists show up at your company and want to start doing machine learning. That's thing one. Thing two: conditions and modifiers. A feature flag that you can just turn on or off is fine and is very useful.
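As a rough sketch of the "names and values" idea, here is what a flag type that isn't over-optimized for the Boolean case might look like. This is illustrative Python, not Slack's or Google's actual API; all the names are made up.

```python
from dataclasses import dataclass
from typing import Union

# A flag value can be richer than true/false: strings give you multivariate
# treatments, floats give you thresholds for machine learning rules.
FlagValue = Union[bool, str, float]

@dataclass(frozen=True)
class Flag:
    """A named flag mapping an identifier to a (possibly non-Boolean) value."""
    name: str
    default: FlagValue

new_search = Flag("new_search_ui", default=False)        # classic on/off flag
ranker = Flag("ranker_variant", default="baseline")      # multivariate treatment
threshold = Flag("ml_score_threshold", default=0.75)     # numeric model threshold
```

The point of the shared `FlagValue` type is that the lookup and rollout machinery never has to care which kind of value a flag carries.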
But generally speaking, a lot of the richness and the power comes out of having a large set of conditions and rules that can be used to determine the state of a particular flag. There are generally two kinds of these conditions. There are request-independent conditions: what is my hostname? Am I in the dev environment or the canary environment or the prod environment? Things that don't vary from one request to the next, that's a request-independent condition. And then of course there's the request-dependent stuff. Which user is this? What is their identifier? What team are they on? What country are they from? All these kinds of things. The trick to making these feature flags performant is basically caching as much of the request-independent computation as humanly possible, such that when an individual request comes in, ideally speaking, figuring out what all of the feature flag values are for that request is basically a hash lookup. It ideally requires no more work than that. That is effectively the goal of the systems I work on: to make them sufficiently fast at scale. Google has got to be on the order of tens of thousands of features active on any given request; Slack, at a much smaller scale, still has about 1,000 or so features that are in some way involved in a request. Doing that much calculation without caching is just a recipe for sadness, more or less. Common anti-pattern that I ran into: we ran into this at Google, and we ended up fixing it. I am living through kind of a repeat of this nightmare at Slack right now. You incrementally add little modifiers and conditional rules over time. Is it the dev environment or the canary environment? Is it an enterprise team or a free team or a paid team? Is it this country? Is it this hostname? Is it this hash of the team ID? Whatever it is. All these kinds of different things.
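The caching trick described above, evaluating request-independent conditions once at config-load time so the per-request path is basically a hash lookup, might look roughly like this. The rule shape and names here are invented for illustration; they are not Slack's real config format.

```python
# Illustrative flag rules: an environment condition (request-independent)
# plus a user allowlist (request-dependent).
FLAGS = {
    "new_search": {"envs": {"prod", "canary"}, "allowed_users": {"U1", "U2"}},
    "dev_tools":  {"envs": {"dev"}, "allowed_users": {"U1"}},
}

def precompute(flags, env):
    """Resolve every request-independent condition once, at config load."""
    cache = {}
    for name, rule in flags.items():
        # If the environment rule already rules the flag out, no per-request
        # work will ever be needed for it.
        cache[name] = None if env in rule["envs"] else False
    return cache

def is_enabled(cache, flags, name, user_id):
    """Per-request check: ideally nothing more than a hash lookup."""
    pre = cache[name]
    if pre is not None:
        return pre                                       # fully precomputed
    return user_id in flags[name]["allowed_users"]       # request-dependent part
```

A real engine would have many more condition types, but the split is the same: anything that doesn't vary per request gets answered once, up front.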
And then when you look at your configuration file, and you see all the different little rules and modifiers applied to a given flag, how do you know what the value of this thing is actually going to be? And then if you need to actually reorder the rules in some way, how exactly do you do that? At the Goog, they came up with this fairly clever system where you would explicitly list out the conditions and modifiers that you wanted to apply to a given flag, with an honestly fairly sophisticated set of override and modifier rules, to make it abundantly clear how exactly the value of this flag would be set in any given context. It was sort of like Java itself: very verbose, sort of a pain to write, but kind of nice to read, broadly speaking. All right. And then finally, the final element of this: some element of randomness. And in my mind, the fundamental distinction between a feature flag driven system and an experiment driven system is the level the randomness operates at. So if I'm doing feature flags, I generally want some kind of controlled rollout, right? I want to roll out to 10% of my users, 20%, blah blah blah, all the way up to 100. That randomness is usually treated at the individual flag level. An individual flag can have some kind of randomness value associated with it to control what fraction of users it will be on for. In an experiment framework, and in Google's experiment framework, individual control parameters, individual flags, could be modified en masse together. There was essentially a parameter space. There were strings, there were numbers, there were Boolean values, and doing an experiment involved overriding all of their values at once.
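Both levels of randomness can be sketched in a few lines. The first function shows per-flag controlled rollout via deterministic hashing, so a user's bucket is sticky and ramping from 10% to 25% only adds users; the second shows the parameter-space idea, where an experiment arm overrides a whole group of values at once. All names here are hypothetical, not from either company's system.

```python
import hashlib

def in_rollout(flag_name: str, user_id: str, percent: float) -> bool:
    """Per-flag randomness: deterministically bucket a user for one flag.

    Hashing flag name + user id gives each flag its own independent,
    sticky randomization; the same user always lands in the same bucket.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000
    return bucket < percent * 100      # ramping 10 -> 25 only adds users

# Parameter-space randomness: a group of related parameters whose values an
# experiment arm overrides all at once.
SEARCH_LAYER = {
    "parameters": {"ranker": "baseline", "ml_threshold": 0.75, "rescoring": False},
    "experiment_overrides": {"ranker": "v2", "ml_threshold": 0.6, "rescoring": True},
}

def resolve_layer(layer: dict, in_experiment: bool) -> dict:
    """Return the full parameter set for a request, applying the arm's overrides."""
    params = dict(layer["parameters"])
    if in_experiment:
        params.update(layer["experiment_overrides"])
    return params
```

The hashing scheme is the standard way to make percentage rollouts both random across users and stable for any one user.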
So flags existed in these things called layers, which are basically groups of related flags that could all be changed together, and that allowed us to do much more sophisticated, much more powerful experiment stuff, again at the expense of cognitive overhead for our poor engineer who just wants to launch a feature. But that for me is the major difference between these two systems: how does randomness get injected into the system? At what level does randomness exist? The result of all this is that every feature flag system I have ever worked on is fundamentally a domain specific language. It is a DSL through and through. And whether that is expressed in JSON or YAML or whatever, behind a user interface, or just written in code, that's totally fine too, as long as it expresses the core concepts of the DSL, of what is possible and what is configurable inside your system. So I'm a big fan and great connoisseur of DSLs. So, that's the library side of things. Let's talk about designing for production. And when we talk about designing for production, what we're fundamentally talking about is designing for failure. We have to think about... essentially, every single thing that can go wrong will go wrong, as it always does. Every company I've ever worked at has had a configuration or feature flag driven outage at some point. Fundamentally speaking, I think Slack has had at least two. At Google, I once took down the ad system for about 12 hours or so on a Friday. I'm trying to think. I think I may have lost about $2 million, something like that. You lose a lot of money very quickly at Google. I learned a very valuable lesson for that $2 million, I think. Not for nothing. It's almost incomprehensible that you could do that in 2009, but you could, and I did. Anyway. All right, pushing bits. How do you get configuration information into a production environment?
I have only ever really done this in one of two ways. Google has a system called Chubby. It's basically ZooKeeper, or Consul if you're familiar with that, that kind of idea. For small configuration, for small sets of experiment flags, we would push things out via Chubby. We would push a file, notify all the observers that were watching it, they'd download the files, do some initialization, and kind of off you go. For the bigger configuration systems, we used GFS. The important thing for me is basically treating configuration as code from a deployment perspective. However you deploy your code is, in my humble opinion, the way you should deploy your configuration as well. Again, having been burned multiple times by configuration driven outages, I am very much inclined to treat configuration as just as dangerous an artifact as code. Our current system at Slack does a build in Jenkins. Again, with the idea of configuration as code, we run tests against all of our configuration, both validation against the schema and actual functional tests. If we're updating the version of a desktop binary, does the binary actually exist before we push the configuration file that's going to change where it lives, that kind of thing? We bundle things up, we push them out to S3, and then we have a Consul watch that we notify, which then signals to all of the different major binaries to go download a copy of that S3 bundle, unpack it, verify it, and then send a request to a secret local URL to trigger a reload of the configuration file, that kind of thing. This is fine. I'll be honest with you. I think I would be macro happier without Consul in there at all, and just having a system that periodically polled S3 to see if a new file was there. I've just been burned by so many systems, so many times.
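The simpler polling design he says he'd prefer, skipping the Consul notification hop and having each binary periodically check object storage for a new bundle, can be sketched like this. `fetch_version` and `fetch_bundle` stand in for S3 calls; this is a guess at the shape, not Slack's actual deploy code.

```python
class ConfigPoller:
    """Periodically poll object storage for a new config bundle.

    fetch_version() returns the identifier of the latest published bundle;
    fetch_bundle(version) downloads it. When nothing has changed, a poll
    costs a single cheap version check.
    """

    def __init__(self, fetch_version, fetch_bundle):
        self.fetch_version = fetch_version
        self.fetch_bundle = fetch_bundle
        self.version = None
        self.config = {}

    def poll_once(self) -> bool:
        latest = self.fetch_version()
        if latest == self.version:
            return False                  # nothing new; cheap no-op
        bundle = self.fetch_bundle(latest)
        if "flags" not in bundle:         # verify before swapping anything in
            raise ValueError(f"bad bundle {latest!r}")
        # Swap config and version together so readers never see a mix.
        self.config, self.version = bundle["flags"], latest
        return True
```

The appeal is exactly the simplicity he's arguing for: one moving part, no notification channel that can itself fail or misfire.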
I spent the first 20 years of my career writing code, and I would like to spend the next 20 deleting it, just as much as possible. So, simplicity, simplicity, simplicity in all things. My desire for new shiny things has kind of fallen by the wayside with my old age. At Slack, when we do deploys, we push things to staging. We run a bunch of sanity tests to make sure stuff basically works. Then we dogfood it. Dogfood is basically Slack for us. Dogfood's been great. Dogfood has caught and prevented a couple of outages for us, actually, I think in the past month or so. And then we roll things out on a percentage basis, 10%, 25%, so on and so forth until we're out to 100, and we watch metrics the same way. We treat configuration exactly the same way. Configuration is code for all intents and purposes. It is deployed exactly the same way. We check the same graphs. We make sure everything works in our core web application infrastructure, the same as whenever we're deploying code. All right. And then last but not least, keeping up with your features. In the same way that when you push a new binary build, you have information available on every request about which version of the build is running that particular request, we have that same exact information for which version of the configuration files is running on any given request as well. And then finally, when a feature is actually triggered on a request, we fire off a log event. Basically, this feature got triggered on such-and-such request, and then we track any kind of performance metrics downstream. We track a bunch of growth related metrics, as you can imagine. And then we track for outages and stuff like that.
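The "fire off a log event when a feature is triggered" step might look something like the sketch below: each exposure event carries the flag, its value, the request, and the config version, so downstream performance, growth, and outage metrics can all be joined against it. The event shape here is invented for illustration.

```python
EVENTS = []  # stand-in for a real log pipeline

def check_flag(flags: dict, name: str, request_id: str, config_version: str) -> bool:
    """Evaluate a flag and record an exposure event for downstream metrics."""
    value = flags.get(name, False)
    EVENTS.append({
        "flag": name,
        "value": value,
        "request": request_id,
        # Tagging the config version lets you correlate breakage with the
        # exact configuration push that introduced it.
        "config_version": config_version,
    })
    return value
```

Logging exposures rather than just assignments is what makes the outage detection he describes next possible: you know which features were actually live on the requests that died.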
We look for, if we're turning on a feature for the first time ever, and suddenly a bunch of code paths are dying, or if we're ramping something up and a bunch of code paths are dying, we can detect what features are involved, in some sense, in things breaking down that way. Pretty cool. Last but not least, let's talk a little about how you evolve your feature flag system. Slack's main heavy duty monolithic web application was originally written in PHP, much like Flickr and Facebook and Wikipedia and Box and a whole bunch of other people. And PHP is, as we all know, a terrible language. And while it is a terrible language, it's also... and I can't decide if I'm a convert or if I have Stockholm syndrome or whatever, because I worked with C++ and Java and Go for a long time. But here we are. I have come to, in some very weird, strange ways, appreciate PHP as a development environment, in the sense that I sort of like the way it constrains things. One of the core ideas of PHP is that every request starts out with a clean slate. Nothing is initialized beyond the absolute bare minimum to bootstrap the request. So you have to initialize all of your global variables every single time. And the nice thing about this is you never get into a state where you completely hose yourself by screwing up some global state that destroys every single request on your server. In my C++ days at Google, when I caused that aforementioned 12-hour outage, I introduced a segfault; C++ servers don't like segfaults, they just die. A PHP server will let you divide by zero, and it will keep going. It'll do kind of crazy stuff like that in service of keeping a request going, which again, in the sense of fault isolation, of limiting the blast radius of things, is actually kind of a good thing, broadly speaking, not terrible. And then, you know, it's got a pretty good concurrency model.
You can run every request independently in its own thread. Google actually had infrastructure that let you run C++ servers in essentially this way. It was kind of funny to me as I reflected on it: the way Google forced us to run C++ and Java based services was in many ways the way that PHP just runs out of the box. It's just in many ways a better, safer and more efficient way to operate. And then finally, the absolute virtue for me is programmer workflow. Never having to restart a server, never having to recompile. Change the file, fire the request, change the file, fire the request, being able to iterate that fast. In many ways, in spite of myself, I really have come to like PHP, God help me. Here I am. All right. This is Flamework. If your Google-fu is sufficiently good, you can actually find this GitHub repository on the Internet. If you type flamework into Google, it'll say, "Did you mean framework?" Thank you, machine learning model. Good stuff. But this is Flamework. It's the Flickr-like web framework. And again, this is the code that I was exposed to when I first showed up at Slack on my first day back in 2015. And the kind of cool thing about it is that it's sort of brilliantly brain-dead simple. I was at Slack on my second day, and I fixed my first bug, and you know where I fixed the bug? Any guesses? It was in the feature flag library. That's right. Exactly. That was literally the first place I went. It was a very simple system that made ample and aggressive use of global variables. Again, global variables in the context of PHP are not quite as bad as they are in other systems, because of this request-independent statelessness model. It's only global within the context of an individual request. So, kind of, who cares.
And again, since it emerged from the system they developed at Flickr, it had a feature flag library built right in. The problem with this is that the feature flag value, the actual result, the true or false value, was stored in a global array. And since it was just in a global array, there was no function call or whatever to do lookups for it. Basically, at the start of every single request, you had to initialize all of the feature flags for the request, regardless of what the request was doing. Was it a health check? Who cares? Initialize all the feature flags. Is it downloading a file? Who cares, initialize all the feature flags. This is, again, totally fine and not a big deal when Slack is very simple. When there are 15 feature flags, and they have a very simple set of rules, and users can only be on one team and blah, blah, blah, blah, blah, this is all fine. Nowadays... sorry, it's Freudian. I should hope that Slack becomes Google someday. Slack has about 1,000 feature flags. A lot of people are on multiple teams. Enterprise users are on multiple teams. I'm sure many of you are, right now as I speak, on multiple Slack teams chatting away. The complexity of the logic and the amount of calculation that has to be done in any given request to figure out the value of every single feature flag in the system kept going up and up, to the point where, not a few weeks ago, sorry, a couple of months ago, at P50 we would spend about a millisecond just computing feature flags on every single request we did. And there was silence in the room. That was amazing. That was incredible. How did you guys do that? That was awesome. Just, the horror. All right. A couple of years ago we started migrating from PHP to HHVM/Hack. Are folks familiar with this? Facebook built itself on top of PHP. Basically, I'm trying to get a good analogy.
They kind of dug the hole too deep, and instead of climbing out, they decided just to keep digging to get to the bottom of the earth. And they wrote their own programming language, and they wrote their own runtime for it. And that is HHVM and Hack. And it provides a large number of kind of nice things, I'm not going to lie. First and foremost, migrating from the PHP five-ish runtime to HHVM cut our CPU budget in half. That was pretty cool. A lot of money, free money. It gave us sort of proper types. We can evolve our type system over time. We can add types to fields. We get a nice collection library, we get generics, we get lambdas, and we also get a fairly excellent implementation of APC. Anyone know APC? The Alternative PHP Cache. It's just a cache. And importantly, it is one of the few ways in PHP to maintain state that exists across requests. Okay. So anyway, we started migrating off of PHP and onto Hack. And as we did this, we took this big array of global feature flag hash value lookups, and we wrapped it all in functions, we added nice typing for it, and we generally tried to wean ourselves off the habit of looking things up in global arrays whenever possible. And this was pretty great. The one thing we did not do, though, once we had done all this work to wrap all of our feature calls inside of is-this-feature-enabled functions, was actually update the feature calculation engine to no longer initialize every single thing, every single time, in one giant global state array. And so that was the work that I set out to do a couple of months ago. I was like, "I know these systems well. I can fix this and make this better."
Because the funny thing, right, about doing a feature flag migration, when you're changing the feature flag calculation engine, when there are thousands of flags in the system gating important features, turning different stuff on, is that it is in fact absolutely terrifying to do. And there's a very fine line, in my experience, between an intern project and a senior staff software engineer project. They're both sort of ridiculous ideas. One of them maybe has slightly more of an opportunity to launch in production than the other one, but only slightly; it's pretty close. So, I rewrote and modernized the feature flag calculation engine that Slack had written many years ago and had not touched in a couple of years, to do all of the individual calculations in kind of two ways. One, for any request-independent conditional logic, just do all the calculations up front at the time we're loading the config, just cache it away, store it away in APC for the lifetime of the request. No harm, no foul. For everything else, for every online feature flag calculation we have to do, just calculate the one flag, just calculate the flag at that moment given the environment you're in. That's it. Don't do anything else. And I think, for me, one of the great jokes of doing this is, okay, how do I feature flag changing the feature flag engine? That was more or less what I did. And I did it kind of the stupid way that you're not supposed to do feature flagging: by hacking and hard-coding in little random bits, checking does the hostname match this pattern, and all of that. I basically did poor man's dumb feature flagging in order to re-engineer the feature flag engine, because I couldn't use the feature flag engine to test itself. You know what I mean? It's fine. It was always kind of funny to me.
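The two-way split he describes could be sketched like this in Python (the real engine is Hack and uses APC; the dict-based caches and rule names here are stand-ins):

```python
# Hedged sketch of the new engine's split (invented names, Python not Hack):
# request-independent flags are computed once when the config loads and
# cached (APC in the real system); everything else is computed lazily,
# one flag at a time, memoized for the rest of the request.

STATIC_RULES = {"dark_mode": lambda: True}  # no per-request inputs
DYNAMIC_RULES = {"new_search": lambda env: env["team"] == "acme"}

# Computed once at config-load time, standing in for the APC cache.
_static_cache = {name: rule() for name, rule in STATIC_RULES.items()}

def is_enabled(name, env):
    if name in _static_cache:
        return _static_cache[name]  # precomputed: zero per-request work
    per_request = env.setdefault("_flag_cache", {})
    if name not in per_request:  # compute just this one flag, on demand
        per_request[name] = DYNAMIC_RULES[name](env)
    return per_request[name]
```

The contrast with the old engine is that a request that never asks about `new_search` never pays for it; nothing is initialized eagerly except what is request-independent anyway.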
Like I just saw this sort of pattern over and over again. At Google, there is Borg, and Borg runs kind of all the cluster management stuff. But the question then is, well, who monitors Borg and makes sure Borg's up? And the answer is the old system, which was called Babysitter. Babysitter was the pre-Borg monitoring system, and Babysitter still lives on, monitoring Borg. My favorite version of this by far, though, was when Google built a new file system called Colossus, which was designed to support Bigtable. Except in order to maintain the state of where files were located in Colossus, they stored that in Bigtable on top of GFS. Have you ever seen an image of the snail running on top of the turtle? It was kind of like that. It was, anyway, Bigtable on top of Colossus on top of Bigtable on top of GFS. Anyway, it's all just a tower of cards, and so on and so forth. Anyway, last but not least: simpler, more pleasant, more productive. I rolled out the new feature flag calculation engine for Slack over the week from Friday, March 1st to Friday, March 8th. The fine people in our PR and finance departments and all that kind of stuff wouldn't actually let me put a y-axis on here, but you know, these are numbers, and the red numbers are lower than the blue ones, and that's good, right? That's generally a positive thing. A little simpler, a little more pleasant, a little more productive, just by doing fairly trivial but also terrifying software engineering. Right? Basically not doing anything hard, but doing it in very scary places. That's kind of the story of my career, actually, in most ways. If that's the kind of career you would like to have, then you should have your head examined, or go to law school or something. But if, in spite of that, you still want to do it, Slack is hiring, as you can imagine.
And I am more than happy to answer questions, if I can, for the next 10 minutes or so. Anything y'all want to know, you can just holler at me, and I'll answer. Thanks very much. Do we have a microphone or anything? Is it just like, whatever? People can just... if you just like- Audience Member: I'm doing all right. So, how do you guys deal with automated testing in terms of... Josh: Yeah. How do we deal with automated testing in terms of our feature flags? Great question. Feature flags are tested, again, just like everything else. There is a test suite for them. For a long time, the feature flag system was sort of at the core of our web application. So, in my Google days, YouTube had a file called config.py. YouTube is written in Python, right? So config.py was where YouTube defined all of their feature flags. Slack had a very similar system. It was called config.php. It's a good name for a file; it's very clear what it does. When you would make a change in there to ramp up a flag or change a condition or whatever, it would run through the exact same normal code testing suite as everything else. But because that config file is at the center of literally everything, everything depends on it, it would be the long pole of our test suite. It would take about 10 minutes or so for us to run every single test touching that particular file. One of the main virtues of moving the configuration system out of the core web application and into an independent repository was to make those tests sane, broadly speaking. So, when you define a new feature flag at Slack, you define it in code. It's like: this is me, this is the name of the flag, this is what it does, this is when I promise to remove it, although I probably won't, that kind of thing. And you do unit testing against it just as you would anywhere else.
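A flag definition carrying that kind of metadata might look roughly like this; this is a hedged Python sketch, not Slack's actual schema, and every field and name here is an assumption:

```python
from dataclasses import dataclass
import datetime

# Hypothetical flag definition carrying the metadata described above:
# owner, description, and the date you promise to remove it (but won't).

@dataclass(frozen=True)
class FeatureFlag:
    name: str
    owner: str
    description: str
    remove_by: datetime.date

NEW_SEARCH_RANKING = FeatureFlag(
    name="new_search_ranking",
    owner="josh",
    description="Gates the reworked search ranking pipeline.",
    remove_by=datetime.date(2019, 9, 1),
)
```

Declaring flags as code like this is what lets them go through review and unit testing like everything else, while the ramp-up values live elsewhere.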
But everything after that, all of the ramping up, all that sort of stuff, is all done in the separate configuration repo with a much lighter test suite associated with it. Again, deploying the config is just like deploying the code. The config goes out to staging, a sanity test runs, the whole automated end-to-end suite runs. There's a deploy commander basically keeping an eye on the graphs, making sure things are sane, alerting people if things go awry. There's a big list, in this cool tool we use called Slack, of every single flag that was changed in this release, so it becomes easy to quickly correlate and figure out what went wrong, that sort of thing. But, yeah, it's definitely lighter-weight code, lighter-weight testing, because it is again very focused on what exactly this feature is going to do. But it is still subjected to the same rigors of deploy that we apply to everything else. Yeah. It's a great question. Thank you. Yeah. I mentioned food, but it's not food time yet. You have another speaker after this, so yeah, ha-ha, suckers. You don't get any food yet. Never mind. Anyway. So yeah, please feel free. Anything else? Otherwise, I might get the conference back on time, which is just kind of what I'm known for. Cool. Thank you all. I appreciate it. Take it easy. All right.