Not Just Buttons: Feature Flagging Your Machine Learning Architectures

Vijay Ramesh, Demandbase

Feature flagging is common practice in the UI and front end layers of modern web-based applications. The ability of feature flags to allow selective roll-out of new UX opens many more doors in the world of product possibility - and the benefit of using a service like LaunchDarkly to enable this is easy to see. At Demandbase, while we initially used feature flagging to control what our customers see in our platform, we soon saw the potential for a similar paradigm to impact backend APIs and even our machine learning modeling itself. In this talk, you'll learn a bit more of our history with feature flagging and LaunchDarkly, and how this led us to explore using feature flags to control production machine learning workflows on Spark. In doing so, we were able to experiment faster - accelerating our learnings in a real production environment. More than just supporting R&D though, by bringing feature flagging into our machine learning workflows, we are able to deliver more value more quickly to our customers, in a safer and more robust way than ever before.


Vijay Ramesh

Vijay has been building software professionally for 15 years, focusing mostly on full-stack and backend development for the first half of his career before transitioning into data engineering and applied data science roles. For the past few years he has also led data science teams at Change.org and now at Demandbase. Tackling problems in NLP and Machine Learning in large production systems has been a great passion of his, and helping others to learn from his experience (and learning from theirs) is incredibly important to him. Additionally, with an academic background in philosophy (of the non-analytic, Continental variety--think more Hegel and less Frege), Vijay feels he approaches many of these sorts of problems differently than many of his peers.

Heidi Waterhouse: So our next talk is by Vijay, who is also a data scientist, and we learned this morning that there are a lot of different ways to be a data scientist. Vijay is one who works in natural language processing, which I find fascinating mostly because internet is actually a valid answer for things. Why are things terrible? Because internet. Which seems like it might be ungrammatical, but is in fact a full explanation of what's going on. So he's going to talk to us about how things are not just buttons, and how feature flagging and machine learning could actually be best friends if you think through the implications and how you're going to do it. So, please welcome Vijay.

Vijay Ramesh: Hey, everybody. All right. Thank you all for coming. I hope you've all been having a good day. I have so far. Super beautiful location, and thanks to LaunchDarkly for getting this all going. So yeah, like she said, in my talk today I'm gonna talk about how at Demandbase we've been using feature flagging and LaunchDarkly not just in the UI layer, not just to hide, roll out, and selectively deploy features, but actually in production machine learning workflows: to test out changes, to run experiments, to test scaling, things like that.

There we go. So first, before I begin, a bit about me. This slide is actually a couple of years old, so now I've got maybe 17+ years of engineering. I started as a full stack engineer. The last seven or eight years or so I've gotten a lot into data engineering, applied NLP, applied machine learning. So less on the math side of the house and more on the, well, how do we build a scalable, monitorable, resilient production system around these models?

My background has nothing at all to do with computer science. I studied academic philosophy, focusing on contemporary continental and critical theory and things like this. So if anyone afterwards wants to discuss Hegel to Habermas, I'm all about it. And there's my GitHub if you want to look me up.

All right. So just a brief agenda. So before we get into the fun machine learning stuff, I want to tell a bit of my own story with feature flagging, with doing multivariate testing, and with systems like this. So first, I'm going to talk a bit about a system that we built when I was a data engineer and eventually ran the data science team at Change.org, and this was in the pre-LaunchDarkly days. So this would've been about 2013, I think, when we started architecting out the system.

Then I want to talk a little bit about how feature flagging and your strategies there are different in the B2B world versus the B2C world. So Demandbase, if anyone's not familiar, we're a B2B, very data science heavy, marketing, sales, advertising company. And so, moving into that space from the B2C world of Change.org, there were a lot of changes to the way that I started to think about, well, what's the purpose of feature flags? How do we actually want to run tests? What do we want to learn?

And then, finally, I'll get into the fun stuff. I'm going to show some examples where, in production today at Demandbase, we're using feature flagging in our machine learning workflows.

All right. So back to the pre-LaunchDarkly days. So I moved to the Bay Area in 2012 to work for Change.org, and shortly after I started we created a data science team. So I was data engineer one of two on this initial data science team. If anyone's not familiar, it's a petition platform. Hundreds of millions of users. There's no strict hierarchy for users. There's not this idea, like in the B2B world, where you have users all under a customer account. So here it's all behavioral. Users are grouped by the petitions they sign, the things they share, and so you end up with groupings that are natural and behavioral. Here's users who are really interested in animal rights and here's users who are really interested in labor rights, and maybe environmental rights is somewhere in the middle of there.

But unlike the users, the organization itself was structured in these geographic verticals. So campaigns teams, communications teams, marketing teams, business development, all that were grouped into regions, and then countries under there. And what all this means is, when we were starting to think about, how do we want to do experimentation? How do we want to do multivariate testing? How do we want to do deploys where we can roll things back out and selectively release features? There was no one size fits all heuristic. So what performs well for users in Indonesia is going to be very different than what performs well for users in California. And the strategies that these country teams wanted to take were really different country by country.

So we sat in a room. This would have been the end of 2012, maybe the very beginning of 2013, and we started talking about, well, we need a multivariate testing system. What all does it need to do for us? One of the things we really realized we needed was a consistent user experience. So across different users we could try out a bunch of different variations. Maybe you sign a petition and then you're going to see some upsell and then go on to your next action, and maybe we want to figure out, well, what should that upsell be and what should the next action be? And how should these forms look?

For the same user to come back to this platform again and again and start experiencing different things is very jarring. It's also a problem from the point of view of statistical soundness. If you're trying to run experiments and users are not deterministically treated into a particular group, so that they consistently see variation A or variation B, then when you go back and you try to analyze what the impact of this experiment was, it becomes really, really messy.

The idea we had was, we want users to get deterministically put into a group, say the control group or variation A. Over time we want the system to be able to expand the variations. If variation A is doing really well, we want to add 10 more percent of traffic, 20 more percent of traffic, and you can eat out of this control space. But the kind of contract that we made was, well, we'll never eat into a treated variation space. Once I've seen this experimental flow, unless we restart the experiment, I'm always going to see that experimental flow.

This is all about making sure that in the data underneath we weren't confounding different factors, so we wouldn't leak bias into our results. Also, we needed to build real time management. We wanted a tool that product could go and set up. It would not be engineers deploying JSON configs, it would not require servers restarting or anything like that. A product manager could sit there and say, okay, I'm going to look at the results for this experiment, and now for this week I'm going to turn off this variation because it's horrible and turn this other one up twice as large.

Then finally, visibility. Any system like this, if you can't understand what the state of things is, what users are seeing, what impact it's having and all of that, it's going to be meaningless. We knew that we needed to build a system to support a lot of scale and to support near real time distributed tracking. So we would push stuff to a deeper analytic store, Redshift in this case, for data science and so on to run regressions and really deep analysis. But for realtime tracking we wanted some sort of dashboard where product could go and build a funnel and see, okay, here is this event, and then I can see the users who saw this version or that version and let's see how they perform further down the funnel.

Remember this is like late 2012, early 2013. It's a year and a half before LaunchDarkly was founded. We looked around a little bit; there weren't really any big software as a service companies doing this back then. As many engineers who are a lot smarter than me have done before, and as I'm sure many engineers who are a lot smarter than me will do again, we thought, you know, how complicated can this be? Let's just build it ourselves. So a bunch of engineers sat in a room and we laid out what we were going to need here.

The first thing we needed was kind of the secret sauce, the algorithm. It all came down to this idea of deterministically hashing a user into a certain variation. You can imagine, you take a user ID, you take an experiment name, add in a salt there so that there's some extra little bit of text, so if you want to restart the experiment it's really easy. Take those and turn them into some sort of deterministically hashed output. Something like MD5 you can really easily then turn into a hexdigest. So you end up with this hexadecimal string that represents this user, this experiment, and this salt, and no matter which machine calculates it, no matter if we wait six months to recalculate it, you will always get the same value, which is important.

You can take this value, and you can imagine in the hexdigest space the largest number there is F repeated 32 times. Take the value for the user, divide it by this maximum number, and you end up with a number from zero to one for this user. Then that maps really well to this idea of, well, 25% of the user base is going to see this red sign button and 75% will see the normal control flow with the blue button. And we could still grow these partitions and say, well, now the red button is actually going to have 50%, so any user whose value, hashed and divided by this maximum hash value, falls into that space, say 0.75 to 1.0, okay, they see the red button.
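As an illustration of that deterministic bucketing trick, here is a minimal sketch in Scala; the actual Change.org service was written in Go, and the names, salt format, and variation ranges below are made up for the example.

```scala
import java.security.MessageDigest

// A sketch only: hash (userId, experiment, salt) to a stable value in [0.0, 1.0),
// then map that value onto variation ranges.
object Bucketing {
  def bucket(userId: String, experiment: String, salt: String): Double = {
    val bytes = MessageDigest.getInstance("MD5")
      .digest(s"$userId:$experiment:$salt".getBytes("UTF-8"))
    val hex = bytes.map("%02x".format(_)).mkString      // 32-character hexdigest
    val max = BigInt("f" * 32, 16)                      // largest possible hexdigest value
    (BigDecimal(BigInt(hex, 16)) / BigDecimal(max)).toDouble
  }

  // e.g. Seq(("red-button", 0.75, 1.0)); anything not covered falls back to control.
  def variation(b: Double, ranges: Seq[(String, Double, Double)]): String =
    ranges.collectFirst { case (name, lo, hi) if b >= lo && b < hi => name }
      .getOrElse("control")
}

// The same user always lands in the same place until the salt is changed to restart
// the experiment:
//   Bucketing.variation(
//     Bucketing.bucket("user-42", "signup-button-color", "v1"),
//     Seq(("red-button", 0.75, 1.0)))
```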

We figured this out; it was a pretty simple little trick. Then we had to actually build a service around it, and we were anticipating really, really high throughput: all of our backend services, all of our UI layers, all of our mobile apps, a lot of asynchronous stuff like email tracking. All of this would need to make calls to the service, maybe look up a cache or whatever, but then figure out: hey, I've got a user, I've got a little bit of context and the name of an experiment, tell me which version they should see.

Also, GoLang seemed like a good fit here for the high throughput side of things, but also there would be a lot of asynchronous stuff that would happen after we've already responded to the client. Imagine, say, a mobile app: it's going to hit this experiment service, say user A, experiment button color, what variation should they get? The service needs to respond back in milliseconds, very, very quickly. But then in the background it can go and fire off some events and say, okay, now let's store this in Redshift, let's push it to our event tracking, maybe there are some triggers that we want to run if this is the first time this user has seen this variation, we're going to go bust the cache or something like that.

GoLang has concurrency primitives built into it that seemed like they would work well here, so we could just fire these all off as goroutines. Also, at the time GoLang was really sexy and it was cool to say, "Oh, I have a reason to use this and convince my boss that we should." We built out the base service in GoLang for the backend. At Change, at the time, we were already using Cassandra. It seemed like a really good fit for this time series data: we were going to have a massive volume of users and experiments and treatment events, like this user saw this experiment in this variation at this time, with maybe a little contextual blob.

Cassandra made a lot of sense there. For the actual experiment catalog, so the names of experiments, the variations, who updated it and when, is it active, all that sort of stuff, really just using a relational database makes more sense, but we didn't want to have to connect two different drivers in this Go service. It was already going to connect to Cassandra, so we were able to kind of add that all in there as well. And then finally, fluentd served as the event bus, the sort of backend of all of this. If people aren't familiar with fluentd, it's similar to Kafka but in the Ruby world, much more in sort of the web app space than the big data space.

But it's basically an event bus, and we could say, oh, this Go service is going to fire off some goroutines, it's going to send messages to fluentd, and then fluentd knows how to aggregate them, buffer them, and then do stuff with them. So maybe it sends data off to Redshift, or to Amplitude, a third party event tracking service that we used to build out some of the front end here. And we could support some simple triggers, around say the first time that a user is seen for a particular experiment, where we go and send an email off or bust the cache somewhere. Something like that.

All right, so maybe three months into 2013 we had done it. We had built this service. There was also a small Express front end on top of the API so that product could go and manage experiment names and see what's going on. And our product, our PMs and engineers and QA and everybody in the organization was able to track and manage these multivariate experiments with not much engineering support. And on the upside, as we started to experiment with things like multi-armed bandits and move to where, okay, over time we're going to just push more and more traffic to the winning variation rather than have a human being sit there and make all these judgment calls, the system really supported it well.

The biggest thing to me was the paradigm shift in development, where you don't have to really worry about, "Oh, maybe this is a bad idea." You can get things all the way out to production and have kill switches. You can have ways to track and monitor and automatic alerting, so if something goes horribly wrong, let's flip back to the control behavior, figure out what's going wrong in staging, and then push a new deploy the next day. That paradigm shift of being able to put much smaller, much less tested ideas all the way out in front of users and get information on them really kind of changed the way that I built web apps.

But then the downside here, I mean, this was so much stuff to build and maintain. There was this GoLang service: you had to build it, monitor it, scale it. There were Cassandra keyspaces; if anyone has run Cassandra, it's a full-time job. Fluentd provided its own host of problems. And even though we built all this stuff in house, there were still a whole bunch of third party integrations. We were paying lots of money for Redshift, paying lots of money for Amplitude. We didn't want to build analytics dashboards on top of this, so we had to pay for some SaaS product that does that.

And so the end result, I mean, it was very cool. We did a lot with it. This system, largely in this same form, is still being used by Change.org today. But building out this whole thing was so far removed from our core competency. It had literally nothing to do with online activism and campaigning. And we were using a lot of very good engineering talent to build out and support and maintain and scale and grow this experimentation system rather than building features in our platform and things like that. It was fun, but I'll never do it again, right?

In 2017, I left Change.org and moved to Demandbase, and near the end of 2017, in our dev managers meeting, we had been talking about the need to be able to do canary deploys, the need to be able to roll things out selectively to production, to roll things out to just internal Demandbase staff so they can test things more. And so after some of these conversations, around the Christmas holiday, I went home and over the course of a weekend, maybe two, I wrote a Play API to kind of support the basic thing that we needed.

From our point of view as a B2B company, well, we don't just care about the user. We care about the user plus the customer account that they're currently managing. For most of our users, that's one: I'm an Adobe employee, I log into Demandbase's platform, I just see Adobe's data and information and audiences and all that. We do have internal staff who, you know, can see a bunch of different customer accounts and switch between them. We do have agency staff, where maybe an agency is managing the account-based marketing strategy for three or four different clients. But everywhere where we're talking about something being on or off, or an experiment running, we always have to first consider: well, within this customer account.

I built out this kind of simple API that would support this: given a user, given a customer, given the name of the feature, tell us if it's on or off, and provide some APIs so that we could turn things on and off and we could flip back to control and whatever else. So right around the new year then, I presented to the dev managers team: okay, here's this cool API, look at all these wonderful tests. Oh, and I decided to use DynamoDB because why the hell not?
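To make the shape of that concrete, here is a minimal sketch of the kind of lookup such an API performs; the storage layer is abstracted away (the real service was backed by DynamoDB) and every name here is hypothetical.

```scala
// Everything here is illustrative, not the actual Demandbase code.
final case class FlagContext(userId: String, customerAccountId: String)

trait FlagStore {
  // An explicit per-account override, if one has been set for this feature.
  def accountOverride(feature: String, customerAccountId: String): Option[Boolean]
  // The global default for the feature.
  def default(feature: String): Boolean
}

class FeatureFlagService(store: FlagStore) {
  // In B2B the customer account, not the individual user, is the unit that has to see
  // a consistent experience, so the account-level override wins over the default.
  def isEnabled(feature: String, ctx: FlagContext): Boolean =
    store.accountOverride(feature, ctx.customerAccountId)
      .getOrElse(store.default(feature))
}
```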

And we realized there's so much work here.

Thankfully I remembered my time at Change, and I remembered: is it really worth taking multiple UI resources, taking a designer's time, taking a backend engineer's time, taking a DevOps engineer's time, all of these people, to get this system out when it has nothing to do with account-based marketing? It has nothing to do with what Demandbase does. It would just exist to enable us to then build out our product.

Mark Cooper is another one of our dev managers, and in this meeting he brought up LaunchDarkly. He said that he had a friend working there, and they have this cool product: it's feature flagging as a service, multivariate testing as a service. So he went and signed us up for a demo and gave myself, him, and another engineer who was doing Salesforce integration stuff access. We started kind of fooling around with it to see, well, is this going to meet our needs?

Let's talk a bit about what these needs are. So in the B2B world, versus the B2C world, there's this segmentation by customer accounts. The worst thing that we can have is a user at Adobe, say the marketing guy, logs in and he's going to go look at how my advertising campaign is doing and let me start a new campaign. Meanwhile, his co-worker, also at Adobe, logs in and he's going to go start a new campaign and they're going to work together on this. Well, if we're running some sort of experiment or feature flag where the one guy is seeing a certain UI and the other person is seeing a different UI, they get really confused. People get really upset. We basically just can't do it: within a customer account, we cannot provide different experiences. We can provide gradually improving experiences, so based on your permissions, maybe you have access to some early adopter features. But we can't really do the sort of A/B testing that you do in the B2C world of, I'm going to show 50% of people this blue button, 50% of people this red button, and let's see which one does better.

So yeah, it's very difficult to do any sort of A/B testing within a single customer account. We have to recognize the customer account boundary: within it we need a consistent experience. Across customer accounts we have a lot of room to go, we can do experimentation, we can do early adopter programs. But within a single customer account, it's not so feasible.

Also, Demandbase first and foremost is a B2B company, and we're always our own first customer. Anything that we build around marketing, around sales, around advertising, around analytics, internal Demandbase marketing, sales, and BI staff use it before any other customer uses it. So we definitely wanted a tool where we could say, let's release some of these features; it's going to be hidden for everybody except maybe QA and product. They're going to do some smoke tests on production, maybe a little bit of load testing. Then let's turn it on for internal Demandbase staff who are technically savvy; we'll have them use it next, get some feedback, see what they think, maybe push out a few changes.

And then roll out to an early adopter program. These are really key to our product lifecycle, particularly in this B2B space where you don't have hundreds of millions of users logging into your platform every month: working directly with customers from the get-go, understanding what their needs are, what we're building, where their pain points are, and then, as we start to get things out into an actual product, giving them access to it.

It's always been a big part of our development lifecycle. Most of our data science products start with a data scientist and a PM sitting with a spreadsheet and a Python notebook, and then once there's something there, then, oh, let's get some actual customers in here and let's kind of draw on the whiteboard: hey, here's what this might look like in a product, but what do you think of these results?

So any sort of feature flagging we wanted to go with, and LaunchDarkly supported this, had to support this idea where we could roll out to different segments over time. This would be my only code snippet for today, I promise, and I'm not a front end engineer at all. This works, but I might make some things up as I explain what it does.

When we were first rolling out LaunchDarkly, I think, probably like most people in this room, most companies, we were thinking about the user interface layer, the client side. For the most part we were already using React and Redux, and there's a nice LD Redux library that makes it very easy to just kind of plug this into your system. The only thing that we really have to do is some basic user construction. So we always have some additional context: what customer account is this user currently managing, and then that goes in as a custom attribute parameter into LaunchDarkly. And then some defaults, the current feature flags we have, so that a local dev, maybe even without an internet connection, could still run the UI apps and do development on them without needing to actually go all the way to LaunchDarkly to set things up.

Once we have all this, well, then we can roll something out to production. We can turn it on: maybe there's an export to CSV link and we're not quite sure about the backend there, and we're not quite sure about the usability, and so we turn it on just for product and QA. We do a bit of testing and we turn it on for internal customers. They're really happy with it. Product starts building out this early adopter program, and we can roll things out to them without having to change anything except going into LaunchDarkly and saying, okay, now this customer account also has access.

Really importantly, if there's an issue, you don't have to roll back. You don't have to deploy. You can go into LaunchDarkly and just turn something off. Product and sales and customer success and all of those sorts of people really, really loved this idea that, well, we can build out these early adopter programs and actually have them in production in the product. We don't have to have somebody come sit in a conference room in our office in San Francisco and look at things in staging, and we certainly don't want to give customers VPN access so they can get into our staging environment. But this kind of opened up: well, we can run these programs in the production product while we're directly building on it.

All right, so the whole title of my talk is that it's not just about buttons, right? It's not just about do I show this export to CSV link or not. So before we even get into the machine learning and the Spark side of things, just managing an early adopter program like this, especially one where we can start scaling it up and we don't have to go and redeploy, we don't have to rebuild backends. Feature flagging then gets integrated through more than just the UI layer; it's going to get integrated into kind of the whole lifecycle of your product development.

For me, and for us here at Demandbase, it's meant a couple of things. One, I mean, pretty much anybody who started rolling out features in the UI, I think the next step you go to is, well, hey, my backend, you know, my controllers and my APIs. If I'm in this experimental group, if I'm seeing this new feature, maybe I have to hit some different services. So maybe we have a service-oriented architecture, and some Rails API in the backend is going to go and hit something else. Maybe if you're in the early adopter program, instead of hitting this old service, we're going to go and hit some new service.

Or maybe instead of reading from this table in BigQuery, we're going to go read from a different table in BigQuery. So pretty quickly, not just at the UI, like at the React, you know, the buttons layer, but at your API, at your backend layer, you have to start integrating this idea that your feature flags are also going to make changes there.

So taking a step further, almost all of our products are pretty heavily data driven. So we have a lot of backend pipelines in Airflow and in Spark and in a variety of other systems; they're all building this data, doing different ETLs, transformations, modeling, pushing data somewhere, where these real time web app type systems then show it to our customers.

Very quickly, as we start thinking about, well, let's use feature flags to roll out these EAPs, our data pipelines themselves might be conditioned on EAP membership. So as an example here, we were rolling out a new pipeline: for any customers that are in this program, we're going to calculate, based on traffic on their website, what companies were on their website in the last 30 days, and then we'll build this dynamic audience. So daily we can go and update it and say, oh, here are the companies that were on your website in the last 30 days, here are the ones whose engagement is now trending, here are the ones who are particularly interested in the things that you're selling, and things like that.

We had a few things here. Well, we weren't quite sure yet about the architecture. We weren't quite sure yet about the business value. And so we wanted to build out this Airflow pipeline that's going to run nightly; it's going to generate all these data sets, do a bit of munging and some queries and such, and then call some APIs to build the audiences. And we didn't want to just turn this on full scale for our entire customer base in one fell swoop, because we weren't sure: is it valuable? Is it going to work? Is it going to scale? Is it going to break things?

Our Airflow pipeline actually goes and calls out to LaunchDarkly, in this case the HTTP API, and we get back, oh, tell me for this particular feature flag the configuration of which customer accounts you have it turned on for. And then it's going to use that to generate the next steps of the DAG: for each customer account it's turned on for, go and run this whole series of steps to build all this data.
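The production orchestration here is an Airflow DAG, but the pattern itself is simple; here is a minimal sketch of it in Scala, with the HTTP call to LaunchDarkly and the per-account work stubbed out behind hypothetical functions.

```scala
import scala.concurrent.{ExecutionContext, Future}

// `fetchEnabledAccounts` stands in for the HTTP call to LaunchDarkly that returns which
// customer accounts a flag is turned on for; in the real system the per-account work is
// a series of Airflow tasks rather than a single function.
class AudiencePipeline(fetchEnabledAccounts: String => Future[Seq[String]])
                      (implicit ec: ExecutionContext) {

  // One unit of work per enabled customer account: pull 30 days of site traffic,
  // build the dynamic audience, push it downstream.
  def buildAudience(accountId: String): Future[Unit] = Future {
    println(s"building 30-day visitor audience for account $accountId")
  }

  // Fan out over exactly the accounts the feature flag is enabled for.
  def run(flagKey: String): Future[Seq[Unit]] =
    fetchEnabledAccounts(flagKey).flatMap { accounts =>
      Future.sequence(accounts.map(buildAudience))
    }
}
```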

The upside there is huge. We could roll this out for one customer: us, Demandbase. And then we could start turning it on for more and more and see, oh, is there something going on with this query that we need to change? Is there some way that we could pre-aggregate some of this data and then change the way the DAG is structured?

So yeah, actually having feature flags not just in the UI, not just in your backend, but in all the data pipelines that are building the data that your backend is then going to serve, this is something that we found we needed. So finally, workflow orchestration changes, and in particular this has a lot to do with avoiding tech debt for me. As an example I ran into maybe six months ago: we had a series of Akka actors and there's a controller, right? A web request comes in and says, oh, go and process this audience, and that's going to go kick off a bunch of different actors.

So one's going to check to see, do I have the audience cached? If not, let me go spin up a cluster in EMR. Let me run a Spark job. Let me wait for it to finish. Okay, once it's finished, I'll get the data. Let me process it. Let me then make a bunch of different other calls to get other data, do some aggregation, make sure it's all good, and then save it to Postgres and Redis and whatever other real time data stores our web apps can read from.

We wanted to roll out a new change in this whole pipeline and say, instead of aggregating some of this data ourselves in Akka, we're going to change some of these data pipelines. We'll write to a different table up in BigQuery and the data will be pre-aggregated. As we're thinking, well, what does this look like? What do we need to change on this backend API, on the Akka side of things? We realized pretty quickly, well, we don't want to map this onto this existing workflow. We don't want to just add a bunch of if/else conditionals and still keep sort of the building blocks of this whole thing.

This is something completely new, and more importantly, we realized once this is successful, we're never going to flip back to version one. This is not the case of, like, oh, here's some behavior that some people will see and some people might not. This is not the case of, we want to use a feature flag to allow us to turn something off, so that when we're doing some sort of database maintenance we turn off the ability to create new objects or whatever.

This is a case where we're going to migrate from a V1 flow to a V2 flow. My experience here, and what worked out really well, was to figure out as early as possible, is this customer, is this user, in the V1 or the V2 flow, and then have a whole new series of Akka actors for the V2 flow. This did a couple of things. One, when we're testing out how this performs under load, we don't have to worry at all that the old code is somehow influencing the new code, or even vice versa, that our experimental code is somehow influencing these old code paths.
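A minimal sketch of that idea: fork on the flag once, at the controller boundary, and route to entirely separate actor trees. The actors, messages, and flag name below are hypothetical, not the real Demandbase code.

```scala
import akka.actor.{Actor, ActorSystem, Props}

// Hypothetical message and actors, just to show forking as early as possible.
final case class ProcessAudience(accountId: String)

class V1AudienceFlow extends Actor {
  def receive: Receive = {
    case ProcessAudience(id) => println(s"v1: aggregate in Akka for $id")
  }
}

class V2AudienceFlow extends Actor {
  def receive: Receive = {
    case ProcessAudience(id) => println(s"v2: read pre-aggregated BigQuery data for $id")
  }
}

class AudienceController(flags: Map[String, Boolean], system: ActorSystem) {
  private val v1 = system.actorOf(Props[V1AudienceFlow](), "v1-audience-flow")
  private val v2 = system.actorOf(Props[V2AudienceFlow](), "v2-audience-flow")

  // Decide once, at the edge, which flow handles the request. The two actor trees never
  // share code paths, so the losing one can later be deleted wholesale.
  def handle(accountId: String): Unit = {
    val target = if (flags.getOrElse("v2-audience-flow", false)) v2 else v1
    target ! ProcessAudience(accountId)
  }
}
```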

And then two, once we eventually rolled everybody out to it, so we slowly turned more and more traffic over to this new series of Akka actors, and once we were at 100% there, well, cleanup was so easy and so fulfilling, you know. There was just this whole series of six or seven Akka actors that were completely deleted from Git. We no longer needed them. Everything was going to this V2 flow, and we didn't have to go back in there and figure out, oh, where's this conditional, and let me look to see which one I actually want. Which is the new behavior? Which is the old?

So yeah, my sort of note on workflow orchestration changes is understanding: are you moving to a new behavior, in which case it might make sense to kind of keep things as separate as possible, or are you augmenting existing behavior, where you're always going to want that switch so you can flip back if need be?

All right, so I know people are thinking, what the hell, man? This talk said machine learning, that's basically artificial intelligence, where is the AI? It is coming right now. First, a little bit about our machine learning systems at Demandbase. The usual customer, the end user of our product, is non-technical. We do work with data science teams and BI teams and so on at Google and at Salesforce and other companies that have lots of people in those roles. But most of our customers and most of the users of our platform are marketing folks, sales folks; they're non-technical.

This means two things. One, me showing them F1 scores and precision/recall means absolutely nothing. I'm not going to convince a marketing person that, hey, you should trust this because look at this F1 score, it's so good. The second thing, though, is a lot of these people have been burned by kind of overpromises in the machine learning and AI space. And so the idea of black boxes really, really, rightfully so I think, scares a lot of people, especially non-technical people.

So for everything that we built, we wanted to make sure that customers could really understand: what are my inputs to this system? What are they doing? What else is the system doing? And then they can kind of see, you know, I can go and change my inputs, rerun the models, and go and look at, well, what did that do? What did that change? And then the system will give information back to them about what was important here.

Say you have this 30 day audience thing. We're saying, oh, hey Adobe, people from GitHub were on your website recently, and why is that particularly important? What about GitHub makes them a good candidate for you to do marketing to, or whatever it might be. Most of our machine learning systems in production are in Spark. A lot of our R&D takes place in Python and Python notebooks and in memory and so on, but when we hit sort of the production level scale, it's mostly Spark. We have a mix of scheduled batch jobs, so things that run nightly or every hour or every week. We have a lot of API triggered jobs. So for the thing I'm going to talk about next, there's literally a button in a UI that a customer hits, and that will trigger some API calls that will go spin up a 128 node Spark cluster in EMR, run a job for 20 minutes, spin down the cluster, and then they get the results right there in the UI.

And then we do have a few Spark streaming jobs that I'll mention. And then finally, when we're testing things with these machine learning systems, how do we know what we're doing? How do we know what's good, what's bad, what's meaningful, what's not? For the most part, when we're talking about the R&D stage, where, oh, there's some new data set and we want to maybe build some features out of it to add into this existing model, then it's mostly just R&D and we're looking at notebooks. We're doing Spark notebooks, doing a lot of research. We're not yet necessarily looking at a lot of real customer data, in part because, well, we don't have real customer data down in dev and stage environments.

So a lot of anonymized data, a lot of Demandbase's own internal data. Once we've productionalized something a bit and we've got this system in dev and in staging, well, there our testing is more about: does it do what it's supposed to do? Does the job run? What happens if you lose half of the nodes? What happens if this downstream service is down? What happens if 20 jobs try to run at the same time? Does Amazon rate limit you? Things like that.

So it's more about computational performance, what the memory characteristics are, what the CPU characteristics are, and less about the quality of the actual recommendations this thing is making. Of course, we're still looking at statistical measures at this point in dev and staging, but it's really when we get to production that we can start talking about quality.

It's then that we can start looking at, well, what do domain experts think? What do our internal marketing staff think about this version of ranking versus that version of ranking? And we can give them some explanation: oh, here's how your inputs have changed it. But at the end of the day, getting things into production and being able to run multiple jobs, say with the same sort of inputs but changing maybe the features that are going into the model, it's huge. And it's the best way that we have to test a lot of this stuff.

All right, so we're going to go all in with feature flagging actually in machine learning, in Spark jobs. So what do we need to build? The first, the easiest thing, is the same sort of thing that you build in your backend, the same sort of thing you build in your UI: just plumbing changes. You can think of these as conditionals. Like, if I'm in this experimental state where I'm going to use a new version of this feature, instead of reading from this one place on S3, go read from this other place. So your sources might change.

Same thing, your sinks might change. Instead of writing this data frame out here, well, I've got a slightly different schema because I'm in this experimental group, and I'm going to write it out to this other place. It's pretty straightforward. It's the same as in a controller where you might say, oh, if you're in this experiment, go hit this service, otherwise hit that one.
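Here is a minimal sketch of what those plumbing flags can look like inside a Spark job; the flag names and S3 paths are made up, and the transformation step in the middle is elided.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Same job either way; only the source and sink switch on the flags passed in.
object PlumbingExample {
  def run(spark: SparkSession, flags: Map[String, Boolean]): Unit = {
    // Source changes: read the experimental dataset if the flag is on.
    val input: DataFrame =
      if (flags.getOrElse("use-new-company-corpus", false))
        spark.read.parquet("s3://example-bucket/company-corpus-v2/")
      else
        spark.read.parquet("s3://example-bucket/company-corpus-v1/")

    val scored = input // ... feature transformations and scoring would happen here ...

    // Sink changes: the experimental flow writes its slightly different schema elsewhere.
    if (flags.getOrElse("write-to-v2-table", false))
      scored.write.mode("overwrite").parquet("s3://example-bucket/scores-v2/")
    else
      scored.write.mode("overwrite").parquet("s3://example-bucket/scores-v1/")
  }
}
```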

Then, just in general, feature flag awareness. We had to come up with: how are our Spark jobs going to know what the feature flags are, what they mean, where to get them from, how to know what the user and account and all that sort of stuff are? So a bit of architecture on that. And then the most interesting and most frightening part of all this was, well, let's actually use feature flags to do things like change the way we're transforming features, to change the algorithms that we're using in production and that are feeding into our models in real time.

Some more on that one in a second.

So first, the feature flag awareness part. For our initial pass at this, we were thinking, so for anyone who doesn't know how Spark runs, it's a series of nodes, and in our case we're running on EMR, so you're running a YARN cluster. You can think of a bunch of different containers, and then there's a driver that's going to kick off, oh, here's a stage that breaks into a bunch of tasks, and you have a bunch of workers, executors, that can go and run those little bits of tasks.

The Spark executor at the end of the day is Scala running on the JVM. It can make an HTTP call. So first we thought, well, let's just take this client that we already have in these Scala APIs, kind of shove it into our Spark code, and let the executors handle it: when one needs some feature flag state, it will go call LaunchDarkly and get it.

It didn't work at all. At first we horribly spammed LaunchDarkly, when we didn't realize, oh wait, every executor is going to make this call. So if you imagine multiple clusters with hundreds of nodes running, and thousands of tasks, and every single task is trying to do an HTTP call to LaunchDarkly, that caused some problems. We fixed that; we said, oh, let's move it up the chain, so at the beginning of the Spark job we can get that state.

It still didn't work. It turned out, when you're running Spark and Hadoop and so on in production, hopefully you're running it in a very, very locked down network environment that cannot talk to anything it shouldn't be able to, and nothing can talk to it. It turned out our ops team was better than we anticipated, and they had kind of proactively blocked these Spark jobs from calling out to anything except things that were whitelisted.

Even once we fixed this condition of, oh, now let's not call LaunchDarkly tens of thousands of times, we still couldn't even get out to LaunchDarkly. So we had a little conversation with the ops team of, well, should we open this up or should we keep it closed? And we decided, well, these jobs in particular are already being kicked off by an API. That API already has all of this LaunchDarkly client code that goes and calls LaunchDarkly and builds up this map of string to boolean of what the feature flag state is. So let's just fetch it in the API, and then when we go to kick off Spark, when we do the spark-submit call, we can pass the flags in as arguments.
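Here is a minimal sketch of that API-side plumbing, using the LaunchDarkly server-side Java SDK's LDUser and boolVariation calls (newer SDK versions use LDContext instead); the attribute names and flag keys are illustrative, not the real Demandbase ones.

```scala
import com.launchdarkly.sdk.LDUser
import com.launchdarkly.sdk.server.LDClient

// Sketch: the API that launches the EMR job evaluates the flags this workflow cares
// about and turns them into plain spark-submit arguments.
object SparkSubmitFlags {
  def flagArgs(client: LDClient, userId: String, accountId: String,
               flagKeys: Seq[String]): Seq[String] = {
    // The "user" LaunchDarkly sees is the user plus the customer account being managed.
    val user = new LDUser.Builder(userId).custom("accountId", accountId).build()

    // One "--flag value" pair per flag, appended to the spark-submit invocation.
    flagKeys.flatMap { key =>
      Seq(s"--$key", client.boolVariation(key, user, false).toString)
    }
  }
}
```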

A huge bonus here was that this enabled the data scientists on my team to test things without having to actually go into the product and get the right sort of context going. So imagine we're doing some experiment on a feature transformation, and only people under the Demandbase account and maybe the Adobe account should be able to see this new feature.

If the Spark job was the one fetching the feature flag state, then my data scientist has to actually go get the right state set up from the point of view of the application, actually in the platform, go kick off the job, which will make the API call, which will spin up the cluster, et cetera.

If we pass the feature flags to spark-submit directly, well, then my data scientist can just set them himself. He doesn't have to worry about going into LaunchDarkly or going into our platform; he can just say, okay, I'm going to run this job, and I want to change the value of this one particular feature. And the upside there is you're no longer building multiple jars. You no longer have to rebuild your job and say, okay, I'm going to test this one and then test that one, and then you're never sure, well, did I actually kick it off with the right one the second time or not? My results don't quite make sense.

So in this case you build one jar, we have one Spark job, and then data science can just go kick it off through whatever tool they use to manually do this, multiple times, and set these feature flag values. So say, let me do a run where this feature transformation is the new version, let me do a run where the feature transformation is the old version, and then we can go and compare the results. So even pre-production it's super valuable there. So, a couple of things that we learned from this, simply enough, and I'm going to talk a bit about this at the end. At first we were taking all of our feature flags and sending them to every Spark job. Well, we've got dozens, maybe close to 50, feature flags active at any time, and maybe three of them matter for this particular workflow. The rest of them we don't care about, and they really pollute your logs if you've got these big JSON blobs with this list of arguments going in.

So we did a bit of cleanup and a bit of standardization, where you'd say, "Oh, I have prefixes in the feature flag names." And then when the API is going to kick off the Spark job, it has configured, "Oh, I only actually care about feature flags that have this prefix, so let me filter and then I'll just pass the ones I care about." On that note as well, initially we were sending this as a big JSON blob; you can imagine it's essentially a map of string to boolean, a feature flag is on or off. Makes sense, you can just turn it into a hash and serialize it as JSON. But it was messier than we liked, and in particular for the manual data science experimentation thing, where my data scientist wants to go and change something, it's a lot easier to mess up adding in a giant JSON blob, especially a one-line giant JSON blob, than it is to have just normal arguments.

So I can show you an example here. This is from our dev environment, an EMR job that got kicked off from this API, and the thing that I've got highlighted in red there is where we have these simple arguments passed in. So the API has done this call, actually I don't know if this one was, I think... but the idea is it was the API and not a manual thing. So the API calls out to LaunchDarkly: oh, let me get all the feature flags that start with "sales account selection", let me build up these arguments, and send them off to spark-submit. And then the rest of the Spark job knows, oh, this particular flag, "no JSON save", is true, and "no roles feature" is false, and whatever else we've got there.
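On the Spark side, turning those plain arguments back into a flag map is only a few lines; here is a sketch, with made-up flag names.

```scala
// Turn "--flag value" pairs from spark-submit back into a Map[String, Boolean].
object FlagArgs {
  def parse(args: Array[String]): Map[String, Boolean] =
    args.sliding(2, 2).collect {
      case Array(key, value) if key.startsWith("--") =>
        key.stripPrefix("--") -> value.toBoolean
    }.toMap
}

// e.g. FlagArgs.parse(Array("--no-json-save", "true", "--no-roles-feature", "false"))
//      == Map("no-json-save" -> true, "no-roles-feature" -> false)
```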

All right, so we've got the plumbing hooked up, we've got the Spark jobs feature flag aware, let's actually do something cool with this, right? In the past year or so, maybe nine months, we've been dealing with a couple of things, so larger and larger customers that want to have larger and larger profiles. So one of the things that a customer will fill out is, for this business unit or for this particular vertical, here are job titles of people that would use it, here are keywords they should be looking up on the internet, and then here are positive examples of existing customers we have for this particular product line. When we were a much smaller company, that was maybe 50 customers, 100 customers, something like that.

Well, we started doing these partnerships with Google and with Salesforce and with larger companies, and now they want to say, "Oh, I want to run this model, but I want to give you 2,000 or 5,000 customers." So orders of magnitude larger than what the system previously handled. At the same time, on our backend data set, we have a platform data team who were working on building out this corpus of all companies in the world, and their subsidiary relationships and all that stuff. It's the same thing: we were facing an order of magnitude jump from a few million companies to 30, 40, 50 million companies that we want to be able to rank and score under the same time constraints, with the same basic architecture.

So there were some transformations going on here that we had to test. Like, how can we make this work with these new, much larger data sets? We need to do some sampling, we need different sorts of distributed algorithms, we can't broadcast certain large sets out because it will overflow memory. We have to be careful about the way that we're doing joins in Spark, because with these much larger data sets you can get skew and it causes problems. So we had this problem where we were doing a cosine distance calculation. You can imagine a bunch of numbers and a bunch of other numbers, and you want to see how similar these two vectors are in multi-dimensional space.

And this in particular fell to its knees when we turned up the total data volume tenfold and turned up the positive examples volume tenfold. So the data scientists on my team and I built out, okay, here's a new version of this feature transformation; it should be functionally the same, the unit tests look good, but it's really scary. We're not just changing conditional if/else logic, like go call this endpoint or call that endpoint; we're actually saying, based on this feature flag, run a completely different algorithm to build this feature, which then feeds into this pipeline, which feeds into this random forest model.

And we also need to be able to support, really easily, switching back and forth, getting the data from both, so then we can compare and see how it works out. So another thing that we're doing in production today is adding and removing features. We're never quite sure: is there some value in this other data that we're not including, or is this data redundant and we really don't need to include it at all? So with these feature flags we can do a lot of analysis of feature importance in production, turning things on and off and seeing what the impact is. I lied actually, there is one more piece of code. This is from a Spark pipeline, and you can see at the beginning here we're calculating some features, these cosine similarity features, and we're using feature flags to say, "Hey, should we add this one or not?"

And so on this line, there used to actually be four in there, and we ran this experiment for a while and we realized, "Oh, we can turn off this other one, it's pretty much duplicated by this bio feature." And so we're still able to do this now in production: I can go turn off one of these feature flags and run this model, but say, "Let me not include these particular features in the pipeline, in the model," and do analysis like that, live in production.
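The slide with the actual pipeline code isn't reproduced in this transcript, but the pattern described looks roughly like the following sketch: flag-gated feature columns feeding a Spark ML pipeline. The column names, flag keys, and the use of a random forest classifier stage are illustrative assumptions, not the real Demandbase code.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler

// Feature flags decide which candidate feature columns make it into the model,
// so features can be added or removed in production without rebuilding the jar.
object FlaggedFeatures {
  def buildPipeline(flags: Map[String, Boolean]): Pipeline = {
    // Base feature columns that are always included.
    val base = Seq("keywordSimilarity", "jobTitleSimilarity")

    // Flag-gated columns: turn a candidate feature on or off per run.
    val optional = Seq(
      "bioSimilarity"     -> flags.getOrElse("use-bio-similarity", true),
      "websiteSimilarity" -> flags.getOrElse("use-website-similarity", false)
    ).collect { case (col, true) => col }

    val assembler = new VectorAssembler()
      .setInputCols((base ++ optional).toArray)
      .setOutputCol("features")

    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")

    new Pipeline().setStages(Array(assembler, rf))
  }
}
```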

All right, so then finally, that's changing a lot of things on the Spark side: actual feature transformations, which features you include, how you run your ML pipelines. On the infrastructure side of things, LaunchDarkly and feature flagging are huge for us too. So as an example, previously we were running all these workflows through Qubole, a managed Spark/Hadoop/Hive provider, like Hive as a service, whatever. And over time we realized, well, these workflows don't make sense for Qubole; it's very set in stone what we're doing. We don't need to run notebooks on top of it, we don't need Hive, and it would be a lot cheaper if we just ran it on EMR and used spot instances to run the whole thing.

So we built out this code, and that was part of that V1 to V2 migration I was talking about, and we got to a state where, okay, we're in production, we've got our test jobs, so QA and engineering are running jobs in EMR instead of in Qubole, and everything's looking good. We start turning on a few early adopter customers; I think we got to about 10% of our customer base running the same job in Elastic MapReduce instead of through Qubole. And then in November sometime, we had this catastrophic issue, partially our fault, partially Qubole's fault, maybe partially EC2's fault, but I don't think that we can claim that.

We were having capacity issues, jobs were failing in production, and Qubole couldn't figure out why their spot instance bidding was not giving us the instances we wanted. We couldn't change things fast enough, and so my boss, our CTO, came over to my team's area and was really upset: what are we going to do about this, things have been down for three, four hours now. And so, yes, what's going on with EMR? I know you guys have been testing it. So I showed him the data, I showed him the Grafana dashboards and the amount of traffic we'd been rolling over, and we made the decision then and there.

Okay, let's flip this over, 100%. We'll go to EMR. This was early December, end of November this past year, 2018. We did it because there was an emergency going on, and we have never flipped it back. We have stayed on EMR 100% since then. And what I attribute that to is not that I'm an amazing manager and that my team are amazing engineers; what I attribute it to is that we had a good two, three month period where we were slowly rolling this workflow out into production, starting to put load on it, starting to understand: is the machine learning working in the same way? How is EMR different than Qubole? Where are the error conditions, where do we have to deal with resiliency?

And because we were slowly rolling out to production, when we hit that point of, let's turn it on to 100%, there was no new code going out; it was code that had been running, and that's huge. All right, so that's about it from me, just some final thoughts, some little things, right? Staying organized in the way that you roll out feature flagging, especially if you're going to move beyond just the UI layer into your data pipelines, into your machine learning workflows, into your backends. So things like naming conventions, things like having a standard library. Every Scala app you have has a library it can import that's going to build up this custom user context that you're going to send off to LaunchDarkly. Every JavaScript app has a standard library that it's going to use to build up your custom user context, do any logging, stuff like that.
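As an idea of what that kind of shared helper might look like on the Scala side, here is a sketch built on the LaunchDarkly server-side Java SDK's LDUser builder (newer SDK versions use LDContext); the attribute names and the team-prefix naming convention are illustrative.

```scala
import com.launchdarkly.sdk.LDUser

object LDContextSupport {
  // Team-prefixed flag names make ownership obvious when you stumble on a flag in code,
  // e.g. flagKey("ml", "sales-account-selection-no-json-save").
  def flagKey(team: String, feature: String): String = s"$team-$feature"

  // Build the user the same way everywhere: the individual user, the customer account
  // they are currently managing, and whether they're internal staff.
  def buildUser(userId: String, customerAccountId: String, internalStaff: Boolean): LDUser =
    new LDUser.Builder(userId)
      .custom("accountId", customerAccountId)
      .custom("internalStaff", internalStaff)
      .build()
}
```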

You want it to be dead simple for any dev anywhere to be able to use this, but you also want it to be simple for them to understand, when they're looking at some code and they see a feature flag, just based on the name: what team owns this, what's the scope here, is this touching multiple layers of the stack? So things like naming conventions, organization, all that, are really helpful there. Related to this, learn to clean up after yourself, and a lot of that has to do with understanding: am I using a feature flag to roll out some new functionality, where at the end of the day, assuming everything's good, I'm going to move everybody to this new functionality? Or am I using a feature flag to roll out something that I may want to be able to turn off and on, depending on different conditions?

In the case of "this is new functionality," try to structure things so that as early on as possible you get that feature flag state and you fork the behavior; that makes it a lot easier to clean up after yourself. At the end of the day it's work, it's tech debt. We have found it helpful to have regular monthly product meetings where you go look in the LaunchDarkly UI and see whether there's anything that has the same variation for everybody, and if so, who's the dev manager, and in the next month either you're going to remove this or you're going to give us a good reason why you're not going to remove it.

And then just finally, this idea that feature flagging goes so far beyond buttons in the UI. My key takeaway has been: we've used this to drive early adopter programs; we're in production doing machine learning experimentation; we're trying out new systems at scale; and we're doing it in a way that's really safe, where our customers don't even notice. But then when we hit these emergencies, Qubole goes down and we can't run jobs in production, well, we've already been running at 10% and we can flip over and everything works magically. So the paradigm shift that I had seven years back at Change.org, about, well, let's use experimentation and feature flagging to change what the user sees, that can flow through your entire architecture, and it's really cool when it does.

So that's all I've got. Thanks to Demandbase for funding most of this work, and to Change.org; I reached out to them to make sure it was okay that I talked about old things there, and they were happy to support it. Both companies are hiring: if you're interested in data science in a heavily B2B space, come talk to me; for Change.org, go look on their website. And that is all I've got.

Thank you all.