
No Fear: Scaling the Xero Developer Platform

Josh Barr Xero

Xero has been on a journey from scary and risky deployments with a home-built feature management tool to achieving high confidence and continuous delivery at scale with a global engineering team. LaunchDarkly has played a crucial role in the company's transformation. This is their story.


Josh Barr

Josh is the Lead Portfolio Architect for Ecosystem at Xero. A qualified graphic designer by trade, he made the switch nine years ago to pursue a technology career. Prior to joining Xero, Josh was a maintainer of the Wagtail CMS, and Technical Director at Springload (a digital agency based in Wellington, New Zealand). He's worked as an educator, designer, developer, team leader, and architect, with financial institutions, airlines, startups, and civics organizations.

(upbeat music) - Hey, everyone, I'm Josh from Xero. And I'm joining you today from my place here in Wellington, New Zealand. It's great to be with you. What I'd like to do today is take you on a journey. And this is Xero's story of how we've evolved feature management over the years. What I'd like to do is share some things that have worked really well for us, and also share some of the things that we've got a little bit wrong, that you might have an opportunity to get right in your journey with feature management. And I've called this talk, no fear, because it's really the story of how we went from dreading production releases to being much more comfortable at making bold changes to our architecture at scale. So let's dive in. Xero builds beautiful software for small businesses. And we help small businesses and accountants and advisors all over the world to thrive. We've got about 3000 staff worldwide, and 1000 people working in our product team. So you can imagine, on any given day, we're shipping a lot of software. And our story today actually starts way back in 2015. 

Back then, Xero was a software suite that took a long time to release. So maybe four or five hours to release sometimes. And so the way we would manage this process is we had a giant release calendar, something a little bit like this. And a team would come along, and they would decide where in the week they wanted to book their release slot. And when you've got lots of teams working on a big software product, you can imagine this calendar got pretty congested. And if you were halfway through a release and your time window was running out, well, then what you'd have to do is vacate the pipeline so that the next team could come in and have their turn. One of our data science teams was working on a brand new feature. And they were staring at the prospect of having to wait up to two weeks to be able to find a slot in the calendar to be able to deploy their new feature. And they realized this is not good software engineering. And they realized that they needed to do something else entirely. What they decided to do was decouple their deployment events from their release events. And they created Xero's first feature management API. The way this would work is the web application would call out across the network to Xero's feature API, and it would ask it, "Hey, can the user see this feature?" Yes or no.

And the feature service would respond back to the web application. So that's a nice and simple paradigm. You can kind of understand that working. And it was really good for us at first because that web application took four hours to release and Xero's feature system only took 25 minutes to release. So this gave us a real boost in our productivity. And it got very popular. So all of the applications that Xero looks after, in our stack, well, they all started to use this feature management API to do feature toggling at scale. So this feature service ended up under really, really high load. And there were two really interesting things that happened as a result of this. Well, number one, the uptime and the availability of all our applications was really tightly coupled to the uptime of this feature system. If something went wrong, everyone felt that pain. And the second thing that happened was, well, this team now had two full-time jobs. They had to do their ordinary data science work, and they also had to manage and maintain this feature management system. And so we ended up with an overworked team, and an overworked system. And you all know how that's gonna end, right? Like we ended up with this kind of load and random spikes in traffic happening on our system. So I pulled some of these stats from 2015. And we can see that it wasn't uncommon for the feature system to take over 20 seconds to respond in some cases. And that's gonna be a user that's not seeing the right experience.
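As an illustration of the coupling described here, the sketch below models a web application asking a remote feature service whether a user can see a feature. The talk does not say how Xero's callers handled slow responses, so the timeout and the safe default are assumptions, and every name (`can_see_feature`, `fast_lookup`, `slow_lookup`, the flag key) is hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Shared worker pool so a slow lookup cannot hold the caller past its deadline.
_pool = ThreadPoolExecutor(max_workers=4)


def can_see_feature(lookup, user_id, feature, timeout_s=0.1, default=False):
    """Ask a feature service whether `user_id` can see `feature`.

    `lookup` stands in for the network call to the feature API.
    If it is slow or raises, return `default` instead of waiting.
    """
    future = _pool.submit(lookup, user_id, feature)
    try:
        return bool(future.result(timeout=timeout_s))
    except Exception:
        # Covers both a timeout and any error raised by the lookup itself.
        return default


def fast_lookup(user_id, feature):
    # Simulates a healthy feature service that answers immediately.
    return True


def slow_lookup(user_id, feature):
    # Simulates an overloaded feature service (scaled down from seconds).
    time.sleep(0.5)
    return True


if __name__ == "__main__":
    print(can_see_feature(fast_lookup, "user-1", "new-dashboard"))
    print(can_see_feature(slow_lookup, "user-1", "new-dashboard"))
```

Without a deadline like this, every request inherits the feature service's worst-case latency, which is the tight coupling the talk describes. The trade-off is that a user may briefly see the default experience while the service is slow.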

What had started out with a lot of early enthusiasm for us, over time gave way to fear and to frustration. And it was fear, because this feature system had actually blown up a couple of times. We'd had some quite visible incidents that had kept the whole team up late at night. And it was frustration because, well, we had some staffing changes. The people working on this feature system initially, they moved on to other roles, or they left the company to go traveling overseas. Back when there were open borders, and that was a thing you could do. And so active development on the solution really stagnated. It really slowed down. And this is where my team enters the story. So I work in Xero's ecosystem team. And it's our mission to build this network of innovation on small business with Xero at the heart of it. We work with hundreds of partners all around the world and tens of thousands of developers. It's quite an exciting part of the business to work in. And one day, one of the engineers that I work with came to me with a problem. And he said, "Look, we've built this new version of our developer documentation, our developer website, and we can deploy this application in about 10 minutes. It's really quick. Now, I don't wanna wait 30 minutes for the Xero feature system to finally release the feature out to customers. That's really slowing me down. It's a real drag." And he said to me, "Look, is there any other way we can do this? Can we try something else? We need a better tool." And, you know, we'd really noticed that our solution hadn't kept pace with the innovation that was happening in the market. If you're building a modern serverless application that you can deploy in 30 seconds, you don't wanna be waiting 30 minutes for a feature change to roll out the door. So we knew that we needed to find a new way of doing things. And we had some things in mind as we were going on our search for a new feature management solution.

The first of them was this. The tools at work should be as good as the tools at home. And I first heard this in a talk from the UK Government Digital Service team, way, way back in 2014. And the premise of this phrase is that tools are getting better and better and better all the time. And if we want people to come and work on our problems, and not our competitors' problems or another team's problems or in a different industry vertical, then we really need to make sure that the tools that we offer people are keeping pace with what's best practice out in industry. And if you think of all the great stuff you can use as an individual developer out there on the internet, things like Netlify or Next.js or GitHub Actions, people expect a really high bar of developer tooling quality these days. So that's something for us to keep in mind if we're working in a big enterprise environment, where we have a lot more control over our toolset. The bar is pretty high. The second principle is this one. And this comes from Rich Archbold's fantastic article with the same name, Run Less Software. The premise of run less software is that time is critically short. And that in a SaaS business, you've only really got time to be focused on your unique customer edge. You can't spend all this time building out tools and all the supporting infrastructure that you need. It's much, much better to go and rent those things from the market. Because that's gonna let you work at a higher level of abstraction, much higher level of abstraction. You know, we wanna be in the business of building the house rather than having to build the hammer and the nail and the brick as well.

It's really interesting because down here in New Zealand, we're like a couple of thousand kilometers away from, well, anyone and anything. And down here self-reliance and doing things yourself is almost seen as a virtue, like it's so ingrained in society. And so this idea of renting something that someone else has built, well, that kind of goes against the way things are done down here. And so you might have to go on a little internal engagement campaign. And this is exactly what I had to do. I wrote this blog post for our internal engineering blog, and I gave it a deliberately clickbait title.

It was called, 'Your credit card is the best developer tool that you're not using.' And the whole point of this was just to summarize the conversation in run less software, and really start to ask the question, how could we use rental as a software delivery strategy? And I think it was really successful at starting that conversation. The third principle is this one. To make big changes, you've got to start small. Now you're gonna have some senior stakeholders in your organization, and they will have been around a while and you know, they would have seen some things. They will probably have had a technology decision backfire on them really publicly, you know, and really expensively, and so they're gonna be naturally really reluctant when you come in and say, "Hey, we should replace that thing that you've just spent all this time and money building with an off the shelf solution." And they're gonna have natural resistance to that kind of advice. So you really have to do your homework and find out if a new solution is actually gonna be better for your environment.

This trainer talks about this interesting approach to minimum viable product. And he uses the example of cake and I mean, who doesn't like cake, right? And he says, if you were gonna get a wedding cake made, what you'd first do is, well, you'd go to a bakery and you'd try their cupcake. And if you liked the cupcake, well, then you might take things a little bit further and invest in that larger cake. And this is exactly the approach we took when we reached out to LaunchDarkly, and started working on a proof of concept in Xero. So the first thing that we put LaunchDarkly to work on in Xero was powering our single sign on solution. Sign on with Xero. And this was a real headline feature for us last year in 2019. But it was also a brand new stack. So there was no traffic going through this yet and we hadn't exposed it to anyone. So it's very low risk for us to try things out in this way. And we'd also lined up some third party security testers, you know, some professional hackers to come and have a crack at breaking the software, really throwing the book at it. So this was a great opportunity for us to validate LaunchDarkly, you know, that we were gonna be dealing with the right kind of security concerns and that we were gonna get a clean bill of health from our friendly neighborhood professional hackers. So that went really well, that was really good. We did some load testing, we figured out that this was gonna run at production scale for us. And then we were able to use the tool to onboard one or two test customers to try out this new feature. This was very successful for us.

The next thing we did, if we think back to that developer experience slide, is we were really focused on trying to create a great developer experience for our internal engineers, especially when it came to on call. Now, if you've ever managed an API platform before, you'll know that you really wanna have lots of switches and levers that you can pull to be able to shed load really quickly when things go bad. You know, it only takes one really badly behaved integration to cause real headaches for you. And so we use LaunchDarkly's kill switch capability to give us ways to quickly switch off misbehaving integrations. And we can see in this example here, Brian, who's one of our awesome quality engineers, has been woken up probably by an alarm at 11 o'clock at night. And he's jumped in, and he's been able to identify the misbehaving app, drop their ID into LaunchDarkly and solve that problem really quickly. And because of that quick feature turnaround, like under one second, we're able to recover from that potentially very disastrous incident very, very quickly. And that's fantastic. That's a DevOps superpower. The third thing that we did is we really wanted to validate some of the claims LaunchDarkly made about performance. Now, remember, we've been really burned by this in the past with our home built system. And so one of our teams who looks after a pretty important internal API decided to put all sorts of timers into the application so they could measure how quickly LaunchDarkly was performing for us. And this was great because we learned that the average feature evaluation time was like measured in microseconds. And the slowest ever response time that we saw was still under one second. And that's such a far cry from the world that we did have before with our previous system. So we took all this information, we put it in a big business case to the stakeholders to really make the case for going further with this LaunchDarkly relationship. And we made sure to include a few key takeouts from developers, things like, I never wanna use the old feature system again, please let me use a great tool like this. So that really helped us to get this business case across the line.
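The kill switch and the evaluation timers described above can be sketched roughly as follows. This is a minimal in-memory stand-in, not LaunchDarkly's SDK or Xero's code: the `KillSwitch` class, the `handle_request` function, the app IDs, and the choice of HTTP 429 for a blocked integration are all assumptions made for illustration.

```python
import time


class KillSwitch:
    """In-memory stand-in for a flag whose targeting list holds blocked app IDs."""

    def __init__(self):
        self._blocked = set()

    def block(self, app_id):
        # What the on-call engineer does: drop the misbehaving app's ID in.
        self._blocked.add(app_id)

    def unblock(self, app_id):
        self._blocked.discard(app_id)

    def is_blocked(self, app_id):
        return app_id in self._blocked


def handle_request(switch, app_id):
    """Return (status_code, evaluation_time_in_microseconds) for one API call."""
    start = time.perf_counter()
    blocked = switch.is_blocked(app_id)
    elapsed_us = (time.perf_counter() - start) * 1_000_000
    # 429 is an assumed response for a switched-off integration.
    status = 429 if blocked else 200
    return status, elapsed_us


if __name__ == "__main__":
    switch = KillSwitch()
    switch.block("app-123")
    for app in ("app-123", "app-456"):
        status, elapsed_us = handle_request(switch, app)
        print(f"{app}: status={status}, flag evaluated in {elapsed_us:.1f} microseconds")
```

Wrapping the flag check in a timer, as the team in the talk did, gives measured evaluation times to put in front of stakeholders. Because the check reads local state rather than making a network call per request, it stays off the request's critical path.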

I would love to have been able to come here today and say that we've managed to complete migration, we've been able to get rid of our old bad feature system entirely and move to this great new world. But everyone knows in an enterprise environment, you know, that's just not gonna happen. You've got far too many teams and far too many roadmaps that would be disrupted by that kind of migration. So we haven't completely escaped the old world yet. But I think we are starting to reach an escape velocity. We are moving away from it. And so I pulled out some statistics which might be interesting to you. So let's take a look at those. 

So what we can see is since 12 months ago, LaunchDarkly is now responsible for 85% of all feature management activity within Xero. The other thing that's really interesting about this number to me is that we're doing five times as much feature management stuff, so adding and removing feature flags or adding and removing users from feature flags, than we were with our old system in the previous year. So it's just testament to how having really good tools actually helps you to, you know, to adopt this engineering practice. And so if you're trying to do feature management, having a good feature management solution means you will do more of it. And this is what we've found. And last month, actually, we hit the tipping point with traffic. So we're now serving more feature evaluations through LaunchDarkly than through our legacy system. So we haven't completely escaped it yet, but I think we're well on the way. And what we wanna do is keep harnessing this momentum, to do even more amazing things with LaunchDarkly in the future. And this screenshot here is actually something that I took from a television advert that we ran in the UK last year. And it features a mechanic, changing out the tire on a car as the car rolls down the road at high speed on two wheels. And it's such a great analogy for what we're often doing in software. Right, as Freddie Mercury says, "The show must go on." You just can't stop when you're running a global platform. You have to keep the lights on, you have to keep things moving. But you still need to be able to change really critical parts of your system. And so what we've put LaunchDarkly to use on recently is a refactor of something pretty important to us. And that's our login system, like logging the customer in. So let's have a look at that.

As you saw back at the start, Xero has lots of applications. But let's just say for the sake of this demonstration that we've got three, 'cause that's what I can fit on the slide. And when a user goes and visits one of these applications, what happens is some authentication magic happens. And they get redirected out to our login system. They enter their username and password and go back to the application. I mean, you've all used a process like that before, a single sign on process. Now our login system is managed by an internal platform team. And they've built a new one. It's called identity. And it's a really nice implementation. It actually powers some of that single sign on stuff we were talking about before.

The challenge we have is how do we get traffic across from this old login system to this new identity system without stopping, without taking an outage? And even if we were to take each individual application and switch them over one at a time, well, we would still be incurring some sort of downtime on that application. So that isn't really a process we can use. What we need to do instead is find a way of gradually shifting traffic from old login to new identity. And so the platform team took a look at this. And they said, "Hey, what we can do is we can put LaunchDarkly into the little authentication library that each of our applications has." It's pretty common to most of our applications. And what this gives them the ability to do is then to manage that rollout on behalf of each product team. So we can start shifting traffic from our old system to our new system. And it's really inverted control. Because now this platform team can manage that rollout on behalf of each of the individual feature teams, which is really good. This is an absolute DevOps superpower for us. And if you think way back to the start where we had that giant release calendar, you know, this is just a whole world away now. Where we're able to make really bold changes to our platform at scale. And we can manage the whole thing through a single LaunchDarkly dashboard, which is really fantastic for us. I think, if I had proposed this use case 18 months ago, before we started this journey, I would have been laughed out of the room, right? It kind of sounds absurd to refactor your login system on the fly. But because we've built up confidence with the solution, and we've really road tested it on a number of use cases for us, now in 2020, we've got no fear with refactoring something like authentication using LaunchDarkly. Right, I'm pretty much out of time. So these are the key things that I'd love for you to take away from my talk today.
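One common way to shift traffic gradually, as described above, is a percentage rollout that assigns each user to a stable bucket. The sketch below shows that generic technique. It is not LaunchDarkly's actual bucketing algorithm or Xero's authentication library, and the function names, flag key, and backend labels are hypothetical.

```python
import hashlib


def bucket(user_id: str, flag_key: str) -> float:
    """Map a user to a stable value in the range [0, 100)."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    # Use the first 32 bits of the hash as a fraction of the full range.
    return int(digest[:8], 16) / 0x100000000 * 100


def login_backend(user_id: str, rollout_percent: float,
                  flag_key: str = "use-new-identity") -> str:
    """Decide which login system serves this user at the current rollout level."""
    if bucket(user_id, flag_key) < rollout_percent:
        return "identity"
    return "legacy-login"


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(1000)]
    for percent in (0, 10, 50, 100):
        moved = sum(login_backend(u, percent) == "identity" for u in users)
        print(f"rollout {percent}%: {moved} of {len(users)} users on identity")
```

Because the bucket depends only on the user and the flag key, a user who has moved to the new system stays there as the percentage rises. Keeping the decision inside a shared library is what lets one team change the percentage centrally instead of asking every product team to redeploy.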

Number one, invest in your developer experience. You know, tools are getting better and better all the time. So if you want people to work on your problems, make sure you give them good tools. You know, it's also about making engineering safer for your team. And that psychological safety is really important, I think, in retaining great talent. Number two, run less software. I mean, this is the whole moral of the story for us, right? Like we tried to build something ourselves, and we realized that at scale, this is actually very hard to do. So it's best left to the experts. Number three, to make big changes, you've got to start small. So find a willing and able team and run a pilot program in your organization. That's a great way to start showing the value of a solution like this. And while you're there, do your homework. You know, there's gonna be stakeholders in your organization who have reservations. So find out what those are, and make sure that you measure and get the testimonials and all the things that are gonna really make that an easier sell for them. And finally, number five, build momentum. If, like us, you're trying to run away from some horrible old legacy system, well, I think the key is to run faster. And just to show the organization all the amazing things that you can do with a new solution that you couldn't do before. Ultimately, I think this is how you're gonna win the hearts and minds. Momentum is really attractive. Well, that's all I've got time for today. Thank you so much for listening and enjoy the rest of Trajectory. (upbeat music)