Relying on End Users to Test Incremental Changes
In May we hosted our Test in Production Meetup at the Algolia offices in Paris, France. Aurélien Didier, Lead Back-end Engineer at Workwell.io, shared how his team gives their customers the tools they need to integrate their own services into the Workwell web and mobile apps, and how they manage testing across these services.
“We want to enable them to have a wider scope of configuration, so basically they can change more things in the application, and we want to give them the ability to have something that’s really specific to their company. So once they open the app, they feel like they’re using software that they built themselves.”
Watch Aurélien’s talk to learn how Workwell has automated these processes so they can ship multiple times a day. If you’re interested in joining us at a future Meetup, you can sign up here.
My name is Aurélien. I work for Workwell, which was previously known as Never Eat Alone. About a year back we changed the core of the product, so we decided to rebrand it, and actually rebrand the whole company. The first time we talked with the team about relying on end users to test everything in production, the first image that came to mind was this: you give all the tools to the customer and they destroy everything behind your back. I don't know if you've ever tried a saw like this, but it destroys everything.
So what is Workwell? Like I said, we're a company that was called Never Eat Alone before. It started as an app focused on getting people to talk to each other and have lunch and dinner together in big companies where they don't have many social interactions; sometimes they work five meters away from each other and never get to talk. The app was made to make people actually meet: if their backgrounds and interests fit, they can automatically be matched and share meetups, lunches, and things like that. But with Workwell, we aim to go even further: not only the social aspects, but everything around HR, all the services that are aggregated in a company for the employees. We want to gather everything into one simple app to make it easier for both the company and the end user.
So like I said, we try to centralize every corporate service into one single secured app. Basically, we can have HR tools, food delivery services, and document-sharing services that are sometimes home-made inside the company. We want to gather everything into one simple phone app, so it's very easy for the company to manage because there's one entry point, and really easy for the user because everything is in the same place, in the same format, with no need to remember this login and that password for one service, or that phone number for another.
And in the meantime, we want to build a platform for the services we integrate into our system. There are things we provide directly, and things that are provided by the client. Sometimes they build services in-house and we give them the tools to integrate them; if they have a web app, we give them the tools to integrate it into a mobile app.
And our goals are: for the employees, one single entry point and one single authentication method inside the application; and for the customers, the ones that are paying us, one single format of service. So it's a very centralized point of view. But to make it work that centralized, we actually distribute everything. Our back end, first: we built it with microservices, and I think right now we have maybe 16 of them. They are not that micro, but they're completely separate.
Then we have the integrated apps, what I called corporate services before. Some we provide ourselves: those are the first-party services. We have some second-party services: a customer has built something they use in-house and wants to integrate it inside the app, but maybe also share it with other customers, so we help them integrate those apps and propagate them to our customers. And then we have some third-party services that have already built mobile applications, sometimes web applications, and we help them integrate so they are displayed and available inside our application.
We also decided to distribute our clients, so they are separated into physical, independent instances. We really wanted to focus on encrypting everything, so every client has its own database encrypted with a set of keys that are managed completely separately from all the other clients. And eventually we needed to test everything, which makes a long list of things to do, so we needed to distribute the tests as well.
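The per-client isolation described here can be sketched roughly as follows. This is a toy illustration, not Workwell's implementation: the tenant names, the store layout, and the SHA-256 keystream cipher are all made up for the example (a real deployment would use a proper cipher such as AES-GCM with keys held in a KMS). The point it shows is that each client gets its own independently managed key and store, so rotating or revoking one client's key never touches another client's data.

```python
import hashlib
import secrets

class TenantVault:
    """Toy sketch of per-client isolation: every tenant has its own
    independently managed key and its own store. Illustrative only --
    the keystream cipher below stands in for a real one like AES-GCM."""

    def __init__(self):
        self._keys = {}    # tenant -> key, managed separately per tenant
        self._stores = {}  # tenant -> {field: (nonce, ciphertext)}

    def add_tenant(self, tenant: str) -> None:
        self._keys[tenant] = secrets.token_bytes(32)  # independent key
        self._stores[tenant] = {}

    def _keystream(self, tenant: str, nonce: bytes, length: int) -> bytes:
        # Toy SHA-256 counter-mode keystream derived from the tenant's key.
        out = b""
        counter = 0
        while len(out) < length:
            block = self._keys[tenant] + nonce + counter.to_bytes(8, "big")
            out += hashlib.sha256(block).digest()
            counter += 1
        return out[:length]

    def put(self, tenant: str, field: str, value: bytes) -> None:
        nonce = secrets.token_bytes(16)
        stream = self._keystream(tenant, nonce, len(value))
        self._stores[tenant][field] = (nonce, bytes(a ^ b for a, b in zip(value, stream)))

    def get(self, tenant: str, field: str) -> bytes:
        nonce, ciphertext = self._stores[tenant][field]
        stream = self._keystream(tenant, nonce, len(ciphertext))
        return bytes(a ^ b for a, b in zip(ciphertext, stream))

vault = TenantVault()
vault.add_tenant("acme")    # hypothetical client names
vault.add_tenant("globex")
vault.put("acme", "hr_token", b"secret-for-acme")
assert vault.get("acme", "hr_token") == b"secret-for-acme"
# "globex" has its own key; nothing about "acme" depends on it.
```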
So, a few constraints in our architecture. First, the platform code: we have a SaaS platform, and most of the time we test everything automatically and deliver as fast as possible. The content and integrations that we deploy to clients are a bit more complicated, because they're not system-wide; sometimes a client will ask for something that is reserved for them alone, not for all of our clients. We also want to integrate external applications and external services, and they don't ship at the same pace as we do, or at the same pace as each other. Since this doesn't all happen in one place at one pace, and we don't test everything in one place, we needed to build a tool for them and enable them to run their own tests as well.
But as I said, since we spread clients into a distributed list, the test setup is quite easy: they are already distributed, and we can easily shift traffic from one client to another. One more thing: we deliver client-specific configuration, so the UI can change between two clients. They all have the same application from the Play Store or the App Store, but the inside of the application changes depending on the client. This needs to be tested, because we cannot afford to display something that isn't well tested, or that breaks for one set of users at a client.
So basically, this is our architecture, mostly oriented around the back end because I work on the back end. Everything at the top of the page is what we handle internally: everything that handles requests and integration with services. We have some external and internal tools, like Amplitude and Chartio, to analyze everything and verify that everything goes well when people are using the mobile apps. And we have bots that we developed ourselves to easily deploy configuration, check on very specific pieces of configuration, test images, and turn features on and off; we just talk to the bots. They're implemented with Slack.
A funny thing is that sometimes we are demoing at a client and we change the configuration for them just by talking to our phone, and it propagates the changes and can turn on some really interesting features.
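The bot commands described above might look something like this minimal sketch. The command verbs, the flag store, and the client names are assumptions for illustration; this is just a parser for the kind of chat message a deployment bot could receive, not Workwell's actual Slack integration.

```python
# Hypothetical flag store: (client, feature) -> enabled?
FLAGS = {}

def handle_command(client: str, text: str) -> str:
    """Handle a chat message like 'enable lunch-matching' for one client.
    In a real bot this would be wired to Slack and would persist the change;
    here it just updates an in-memory dict."""
    verb, _, feature = text.partition(" ")
    if verb not in ("enable", "disable") or not feature:
        return f"unknown command: {text}"
    FLAGS[(client, feature)] = (verb == "enable")
    return f"{feature} {verb}d for {client}"

# Saying "enable lunch-matching" during a demo flips the flag for that
# one client only -- every other client's configuration is untouched.
print(handle_command("acme", "enable lunch-matching"))
```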
So, what do we need to test? I separated it into two columns, because the first column can be automated. Platform code: we try to automate it as early as possible, with as little human interaction as possible. We test the data as well, for consistency, for missing data, for corrupted data; we make sure it doesn't break anything in our system. We test the mobile applications, of course, and the features. When I talk about features, we test across several services, and sometimes that involves both the back end and a mobile application, so we have simulators and things like that.
There are also features on the other side, the side that is not completely automated. We have a lot of human interaction here: the UX and the content, especially those two, need to be verified by a human eye, along with the feel inside the application, the quality of the images and media that are displayed. Everything around the UI needs a strong eye and human interaction.
And also, the first day I joined Workwell, this is what I was asked to do: ship to production as fast as possible. The less time we spend deploying to production, the better. And as often as possible: if we need to do it 12 times a day, we will do it 12 times a day.
Speaking of timeframes, these are our current timeframes for all the deliveries we do. The platform code, as I said, ships quite often, but we can do even better. Right now we ship all the code we build to prod several times a week, and sometimes several times a day when we figure out something is missing or we find a bug. Our policy is to avoid rollbacks by deploying a fix even faster afterwards, and to avoid data losses in the first place. Mobile applications ship approximately once a month, just because we depend on the stores, and especially the App Store: they need to review everything, so it's not easy to deploy more than once a month. But we do test new builds every week.
And like I said, we have customer-specific media and configuration, which we ship quite often, several times a day. We have internal admin tools and chat bots to deploy things and change the configuration, and we use them quite often. I wrote 10-plus, but sometimes it can be 50 times a day for one specific client, because the clients are distributed. Most of the time it's very short, but some configurations are quite difficult to push right now, so it can take several minutes, which is a pain when you're setting up something you need to test afterwards.
As for the teams that supply the integrated apps, they usually ship a few times a week. It takes some minutes, but we really want to help them ship faster. Inside our system it's propagated quite quickly, a matter of seconds, but for them, shipping takes longer, because most of the time they aren't deploying just the integration with Workwell, they're deploying other parts of their apps too.
And these are our goals for the next months. For the platform code, eventually we will ship several times a day, in a matter of seconds, and for the other things it's pretty much the same. We really want to focus on helping the integrated apps ship faster, so they can let the clients see the changes faster. For everything that's media and configuration, we want to keep the same pace but with fewer iterations: we want to help the customer deliver the right configuration on the first or second try, not after 10 iterations of a failing configuration. And we want to enable them to have a wider scope of configuration, so they can change more things in the application and have something that's really specific to their company. So once they open the app, they feel like they're using software that they built themselves.
So, what do we do to test our platform code? As I said, we want as little human interaction as possible, so we automated everything. The end result is that we have almost zero percent human interaction; everything is done automatically. As soon as we push lines of code, everything is tested directly. All the PRs are automated, and we still review everything, but the shipping part and the testing part are done automatically, across all the microservices in the back end that I mentioned before. We have an unusual ratio between unit tests and functional tests, just because we tend to keep the code simple but have a lot of cases to cover: we have distributed clients with specific configurations, and we need to be able to cover all those different configurations, plus all the cases for the integrated apps.
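One way to read that unusual test ratio: the same functional scenario gets run once per client configuration, so functional cases multiply while the unit-tested code stays simple. A minimal sketch, with made-up client configs and a stand-in `render_home` function (none of these names come from the talk):

```python
# Hypothetical client configurations -- in the real system each client's
# config lives in its own isolated instance.
CLIENT_CONFIGS = [
    {"client": "acme",   "features": {"lunch-matching"}, "locale": "fr"},
    {"client": "globex", "features": set(),              "locale": "en"},
]

def render_home(config: dict) -> list:
    """Stand-in for a real endpoint: builds the list of home-screen
    sections for one client, based on that client's feature set."""
    sections = ["services"]
    if "lunch-matching" in config["features"]:
        sections.append("lunch-matching")
    return sections

def test_home_screen_per_client():
    # One scenario, run against every client configuration.
    for config in CLIENT_CONFIGS:
        sections = render_home(config)
        assert "services" in sections  # every client has the core section
        # feature sections appear only when that client enabled them
        assert ("lunch-matching" in sections) == ("lunch-matching" in config["features"])

test_home_screen_per_client()
```

Adding a client just extends the config list; the functional suite grows with the client base while the code under test does not.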
For the rest of the tests, we decided we need to rely on the end user and the customer, just because we are a small team and, as you can see, the architecture is quite wide, and you can multiply it by the number of integrated apps that we have: about 60 or 70. They are changing things every day, and we have about 30 clients who also want to change things every day. At some clients, the product management team alone is bigger than all of Workwell. So we need to enable them to cover all the tests and empower them to publish the configuration, to help them scale faster.
For integrated apps, as I said, all the clients are distributed, so the canary test, if we can call it that, is quite natural: an integrated app can target one specific client. We give them the ability to be visible inside one customer only, and they will be tested there. We do some A/B testing as well: inside the customer we can handpick a list of users to test with, so the first step is choosing those users, and the second step is selecting a random list of users to make it a real test. We really want to confront the code with reality as fast as possible. And everything here is done in production, always.
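The targeting just described can be sketched as a small visibility check: canary first (one client only), then the handpicked testers, then a random fraction of the client's users. This is a hedged sketch with invented names and a hash-seeded random choice; the talk doesn't say how Workwell actually selects the random list.

```python
import random

def visible_to(app: dict, client: str, user: str) -> bool:
    """Decide whether one user at one client sees an integrated app
    that is mid-rollout. All field names here are assumptions."""
    if client != app["canary_client"]:
        return False                         # canary: one client only
    if user in app["handpicked"]:
        return True                          # step 1: chosen testers
    rng = random.Random(app["name"] + user)  # deterministic per user,
    return rng.random() < app["rollout"]     # step 2: random fraction

app = {
    "name": "food-delivery",     # hypothetical integrated app
    "canary_client": "acme",     # the one client it targets
    "handpicked": {"alice"},     # step-1 testers
    "rollout": 0.2,              # step-2 random fraction of users
}
assert visible_to(app, "globex", "alice") is False  # outside the canary client
assert visible_to(app, "acme", "alice") is True     # handpicked tester
```

Seeding the generator per user keeps the choice stable across requests, so a user in the random sample stays in it, which is what makes the second step a real test rather than flicker.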
And we involve a lot of people: from the client, from the integrated app, which is most of the time an external developer team, and from the deployment team at Workwell. We have a team that handles all the customer requests and all the shipping of the integrated apps, and they make the two sides work together and test their integrated app together. We also involve our QA for everything related to the code base, the core of the back end, and the mobile apps.
So basically, this is how we do it. As you can see, for Workwell this is always happening in production, and we give the service provider the possibility to have both staging phases and production phases inside Workwell's production. We give them the ability to have a subset of testers, end users, and a subset of clients as well. The schema is quite difficult to draw, so this shows only one client.
For the content, like for integrated apps, the canary test is quite natural here. But here we want maximum human interaction, because as I said, there's a lot of UX and a lot of media, images, colors, fonts, and everything textual, that needs a human eye to be tested. Because we enable the customer, and the product management team inside the customer, to test and deploy everything, this is really scalable for us as we grow: we can have as many clients as we want, and they will still be able to run all the tests they want.
And our policy here is that we give them the possibility to create, modify, and really deeply test the configuration and everything they are integrating. They can get frustrated, just because they are testing, so what we do is really make them the owner of the feature and the configuration they are deploying, so they are less frustrated. For our clients, if they do well on this test coverage, we can offer them other integrated apps that weren't yet integrated into their system. And for integrated apps, we will propagate them to the other customers. So it's a win-win: the tests give the client confidence, and then, if everything works well for us, we can connect them with other clients or other integrated apps. The key here is really to empower the client.
As I mentioned earlier, we have chat bots and admin tools to validate and publish everything faster. Right now the deployment team is using them, but soon the client will be able to talk to the chat bot directly and change the configuration right away.
This is what it looks like right now. We have a bunch of clients, and once the client PM requests a change, it goes through the Workwell deployment team, which approves the change or makes some minimal adjustments, and then deploys and publishes everything through our admin tools or the bots directly. Then it goes back to the client environment, and the users and the client PM can test and validate. If they're okay with it, they feel like they own the feature and the change, and they never get frustrated. What we've experienced so far: at the beginning we didn't have all the tools to publish faster and approve everything, so it was quite difficult. But now we have clients talking to each other, saying, "Hey, I did that change in the configuration, it's possible to do it like this," and they publish those changes by themselves without going through us. So it goes really fast for them, which is what they like, and they never get frustrated because they can change and roll back really quickly.
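The flow in that diagram (request, approve, publish, validate, with rollback always available) can be sketched as a tiny state machine. The states and field names are assumptions for illustration; the talk only describes the flow, not how it is implemented.

```python
from dataclasses import dataclass, field

@dataclass
class ChangeRequest:
    """Hypothetical sketch of one client configuration change moving
    through the flow: requested -> approved -> published -> validated."""
    client: str
    change: dict
    state: str = "requested"
    history: list = field(default_factory=list)

    def _move(self, expected: str, new: str) -> None:
        # Refuse out-of-order transitions, e.g. publishing before approval.
        assert self.state == expected, f"cannot go from {self.state} to {new}"
        self.history.append(new)
        self.state = new

    def approve(self):  self._move("requested", "approved")    # deployment team
    def publish(self):  self._move("approved", "published")    # admin tools / bots
    def validate(self): self._move("published", "validated")   # client PM and users

    def rollback(self):
        # Rollback is allowed from any state, which is what keeps the
        # client unafraid to experiment.
        self.history.append("rolled_back")
        self.state = "rolled_back"

req = ChangeRequest("acme", {"primary_color": "#0044ff"})
req.approve(); req.publish(); req.validate()
assert req.state == "validated"
```

Removing the deployment team later, as the talk describes, would just mean letting the client call `approve` (or skipping that state) while the servers and bots keep enforcing the remaining transitions.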
And soon we'll remove the deployment team from this schema, and as you can see, the only things left will be automated: the servers and all the tools and bots. There's no human interaction from Workwell anymore; everything is done by either the client or the integrated apps. For us, that's a key to scalability.
In the end, it feels like we are just watching our clients and integrated apps play with each other. But in reality, we are still that guy here, because we are still testing everything: when we validate something for one client, before we propagate it to other clients, we still want to make sure it's compatible with the others. And the same goes for integrated apps: before we propagate them to other clients and environments, we want to make sure they also fit our needs, not only the client's needs.