How Progressive Delivery Helps You Learn from Failures
We’ve all experienced failures, and that’s a good thing. If you’re not failing, you’re not learning. Negative experiment results teach us as much as positive results. I’ve previously written about turning failures into success and what is needed organizationally for this to happen.
To recap, failures are unavoidable in production environments. Services behave in unexpected ways. Users surprise us with their actions. No matter how many tests you run before releasing code, it is impossible to eliminate all risks.
Instead of trying to avoid all risk, then, you need to prepare for failure. You need safety mechanisms that reduce the impact and let you recover quickly when everything goes pear-shaped.
One way you can safely fail and quickly recover when deploying software is with Progressive Delivery.
What is Progressive Delivery?
Progressive Delivery is the process of delivering changes first to small, low-risk audiences and gradually expanding to larger and riskier audiences. Progressing slowly lets you validate the results as you go. If something isn’t going well, you pause and correct. By creating more checkpoints during the rollout, you have more opportunities to test, experiment, and gather feedback from users to improve your release quality. Companies like Target and IBM use Progressive Delivery techniques to control how features are released to users.
Suppose you are rolling out a new feature that adds social media posts to your site. Before releasing it to all users, you want to test it with a small group of users to get feedback and check that it meets performance expectations. Progressive Delivery enables you to do this.
Progressive Delivery in itself is a learning exercise. It is about discovering the best path to take with a feature. You can prepare multiple paths to follow, but each path will lead to different discoveries and learnings.
Aspects of Progressive Delivery that you may already be using, and that foster learning, include:
- A/B testing to decide which version of a feature performs better.
- Chaos Engineering to identify how systems behave in unexpected situations.
- Canary testing to verify a feature works as expected with a small group of users.
Progressive Delivery is composed of two parts: release progression and delegation.
Slowly release features to users
Release progression is the process of adjusting the number and types of users who see a new feature. By limiting the users who see a feature, you limit the impact if something goes wrong.
A pivotal piece of Progressive Delivery is choosing the right audience. This is a deliberate process. You don’t want to roll out to a random percentage of users; otherwise, you may end up targeting higher-risk audiences. You may opt to release first to internal users, then to users who have signed up to participate in a canary or beta program. These users have self-selected and will be more forgiving of failures, errors, and unexpected behavior.
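As a rough illustration, the audience-based progression described above can be modeled as an ordered list of stages, each one a deliberately chosen audience. This is a minimal sketch, not any feature management product's API: the `User` fields, the stage names, and the `Rollout` class are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical audience stages, ordered from lowest to highest risk.
STAGES = ["internal", "beta", "everyone"]


@dataclass
class User:
    key: str
    is_employee: bool = False
    in_beta_program: bool = False


def audience_of(user: User) -> int:
    """Return the index of the narrowest stage that includes this user."""
    if user.is_employee:
        return STAGES.index("internal")
    if user.in_beta_program:
        return STAGES.index("beta")
    return STAGES.index("everyone")


@dataclass
class Rollout:
    # Index of the widest stage currently receiving the feature.
    stage: int = 0

    def advance(self) -> None:
        """Expand to the next audience; stay at the last stage once reached."""
        self.stage = min(self.stage + 1, len(STAGES) - 1)

    def is_enabled(self, user: User) -> bool:
        # A user sees the feature once the rollout has reached their audience.
        return audience_of(user) <= self.stage


rollout = Rollout()
alice = User("alice", is_employee=True)
bob = User("bob", in_beta_program=True)
carol = User("carol")
print([rollout.is_enabled(u) for u in (alice, bob, carol)])  # [True, False, False]
rollout.advance()
print([rollout.is_enabled(u) for u in (alice, bob, carol)])  # [True, True, False]
```

Because each stage is an explicit, named audience rather than a random slice, you decide who absorbs the risk at each step, and how often you call `advance()` can be set per feature.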
There is no right or wrong answer as to how quickly a release should progress. That decision is yours to make. You need to roll out a feature at a pace appropriate to your business, and that pace can differ not just from company to company but from feature to feature. The progression for one feature may take days, while another feature could take weeks or months. This is OK. Modifying the release progression isn’t a failure; it helps to avoid failures.
Delegate control of features
The second aspect of Progressive Delivery is delegation. Delegation gives control of turning a feature on or off to the group most closely responsible for the feature’s outcome. In other words, delegation empowers others in the organization to respond quickly.
Imagine you’ve released a new feature as a canary launch. One customer calls into support and is having issues. When you have delegated control, the support engineer has the power to disable the feature for that particular user (or, depending on the potential impact, for everybody). Immediately, the issue is resolved for the customer, and they can continue along their merry way.
Contrast that with what happens when there is no delegation of control. Instead of toggling a flag, the support engineer has to escalate the issue. Time is spent trying to reproduce the issue. Once it is reproduced, a fix needs to be deployed. While all this is happening, more customers encounter the same issue, leading to more calls into support. The failure is rapidly becoming a larger issue.
Delegating control and empowering others in the organization to react reduces the number of touch points needed to disable a feature, which can prevent small failures from becoming larger ones.
Recover from failure
Progressive Delivery doesn’t imply there won’t be failures. There will be. Deploys will behave in unexpected and unplanned ways. (And to see how feature flags can help you quickly recover from a failed deploy, check out this video from my colleague, Yoz.)
An implied aspect of both delegation and release progression is having processes in place to manage and control your systems’ behavior when things go wrong. You can only achieve Progressive Delivery with the use of feature flags and observability. Together, these provide the safety mechanisms to identify and recover from failure.
You need the ability to selectively turn a feature off for a single user experiencing a problem. On the flip side, you may want to leave a feature on for a user experiencing problems and turn it off for everybody else to isolate and troubleshoot.
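Both scenarios come down to per-user overrides layered on top of a default value. The sketch below assumes a hypothetical in-memory `Flag` class; a real feature management system would persist these rules and expose them through its own interface.

```python
from dataclasses import dataclass, field


@dataclass
class Flag:
    """Hypothetical feature flag: a default value plus per-user overrides."""
    default: bool = True
    overrides: dict[str, bool] = field(default_factory=dict)

    def evaluate(self, user_key: str) -> bool:
        # A per-user override, if present, wins over the default.
        return self.overrides.get(user_key, self.default)

    def disable_for(self, user_key: str) -> None:
        """Turn the feature off for a single user; everyone else is unaffected."""
        self.overrides[user_key] = False

    def isolate(self, user_key: str) -> None:
        """Keep the feature on for one user and turn it off for everyone else."""
        self.default = False
        self.overrides = {user_key: True}


flag = Flag()
flag.disable_for("customer-42")
print(flag.evaluate("customer-42"), flag.evaluate("customer-7"))  # False True
flag.isolate("customer-42")
print(flag.evaluate("customer-42"), flag.evaluate("customer-7"))  # True False
```

In the support scenario described earlier, `disable_for` is the single action a support engineer would need permission to take.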
Recovering from a failure doesn’t have to mean completely disabling a feature. You may want to consider approaches like load shedding or request throttling. When your observability and monitoring tools indicate a problem, you need processes in place to take action quickly. Better still, the remediation, such as disabling a flag, can be triggered automatically when a performance threshold is exceeded.
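An automatically triggered remediation could look roughly like this: a guard tracks the outcome of recent requests and switches the flag off once the failure rate in a sliding window exceeds a threshold. The `ErrorRateGuard` class, its parameters, and the choice of error rate as the metric are assumptions for illustration; in practice the signal would come from your observability tooling.

```python
from collections import deque


class ErrorRateGuard:
    """Hypothetical guard that turns a flag off when recent errors exceed a threshold."""

    def __init__(self, threshold: float, window: int, min_samples: int) -> None:
        self.threshold = threshold      # maximum tolerated failure rate, e.g. 0.05
        self.min_samples = min_samples  # avoid reacting to a handful of requests
        self.results = deque(maxlen=window)  # True = request failed; oldest dropped first
        self.flag_on = True

    def failure_rate(self) -> float:
        return sum(self.results) / len(self.results) if self.results else 0.0

    def record(self, failed: bool) -> None:
        """Record one request outcome and disable the flag if the rate is too high."""
        self.results.append(failed)
        if (
            self.flag_on
            and len(self.results) >= self.min_samples
            and self.failure_rate() > self.threshold
        ):
            self.flag_on = False  # automatic remediation: no human in the loop


guard = ErrorRateGuard(threshold=0.5, window=10, min_samples=4)
for failed in [False, True, False, True, True]:
    guard.record(failed)
print(guard.flag_on)  # False
```

Turning the flag off is only one possible action here; the same check could instead trigger load shedding or request throttling.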
You can’t prevent failures from happening, but you can prepare and put processes and systems in place to minimize the impact when failures happen. With Progressive Delivery, you have the tools and tactics you need to successfully deploy and release software safely.