#ToggleTalk 3: Resiliency
According to Merriam-Webster, resilience is defined as “an ability to recover from or adjust easily to misfortune or change.”
Toggle thought this topic was a good follow-up to last week's conversation about productivity. We are currently having to adjust our expectations about what being productive means, and we've had to adapt quickly to new remote working situations. Systems are being pushed to their limits as large numbers of people move quickly to cloud-based solutions for meetings, social gatherings, and educating students. Adjusting well to these changes requires resiliency.
Questions we posed on resiliency:
- How do you define resilience?
- How do you build resiliency in your systems?
- How do you increase your own tolerance for disruption and failure?
- What value can we derive from critical events?
Whole vs. parts
If you are looking for resilience, you have to look at the big picture. From a technology perspective, if you are striving for five-nines availability, you have to look not just at the technology but also at the people, the processes, and the organization as a whole.
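To put a number on what five nines asks of the whole organization, here is a rough sketch of the downtime arithmetic in Python. The function name and the 365-day year are our own assumptions for illustration, not something taken from the discussion.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Return the downtime budget, in minutes, for an availability target.

    Assumes a simple model: budget = total minutes in the period
    multiplied by the fraction of time the system may be unavailable.
    """
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_percent / 100)


# Hypothetical targets, listed only to compare against five nines.
targets = {
    "two nines": 99.0,
    "three nines": 99.9,
    "four nines": 99.99,
    "five nines": 99.999,
}

for name, percent in targets.items():
    minutes = allowed_downtime_minutes(percent)
    print(f"{name} ({percent}%): {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime in a whole year. That is less time than it usually takes a person to notice a page and log in, which is why the people and processes matter as much as the technology.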
Sociotechnical models do just this, and they help when it comes to resiliency. Sociotechnical theory looks at the interrelationships between the social and technical aspects of a system. Consider how people will use the software, who will be using it, and who will be supporting it. This can help you build resilience as you adapt to the changing social aspects.
When looking at the social aspects, remember that resources are finite. And people are not resources. People do not have an infinite ability to respond to and recover from failures. We use metrics to track the health of individual elements of our systems. We can also use metrics to track our own health and ability to respond to failures.
Humans are resilient, systems are robust
One aspect of resilience is sustained adaptability. This is where humans come in. People make decisions about what to build, how to build it, and how and when to change it. Systems will not adapt without humans. It isn’t possible to separate the human from the tech.
I love the framing of incidents as surprises. It takes away some of the negative stigma of incidents being bad. If we frame incidents as surprise learning opportunities, it helps us figure out what the best response is.
The conversation about resilience and surprises seemed to naturally lead to a discussion of mental models. A mental model is an explanation of someone’s thought process of how something works. Mental models help us understand and interpret the relationships between things. When we encounter an obstacle, we may have to update our mental models. The solution that worked previously may not work the second time around. Our ability to continually update our mental models is part of our resiliency.
During #ToggleTalk, we touched on all four concepts for resilience as outlined by David Woods (see article below):
- Ability to rebound
We need to look at technology from a sociotechnical perspective for true resiliency.
Thanks to everybody who joined this week’s discussion on resilience. See you next week on #ToggleTalk!
There is an upcoming conference (next week!) if you want to learn more about resilience engineering and the process for building systems that can withstand unexpected failures. You can register here: FailoverConf (free of charge!). We will be there, and so will one of our Developer Advocates, Heidi Waterhouse. Or you can check out these recommended reads and talks: