#ToggleTalk 3: Resiliency

LaunchDarkly
By Dawn Parzych   •   April 17, 2020

According to Merriam-Webster, resilience is defined as “an ability to recover from or adjust easily to misfortune or change.”  

Toggle thought this topic was a good follow-up to last week's conversation about productivity. We are currently having to adjust expectations about what being productive is. We've had to adapt to new remote working situations quickly. Systems are being pushed to their limits as large numbers of people quickly move to cloud-based solutions for meetings, social gatherings, and educating students. Adjusting well to these changes requires resiliency.

Questions we posed on resiliency: 

  • How do you define resilience?
  • How do you build resiliency in your systems? 
  • How do you increase your own tolerance for disruption and failure?
  • What value can we derive from critical events?

Highlight reel

Whole vs. parts

If you are looking for resilience, you have to look at the big picture. From a technology perspective, if you are striving for five-nines availability, you have to look not just at the technology but the people, the processes, and the organization as a whole.
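To put a number on "five-nines," here is a minimal sketch of the downtime budget that an availability target implies. The function name and the 365.25-day year are assumptions made for illustration; nothing here comes from the #ToggleTalk discussion itself.

```python
# Downtime budget implied by an availability target (illustrative arithmetic only).
# Assumption: an average year of 365.25 days, to account for leap years.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(availability: float) -> float:
    """Return the minutes of downtime per year allowed by an availability target.

    `availability` is a fraction between 0 and 1, e.g. 0.99999 for five nines.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    targets = [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]
    for label, target in targets:
        print(f"{label}: {downtime_budget_minutes(target):.2f} minutes per year")
```

Five nines works out to a little over five minutes of downtime per year. A budget that small arguably leaves little room for a person to notice and respond to a problem, which is one reason the people and processes matter as much as the technology.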

Sociotechnical models do just this and help when it comes to resiliency. Sociotechnical theory looks at the interrelationships between the social and technical aspects. Consider how people will use the software, who will be using it, who will be supporting it. This can help you build resilience as you adapt to the changing social aspects.

When looking at the social aspects, remember that resources are finite. And people are not resources. People do not have an infinite ability to respond to and recover from failures. We use metrics to track the health of individual elements of our systems. We can also use metrics to track our own health and ability to respond to failures.

Humans are resilient, systems are robust

One aspect of resilience is sustained adaptability. This is where humans come in. People make decisions about what to build, how to build it, and how and when to change it. Systems will not adapt without humans. It isn’t possible to separate the human from the tech.

Surprise!

I love the framing of incidents as surprises. It takes away some of the negative stigma of incidents being bad. If we frame incidents as surprise learning opportunities, it helps us figure out what the best response is. 

Mental models

The conversation about resilience and surprises seemed to naturally lead to a discussion of mental models. A mental model is someone’s explanation of how something works. Mental models help us understand and interpret the relationships between things. When we encounter an obstacle, we may have to update our mental models. The solution that worked previously may not work the second time around. Our ability to continually update our mental models is part of our resiliency.

Summary

During #ToggleTalk, we touched on all four concepts for resilience as outlined by David Woods (see article below): 

  • Ability to rebound 
  • Robustness 
  • Extensibility 
  • Adaptability 

We need to look at technology from a sociotechnical perspective for true resiliency.

Thanks to everybody who joined in this week’s discussion on resilience. See you next week on #ToggleTalk!

Want more?

There is an upcoming conference (next week!) if you want to learn more about resilience engineering and the process for building systems that can withstand unexpected failures. You can register here: FailoverConf (free of charge!). We will be there, and so will one of our Developer Advocates, Heidi Waterhouse. Or you can check out these recommended reads and talks: 

Recommended Reads 

Resilience is a Verb 

Four concepts for resilience and the implications for the future of resilience engineering

Report from the SNAFUcatchers Workshop on Coping with Complexity

Above the line, below the line

 

Recommended Talks

OOPS! Learning from Surprise at Netflix

How did things go right? Learning from incidents

A Few Observations on the Marvelous Resilience of Bone & Resilience Engineering
