Dungeons & Downtimes: XP gained from our adventure

Published October 16th, 2025

Portrait of Will Chieng.

by Will Chieng, LaunchDarkly Engineer

*It was a peaceful Friday night / Saturday morning. Your laptop long tucked away in your bag, and you [probably, hopefully] tucked away in bed. It is currently 3 AM local time - yes, even for you folks on the East Coast, somehow.

Your phone starts ringing. Will you actually wake up? Roll for initiative.*

TL;DR: Play this with your team! Discover gaps as you role-play through the scenario, and then have a follow-up session to address questions in-depth.

  • Double-check your PagerDuty setup
  • Go through how to debug issues
  • Have fun!

Rolling dice

For most of us on the Metrics team, this was the first time we were on-call after hours at LaunchDarkly. Some of us are frontend engineers who haven’t debugged backend issues before, and some of us are backend engineers who haven’t debugged frontend issues before. So to prepare ourselves, we role-played a mock incident in the style of a Dungeons and Dragons tabletop game.

Photo of D4s, D6s, D8s, D20s in various shades of green in a fancy brown box, next to a soft looking emerald bag.

I can never remember how to roll initiative.

The players

Primary On-Call

  • Anthony
  • Hakan
  • Zakk (see below for the twist)

Secondary On-Call

  • Baslyos
  • Liz

Incident Manager

  • Tiffany

Dungeon Master

  • Will

Playing the Game

The scenario starts off with everyone sleeping…

It was a peaceful Friday night / Saturday morning. Your laptop long tucked away in your bag, and you [probably, hopefully] tucked away in bed. It is currently 3 AM local time - yes, even for you folks on the East Coast, somehow.

Your phone starts ringing. Will you actually wake up? Roll for initiative.

As primary on-call, Anthony, Hakan, and Zakk each rolled a 20-sided die (d20) to see if they passed a perception check - that is, whether they actually woke up and noticed their phone ringing.

Screenshot of a private Slack channel titled #temp-dungeons-and-downtimes-metrics-20250821. Team member Anthony has sent a message to "/roll 1d20" from a D&D dice roller Slack app, and received a 13.

There's a Slack plugin for everything these days.
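
The `1d20` in that command is standard dice notation: the number of dice, then `d`, then the number of sides. A minimal sketch of such a roller in Python follows. The function names and the difficulty value are hypothetical, and this is not the Slack app's implementation:

```python
import random
import re
from typing import Optional

# "<count>d<sides>", for example "1d20" or "3d6".
DICE_PATTERN = re.compile(r"^(\d+)d(\d+)$")


def roll(notation: str, rng: Optional[random.Random] = None) -> list:
    """Roll dice described in NdM notation and return each die's result."""
    match = DICE_PATTERN.match(notation.strip().lower())
    if match is None:
        raise ValueError(f"not dice notation: {notation!r}")
    count, sides = int(match.group(1)), int(match.group(2))
    if count < 1 or sides < 1:
        raise ValueError("need at least one die with at least one side")
    rng = rng or random.Random()
    return [rng.randint(1, sides) for _ in range(count)]


def passes_check(total: int, difficulty: int) -> bool:
    """A check passes when the roll meets or beats the difficulty."""
    return total >= difficulty


if __name__ == "__main__":
    WAKE_UP_DIFFICULTY = 10  # hypothetical; the Dungeon Master picks this
    result = sum(roll("1d20"))
    print(f"rolled {result}, woke up: {passes_check(result, WAKE_UP_DIFFICULTY)}")
```

Passing in a seeded `random.Random` makes a session reproducible, which helps if you want to replay a scenario with another team.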

Because they passed the check, I asked them to put their phones on Do-Not-Disturb mode, and then actually paged them to double-check that everyone had PagerDuty set up correctly.

Follow-up item: We discovered that some of our PagerDuty settings weren’t set up to bypass Do-Not-Disturb mode, so those pages did not actually alert anyone.

Evil stirs in the night…

While Anthony and Hakan were debugging, Zakk had other plans.

For you see, he did not share the same goals as the others.

Unlike the others, Zakk, you’re already awake, sitting in the darkness when your phone lights up. As you look up in the mirror, you see a devilish reflection grinning back at you. You hear a voice like your own:

“This is your chance to shine. Sabotage the others, be the hero, and take the glory of saving Metrics for yourself. Or better yet, watch the world burn.”

Do you resist the dark urge or do you embrace it?

(Spoiler alert: he wholeheartedly embraced the darkness and became the antagonist)

Zakk’s first order of business was to impose a consequence on the team: GitHub is down and there’s a chance it won’t actually load.

Investigating…

Anthony looks at the alert message:

Triggered: success rate SLO Burn Rate Alert. For the 7-day target, burn rates of 14.65 and 37.04 were measured for the past 4h (long window) and 20m (short window), respectively. Burn Rate has exceeded for metrics success rate & requests were 5xx in last 4h. Error budget rate has exceeded 5% of the 7-day error budget which will lead to violation of success rate SLO. Notified @slack-ops-metrics

And noted this follow-up item:

Follow-up: What should I know about burn rates and how are they calculated?
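
For readers with the same question, here is a minimal sketch of the usual multi-window burn-rate arithmetic, in Python. Burn rate is the observed error rate divided by the error budget (1 minus the SLO target), so a burn rate of 1 spends the budget exactly over the SLO period. The function names, the 99.9% target in the docstring, and the reading of "5% of the 7-day error budget" as the alert threshold are assumptions for illustration, not the actual alert configuration:

```python
PERIOD_HOURS = 7 * 24  # 7-day SLO period, as in the alert message


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate relative to the allowed error rate (the error budget).

    For example, 1% errors against an assumed 99.9% target burns about 10x too fast.
    """
    error_budget = 1.0 - slo_target
    return error_rate / error_budget


def budget_consumed(rate: float, window_hours: float) -> float:
    """Fraction of the period's budget spent if `rate` held for the whole window."""
    return rate * window_hours / PERIOD_HOURS


def threshold_for(budget_fraction: float, window_hours: float) -> float:
    """Burn rate that would spend `budget_fraction` of the budget within the window."""
    return budget_fraction * PERIOD_HOURS / window_hours


def should_alert(long_rate: float, short_rate: float, threshold: float) -> bool:
    """Fire only when both windows exceed the threshold, so a brief blip doesn't page."""
    return long_rate >= threshold and short_rate >= threshold


if __name__ == "__main__":
    # Burn rates reported in the alert message (4h long window, 20m short window).
    long_rate, short_rate = 14.65, 37.04
    threshold = threshold_for(budget_fraction=0.05, window_hours=4)  # assumed mapping
    print(f"threshold burn rate: {threshold:.2f}")
    print(f"alert fires: {should_alert(long_rate, short_rate, threshold)}")
    print(f"budget spent over 4h: {budget_consumed(long_rate, 4):.1%}")
```

Under these assumptions the threshold works out to a burn rate of 2.1, and a burn rate of 14.65 sustained for four hours would spend roughly a third of the weekly budget. The short window is there to confirm the problem is still happening before anyone gets woken up.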

He then looked for a Metrics dashboard, and discovered that there are multiple unrelated ones!

Screenshot of 6 / 58 dashboards matching the query 'metrics'.

6 / 58 dashboards matching the query 'metrics.'
Follow-up: Which dashboards should I be looking at?

Anthony and Hakan took turns going through the traces:

Screenshot of server logs displaying 9 errors to GET/internal/projects/projkey/metrics URLs.

Roll 500 for Internal Server Error.

And discovered an error message:

"Failed to query Athena"

Anthony then wanted to check up on Athena, but was met with a screen that suggested we didn’t even have Athena access:

Screenshot of AWS Athena login page that promises 'start querying data instantly.' Seems like that should require at least casting a spell.

Athena? Is that an evil cleric?
Follow-up: How do I health-check Athena?

Hakan then checked our Airflow DAGs, and found that the DAGs were fine.

Deception

Zakk attempts to throw everyone off the scent by misleading them into looking at recent deploys instead of investigating the error message further.

Screenshot of Slack dice roller plugin thingy.

Zakk rolls a d20 and gets an 18. Anthony rolls a d20 and gets a 14.

And we fell victim to his silver tongue!

With that, the heroes turned their attention to looking at recent deploys… before realizing they needed to find out which repository / service to look at. Something to follow up on. 😄

The Cavalry Arrives

The secondary on-call is paged! Liz successfully wakes up and responds to the page, but Baslyos unfortunately rolls too low (3/20), and continues peacefully sleeping away. 🛌💤

Learning Datadog: Trace Explorer

Liz demonstrated how to use Trace Explorer to determine whether the errors were all coming from one endpoint, and whether they were affecting one customer or many.

And it was indeed one specific customer that was unable to load metric event activity!
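
The same two questions can be asked of any set of failed requests. Here is a minimal sketch in Python, assuming each request record carries an endpoint and a customer identifier. The `ErrorSpan` type, the field names, the 90% cutoff, and the sample values are all hypothetical, and this is plain Python rather than Datadog's API:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ErrorSpan:
    endpoint: str
    customer: str
    status: int


def summarize(spans: Iterable[ErrorSpan]) -> Tuple[Counter, Counter]:
    """Count 5xx responses per endpoint and per customer."""
    errors = [span for span in spans if span.status >= 500]
    by_endpoint = Counter(span.endpoint for span in errors)
    by_customer = Counter(span.customer for span in errors)
    return by_endpoint, by_customer


def dominated_by_one(counts: Counter, share: float = 0.9) -> bool:
    """True when a single key accounts for at least `share` of all errors."""
    total = sum(counts.values())
    if total == 0:
        return False
    _, top_count = counts.most_common(1)[0]
    return top_count / total >= share


if __name__ == "__main__":
    # Made-up sample: nine failures for one customer, one success for another.
    spans = [ErrorSpan("/metrics", "customer-a", 500)] * 9
    spans.append(ErrorSpan("/metrics", "customer-b", 200))
    by_endpoint, by_customer = summarize(spans)
    print("single endpoint:", dominated_by_one(by_endpoint))
    print("single customer:", dominated_by_one(by_customer))
```

If one customer dominates the error count, the blast radius is narrow, which is useful context when deciding on severity.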

Declare an incident?

Role-playing as characters in the scenario, the team debated whether to declare an incident. Zakk pointed out our policy is to declare an incident if there is any doubt, so we declared an incident.

We learned how to start an incident via Slack.

Tiffany arrives as incident manager:

Screenshot from Slack: INCIDENT CHANNEL. Tiffany being a boss and making the definitive decision to declare an incident because everything involving metrics is on fire.

Everything involving Metrics is on fire? Uh oh.

The team discussed the severity and the next steps, and the incident ends.

Fin.

Following Up

We found that going through the Dungeons and Downtimes scenario was great for discovering issues, gaps, and questions, so Liz suggested a follow-up session where we dive in-depth on those specific questions.

Liz also noted that the endpoint we were investigating does not have any alerts configured yet, so we’re going to add those.

Recommendations

I hope you enjoyed reading this lengthy post about our adventures in Dungeons and Downtimes! We encourage you to run similar scenarios for your teams, and to follow up on any questions unearthed. The scenario branched off in a different direction than what I originally prepared for (which is actually awesome - that makes it more interesting!).

Thank You

This session would not have gone as smoothly or been as fun without the players:

  • Anthony and Hakan for discovering all the gaps and calling them out.

  • Baslyos for explaining our process and for showing us what happens if we decline a page.

  • Liz for teaching us how to page people, how to debug an issue using trace explorer, and pushing for a follow-up session to address questions in-depth.

  • Tiffany for the valuable feedback throughout the process, recording questions and screenshots, the Slack emoji, and encouraging people to share their screen while debugging.

  • Zakk for being a creative and entertaining villain. You gave us the most laughs!