All I Want for Christmas is Observable Multi-Modal Agentic Systems: How Session Replay + Online Evals Revealed How My Holiday Pet App Actually Works

Published December 10th, 2025


by Scarlett Attensil

I added LaunchDarkly observability to my Christmas-play pet casting app thinking I’d catch bugs. Instead, I unwrapped the perfect gift 🎁. Session replay shows me WHAT users do, and online evaluations show me IF my model made the right casting decision with real-time accuracy scores. Together, they’re like milk 🥛 and cookies 🍪 - each good alone, but magical together for production AI monitoring.

See the App in Action


Discovery #1: Users’ 40-second patience threshold

I decided to use session replay to evaluate the average time it took users to go through each step in the AI casting process. Session replay is LaunchDarkly’s tool that records user interactions in your app - every click, hover, and page navigation - so you can watch exactly what users experience in real-time.

The complete AI casting process takes 30-45 seconds: personality analysis (2-3s), role matching (1-2s), DALL-E 3 costume generation (25-35s), and evaluation scoring (2-3s). That’s a long time to stare at a loading spinner wondering if something broke.
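
For orientation, here’s a minimal sketch of that pipeline as sequential steps - the helper functions (analyzePersonality, matchRole, generateCostumeImage, scoreCasting) are hypothetical stand-ins, not the app’s actual code:

// Rough sketch of the casting flow; helper names are hypothetical stand-ins.
async function runCastingPipeline(quizAnswers, photo) {
  const personality = await analyzePersonality(quizAnswers, photo); // ~2-3s
  const role = await matchRole(personality);                        // ~1-2s
  const costumeImage = await generateCostumeImage(role, photo);     // ~25-35s (DALL-E 3)
  const evaluation = await scoreCasting(role, costumeImage);        // ~2-3s
  return { role, costumeImage, evaluation };
}

Nearly all of the wait sits in the image generation step, which is exactly where users start to bail.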

What are progress steps?

Progress steps are UI elements I added to the app - not terminal commands or backend processes, but actual visual indicators in the web interface that show users which phase of the AI generation is currently running. These appear as a simple list in the loading screen, updating in real-time as each AI task completes. No commands needed - they automatically display when the user clicks “Get My Role!” and the AI processing begins.
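
Here’s a minimal sketch of how such a progress list could be driven from the front end - the element ID, step labels, and renderSteps helper are illustrative, not the app’s actual markup:

// Illustrative progress-step renderer: re-draws the list as each AI task completes.
const STEPS = [
  'AI Casting Decision',
  'Generating Costume Image (10-30s)',
  'Evaluation',
];

function renderSteps(completedCount) {
  const list = document.getElementById('progress-steps'); // hypothetical container element
  list.innerHTML = STEPS
    .map((label, i) => `<li>${i < completedCount ? '✅ ' : ''}Step ${i + 1}: ${label}</li>`)
    .join('');
}

// Called from the casting flow: renderSteps(0) when "Get My Role!" is clicked,
// renderSteps(1) after the casting decision, renderSteps(2) after the image, and so on.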

Session replay revealed:

WITHOUT Progress Steps (n=20 early sessions):
0-10 seconds: 20/20 still watching (100%)
10-20 seconds: 18/20 still watching (90%)
20-30 seconds: 14/20 still watching (70%) - rage clicks begin
30-40 seconds: 9/20 still watching (45%) - tab switching detected
40+ seconds: 7/20 still watching (35% stay)

WITH Progress Steps (n=30 after adding them):
0-10 seconds: 30/30 still watching (100%)
10-20 seconds: 29/30 still watching (97%)
20-30 seconds: 25/30 still watching (83%)
30-40 seconds: 23/30 still watching (77%)
40+ seconds: 24/30 still watching (80% stay!)

Critical Discovery: Progress steps more than DOUBLED completion rate (35% → 80%)

This made the difference:

Clear progress steps:

Step 1: AI Casting Decision
Step 2: Generating Costume Image (10-30s)
Step 3: Evaluation

As each completes:

✅ Step 1: AI Casting Decision
Step 2: Generating Costume Image (10-30s)
Step 3: Evaluation

Session replay showed users hovering over the back button at 25 seconds, then relaxing when they saw “Step 2: Generating Costume Image (10-30s).” The moment they understood DALL-E was creating their pet’s costume (not the app freezing), they were willing to wait. Clear progress indicators transform anxiety into patience.

Discovery #2: Observability + online evaluations give the complete picture

Session replay shows user behavior and experience. Online evaluations expose AI output quality through accuracy scoring. Together, they form a solid strategy for AI observability.

To see this in action, let’s take a closer look at an example.

Example: The speed-running corgi owner

In this scenario, a user blazes through the entire pet app flow, from the initial quiz to the final results, in record time. So fast, in fact, that speed killed quality.

Session Replay Showed:

  • Quiz completed in 8 seconds (world record) - they clicked the first option for every question
  • Skipped photo upload entirely
  • Waited the full 31 seconds for processing
  • Got their result: β€œSheep”
  • Started rage clicking on the sheep image immediately
  • Left the site without saving or sharing

Why did their energetic corgi get cast as a sheep? The rushed quiz responses created a contradictory personality profile that confused the AI. Without a photo to provide visual context, the model defaulted to its safest, most generic casting choice.
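
One way to act on this pattern is to flag rushed, straight-lined quizzes before they reach the model. Here’s a hypothetical sketch (isRushedQuiz, showHint, quizAnswers, and quizStartedAt are all assumed names, not the app’s actual code):

// Hypothetical speed-run check: same option every time, or an implausibly fast finish.
function isRushedQuiz(answers, durationMs) {
  const straightLined = answers.every((a) => a.optionIndex === answers[0].optionIndex);
  const tooFast = durationMs < 10_000; // under 10 seconds for the whole quiz
  return straightLined || tooFast;
}

// Before kicking off the AI casting:
if (isRushedQuiz(quizAnswers, Date.now() - quizStartedAt)) {
  showHint('Varied, thoughtful answers give the AI a better personality profile.');
}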

Online Evaluation Results:

  • Evaluation Score: 38/100 ❌
  • Reasoning: “Costume contains unsafe elements: eyeliner, ribbons”
  • Wait, what? The AI suggested face paint and ribbons; the evaluation said NO

Online evaluations use a model-agnostic evaluation (MAE) - an AI agent that evaluates other AI outputs for quality, safety, or accuracy. The out-of-the-box evaluation judge is overly cautious about physical safety. For the above scenario, the evaluation’s comments included:

  • “Costume includes eyeliner which could be harmful to pets” (It’s a DALL-E image!)
  • “Ribbons pose entanglement risk”
  • “Bells are a choking hazard” (It’s AI-generated art!)

About 40% of low scores are actually the evaluation being overprotective about imaginary safety issues, not bad casting.

Speed-runners get generic roles AND the evaluation writes safety warnings about digital costumes. Users see these low scores and think the app doesn’t work well.
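
A rough way to separate “evaluation panic” from genuinely weak castings is to bucket low scores by keywords in the reasoning text. This is a hypothetical sketch (the score threshold, field names, and keyword list are assumptions), not how the 40% figure was originally computed:

// Hypothetical triage: split low-scoring evaluations into "safety panic about a
// digital costume" vs. "the casting itself was weak", based on the reasoning text.
const SAFETY_KEYWORDS = ['unsafe', 'choking', 'entanglement', 'harmful', 'hazard'];

function bucketLowScores(evaluations) {
  const low = evaluations.filter((e) => e.score < 60);
  const safetyPanic = low.filter((e) =>
    SAFETY_KEYWORDS.some((kw) => e.reasoning.toLowerCase().includes(kw))
  );
  return { safetyPanic: safetyPanic.length, weakCasting: low.length - safetyPanic.length };
}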

But speed-running isn’t the whole story. To truly understand the relationship between user engagement and AI quality, we need to see the flip side: the perfect user, one who gives the AI everything it needs to succeed. What happens when a user takes their time and engages thoughtfully with every step?

Example: The perfect match

Session Replay Showed:

  • 45 seconds on quiz (reading each option)
  • Uploaded photo, waited for processing
  • Spent 2 minutes on results page
  • Downloaded image multiple times

Online Evaluation Results:

  • Evaluation Score: 96/100 ⭐⭐⭐⭐⭐
  • Reasoning: “Personality perfectly matches role archetype”
  • Photo bonus: “Visual traits enhanced casting accuracy”

Time invested = Quality received. The AI rewards thoughtfulness.

Discovery #3: The photo upload comedy gold mine

Session replay revealed what photos people ACTUALLY upload. Without it, you’d never know that one in three photo uploads are problematic, and you’d be flying blind on whether to add validation or trust your model.

Example: The surprising photo upload analysis

Session Replay Showed:

Photo Upload Analysis (n=18 who uploaded):
- 12 (67%) Normal pet photos
- 2 (11%) Screenshots of pet photos on their phone
- 1 (6%) Multiple pets in one photo (chaos)
- 1 (6%) Blurry "pet in motion" disaster
- 1 (6%) Stock photo of their breed (cheater!)

Despite 33% problematic inputs, evaluation scores remained high (87-91/100). The AI is remarkably resilient.

Example: When β€œbad” photos produce great results

My Favorite Session: Someone uploaded a photo of their cat mid-yawn. The AI vision model described it as “displaying fierce predatory behavior.” The cat was cast as a “Protective Father.” Evaluation score: 91/100. The owner downloaded it immediately.

The Winner: Someone’s hamster photo that was 90% cage bars. The AI somehow extracted “small fuzzy creature behind geometric patterns” and cast it as “Shepherd” because “clearly experienced at navigating barriers.” Evaluation score: 87/100.

Without session replay, you’d only see evaluation scores and think “the AI is working well.” But session replay reveals users are uploading screenshots and blurry photos - input quality issues that could justify adding photo validation.

However, the high evaluation scores prove the AI handles imperfect real-world data gracefully. This insight saved me from over-engineering photo validation that would have slowed down the user experience for minimal quality gains.

Session replay + online evaluations together answered the question “Should I add photo validation?” The answer: No. Trust the model’s resilience and keep the experience frictionless.
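
If you still want visibility into input quality without adding friction, one lightweight option is to tag each upload with a custom event so you can line upload types up against evaluation scores later. This is a sketch: classifyUpload is a hypothetical helper, and ldClient.track is the JavaScript SDK’s custom-event call.

// Sketch: record the upload type as a custom event instead of blocking the user.
async function handlePhotoUpload(file) {
  const uploadType = await classifyUpload(file); // hypothetical: 'normal' | 'screenshot' | 'blurry' | ...
  ldClient.track('photo-upload', { uploadType, sizeBytes: file.size });
  return file; // no validation gate - the model copes with imperfect photos
}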

The magic formula: Why this combo works (and what surprised me)

Without Observability:

  • “The app seems slow” → ¯\(ツ)/¯
  • “We have 20 visitors but 7 completions” → Where do they drop?

With Session Replay ONLY:

  • “User got sheep and rage clicked; maybe left angry” → Was this a bad match?

With Model-Agnostic Evaluation ONLY:

  • “Evaluation: 22/100 - Eyeliner unsafe for pets” → How did the user react?
  • “Evaluation: 96/100 - Perfect match!” → How did this compare to the image they uploaded?

With BOTH:

  • “User rushed, got sheep with ribbons, evaluation panicked about safety” → The OOTB evaluation treats image generation prompts like real costume instructions

  • “40% of low scores are costume safety, not bad matching” → Need custom evaluation criteria (coming soon!)

  • “Users might think low score = bad casting, but it’s often = protective evaluation” → Would benefit from custom evaluation criteria to avoid this confusion

The evaluation thinks we’re putting actual ribbons on actual cats. It doesn’t realize these are AI-generated images. So when the casting suggests “sparkly collar with bells,” the evaluation judge practically calls animal services.
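
Those custom evaluation criteria could be as simple as telling the judge what it is actually grading. Here’s a hypothetical rubric sketch (the wording and constant name are mine; the real judge configuration happens in the LaunchDarkly dashboard):

// Hypothetical rubric text for a custom judge: make it clear the "costume" is pixels.
const CASTING_ACCURACY_RUBRIC = `
You are scoring a pet casting decision for a Christmas play.
The costume is a DALL-E 3 generated image; no physical items ever touch the pet.
Do not penalize for physical safety of props (ribbons, bells, face paint).
Score 0-100 on how well the assigned role fits the pet's personality profile
and, when a photo was provided, its visible traits.
`;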

Now that you’ve seen what’s possible when you combine user behavior tracking with AI quality scoring, let’s walk through how to add this same observability magic to your own multi-modal AI app.

Your turn: See the complete picture

Want to add this observability magic to your own app? Here’s how:

1. Install the packages

npm install @launchdarkly/observability
npm install @launchdarkly/session-replay

2. Initialize with observability

import { initialize } from 'launchdarkly-js-client-sdk';
import Observability from '@launchdarkly/observability';
import SessionReplay from '@launchdarkly/session-replay';

const ldClient = initialize(clientId, user, {
  plugins: [
    new Observability(),
    new SessionReplay({
      privacySetting: 'strict' // Masks all data on the page - see https://launchdarkly.com/docs/sdk/features/session-replay-config#expand-javascript-code-sample
    })
  ]
});

3. Configure online evaluations in dashboard

Install Judges

Install evaluation judges in your AI Config
  1. Create your AI Config in LaunchDarkly for LLM evaluation
  2. Enable automatic accuracy scoring for production monitoring

Configure Judges

Configure judges for accuracy scoring
  1. Set accuracy weight to 100% for production AI monitoring
  2. Monitor your AI outputs with real-time evaluation scoring

4. Connect the dots

Session replay shows you:

  • Where users drop off
  • What confuses them
  • When they rage click
  • How long they wait

Online evaluations show you:

  • AI decision accuracy scores
  • Why certain outputs scored low
  • Pattern of good vs bad castings
  • Safety concerns (even for pixels!)

Together they reveal the complete story of your AI app.

Resources to get started:

Full Implementation Guide - See how this pet app implements both features

Session Replay Tutorial - Official LaunchDarkly guide for detecting user frustration

When to Add Online Evals - Learn when and how to implement AI evaluation

The real magic is in having observability AND online evaluations.

Try it yourself

Cast your pet: https://scarlett-critter-casting.onrender.com/

See your evaluation score ⭐. Understand why your cat is a shepherd and your dog is an angel. The AI has spoken, and now you can see exactly how much to trust it!


Ready to add AI observability to your multi-modal agents?

Don’t let your AI operate in the dark this holiday season. Get complete visibility into your multi-modal AI systems with LaunchDarkly’s online evaluations and session replay.

Get started: Sign up for a free trial → Create your first AI Config → Enable session replay and online evaluations → Ship with confidence.
