LLM evaluation guide: When to add online evals to your AI application

Published November 13th, 2025

by Scarlett Attensil

Newer features are available with AgentControl

This tutorial was published in November 2025, when online evaluations were limited to using three built-in judges. Since then, LaunchDarkly has shipped:

  • Custom judges: Write your own LLM-as-a-judge for domain-specific criteria like security, contract adherence, or scope discipline
  • Offline evaluations and Datasets: Run the same judges as regression tests against a saved input set before promoting a variation
  • Manual LLM span tracing: Instrument custom spans beyond auto-tracing for richer observability data feeding into evals

To learn more, read AgentControl.

The quick decision framework

Online evals provide real-time quality monitoring for LLM applications. Using LLM-as-a-judge methodology, they run automated quality checks on a configurable percentage of your production traffic, producing structured scores and pass/fail judgments you can act on programmatically. LaunchDarkly includes three built-in judges: accuracy, relevance, and toxicity.

Skip online evals if:

  • Your checks are purely deterministic (schema validation, compile tests)
  • You have low volume and can manually review outputs in observability dashboards
  • You’re primarily debugging execution problems

Add online evals when:

  • You need quantified quality scores to trigger automated actions (rollback, rerouting, alerts)
  • Manual quality review doesn’t scale to your traffic volume
  • You’re measuring multiple quality dimensions (accuracy, relevance, toxicity)
  • You want statistical quality trends across segments for AI governance and compliance
  • You need to monitor token usage and cost alongside quality metrics
  • You’re running A/B tests or guarded releases and need automated quality gates

Most teams add them within 2-3 sprints when manual quality review becomes the bottleneck. Configurable sampling rates let you balance evaluation coverage with cost and latency.
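
To make the sampling trade-off concrete, here is a minimal Python sketch of rate-based sampling and the judge-call volume it implies. It illustrates the idea only and is not how LaunchDarkly implements sampling. The helper names `should_evaluate` and `estimated_judge_calls`, and the request ID format, are assumptions made for this example.

```python
import hashlib


def should_evaluate(request_id: str, sampling_rate: float) -> bool:
    """Deterministically decide whether a request falls in the evaluated sample.

    Hypothetical helper: hashes the request ID into [0, 1) and compares it
    with the sampling rate, so the same request always gets the same answer.
    """
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sampling_rate


def estimated_judge_calls(daily_requests: int, sampling_rate: float, judges: int) -> int:
    """Rough daily number of judge invocations for a given sampling rate."""
    return round(daily_requests * sampling_rate * judges)


if __name__ == "__main__":
    # 10,000 requests per day, 10% sampling, three judges attached.
    print(estimated_judge_calls(10_000, 0.10, 3))
```

Doubling the sampling rate doubles the number of judge calls, so cost scales linearly with coverage. That is why a modest starting rate is usually enough to see trends.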

Online evals vs. LLM observability

LLM observability shows you what happened. Online evals automatically assess quality and trigger actions based on those assessments.

LLM observability: your security camera

LLM observability shows you everything that happened through distributed tracing: full conversations, tool calls, token usage, latency breakdowns, and cost attribution. Perfect for debugging and understanding what went wrong. But when you’re handling 10,000 conversations daily, manually reviewing them for quality patterns doesn’t scale.

Online evals: your security guard

Online evals automatically score every sampled request using LLM-as-a-judge methodology across your quality rubric (accuracy, relevance, toxicity) and take action. Instead of exporting conversations to spreadsheets for manual review, you get real-time quality monitoring with drift detection that triggers alerts, rollbacks, or rerouting.

The 3 AM difference

Without evals: “Let’s meet tomorrow to review samples and decide if we should roll back.”

With evals: “Quality dropped below threshold, automatic rollback triggered, here’s what failed…”

How online evals actually work

LaunchDarkly’s online evals use LLM-as-a-judge methodology with three built-in judges you can configure directly in the dashboard. No code changes required.

Getting started:

  1. Install judges from the AgentControl menu
  2. Attach judges to AgentControl config variations
  3. Configure sampling rates (balance coverage with cost/latency)
  4. Evaluation metrics are automatically emitted as custom events
  5. Metrics are automatically available for A/B tests and guarded releases

What you get from each built-in judge:

Accuracy judge:

{
  "score": 0.85,
  "reasoning": "Response correctly answered the question but missed one edge case regarding error handling"
}

Relevance judge:

{
  "score": 0.92,
  "reasoning": "Response directly addressed the user's query with appropriate context and examples"
}

Toxicity judge:

{
  "score": 0.0,
  "reasoning": "Content is professional and appropriate with no toxic language detected"
}

Each judge returns a score from 0.0 to 1.0 plus reasoning that explains the assessment. LaunchDarkly’s built-in judges (accuracy, relevance, toxicity) have fixed evaluation criteria and are configured only by selecting the provider and model.
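
If you consume judge output in your own code, it helps to validate it before acting on it. The following is a minimal sketch, assuming the result reaches you as a JSON string in the shape shown above. `JudgeResult` and `parse_judge_result` are hypothetical names for this example, not part of the LaunchDarkly SDK.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeResult:
    """A judge verdict: a score from 0.0 to 1.0 plus the judge's reasoning."""
    score: float
    reasoning: str


def parse_judge_result(payload: str) -> JudgeResult:
    """Parse a JSON judge payload and reject scores outside the documented range."""
    data = json.loads(payload)
    score = float(data["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score {score} is outside the 0.0 to 1.0 range")
    # Reasoning is treated as optional here; that is an assumption of this sketch.
    return JudgeResult(score=score, reasoning=str(data.get("reasoning", "")))


if __name__ == "__main__":
    raw = '{"score": 0.85, "reasoning": "Missed one edge case"}'
    print(parse_judge_result(raw))
```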

Configuration: Configure judges from the AgentControl menu in your LaunchDarkly dashboard. We provide three pre-configured judges out of the box (Accuracy, Relevance, Toxicity), and you can create your own custom judges. When configuring your config variations, select which judges to attach and set your desired sampling rate. You can also retrieve and invoke a judge programmatically using the AI SDK to evaluate input and output directly. Use different judge combinations or invocation patterns across environments to match your quality requirements and cost constraints.

Real problems online evals solve

Scale for production applications: Your SQL generator handles 50,000 queries daily. LLM observability shows you every query through distributed tracing. Online evals tell you the proportion that are semantically wrong, automatically, with hallucination detection built in.

Multi-dimensional quality monitoring: Quality for a customer service AI application isn’t just “did it respond?” It’s accuracy, relevance, toxicity, compliance, and appropriateness. Online evals score all dimensions simultaneously, each with its own threshold and reasoning.

RAG pipeline validation: Your retrieval-augmented generation system needs continuous monitoring of both retrieval quality and generation accuracy. Online evals can assess whether retrieved context is relevant and whether the response accurately uses that context, preventing hallucinations and ensuring factual grounding.

Cost and performance optimization: Monitor token usage alongside quality metrics. If certain queries consume 10x more tokens than others, online evals help identify these patterns so you can optimize prompts or routing logic to reduce costs without sacrificing quality.
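
As a rough illustration of spotting token-heavy queries next to their quality scores, the sketch below flags records that use at least ten times the median token count. The `EvalRecord` fields and the `token_outliers` helper are assumptions for this example. In practice the data would come from your own exported metrics.

```python
from statistics import median
from typing import NamedTuple


class EvalRecord(NamedTuple):
    """One evaluated request (hypothetical shape for this sketch)."""
    query_id: str
    total_tokens: int
    accuracy: float


def token_outliers(records: list[EvalRecord], factor: float = 10.0) -> list[EvalRecord]:
    """Return records whose token usage is at least `factor` times the median."""
    if not records:
        return []
    baseline = median(r.total_tokens for r in records)
    return [r for r in records if r.total_tokens >= factor * baseline]


if __name__ == "__main__":
    sample = [
        EvalRecord("a", 100, 0.90),
        EvalRecord("b", 120, 0.80),
        EvalRecord("c", 110, 0.85),
        EvalRecord("d", 1500, 0.90),
    ]
    for record in token_outliers(sample):
        print(record.query_id, record.total_tokens, record.accuracy)
```

An outlier with a high accuracy score is a candidate for prompt trimming or cheaper routing. An outlier with a low score is spending more to deliver less, so it should be investigated first.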

Actionable metrics for AI governance: Transform 10,000 responses from data to decisions with evaluator-driven quality gates:

  • Accuracy trending below 0.8? Automated alerts to the team
  • Toxicity above 0.2? Immediate review and potential rollback
  • Relevance dropping for specific user segments? Targeted configuration updates
  • Metrics automatically feed A/B tests and guarded releases for continuous improvement
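
The gates above can be sketched as a small rule table. This is an illustrative outline under stated assumptions, not LaunchDarkly's mechanism: the `Gate` type, the action names, and the `actions_for` helper are hypothetical. The scores passed in are assumed to be averages over a recent window rather than single responses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Gate:
    """A quality gate: which metric to check, when it is breached, what to do."""
    metric: str
    breached: Callable[[float], bool]
    action: str


# Thresholds taken from the examples in the list above; action names are made up.
GATES = [
    Gate("accuracy", lambda score: score < 0.8, "alert_team"),
    Gate("toxicity", lambda score: score > 0.2, "review_and_consider_rollback"),
]


def actions_for(scores: dict[str, float], gates: list[Gate] = GATES) -> list[str]:
    """Return the actions whose gates are breached by the given metric scores."""
    return [
        gate.action
        for gate in gates
        if gate.metric in scores and gate.breached(scores[gate.metric])
    ]


if __name__ == "__main__":
    print(actions_for({"accuracy": 0.75, "toxicity": 0.3}))
```

Keeping the rules in data rather than in branching code makes it easier to add a dimension or adjust a threshold without touching the evaluation logic.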

Example implementation path

Week 1-2: Define quality dimensions and install judges. Use LLM observability alone first. Manually review samples to understand your system. Define your quality dimensions: accuracy, relevance, toxicity, or other criteria specific to your application. Install the built-in judges from the AgentControl menu in LaunchDarkly.

Week 3-4: Attach judges with sampling. Attach judges to config variations in LaunchDarkly. Start with one or two key judges (accuracy and relevance are good defaults). Configure sampling rates between 10% and 20% of traffic to balance coverage with cost and latency. Compare automated scores with human judgment to validate that the judges work for your use case.
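
One simple way to compare automated scores with human judgment is a pass/fail agreement rate over a hand-labeled sample. The sketch below assumes you have paired judge scores and reviewer verdicts. The `agreement_rate` helper and the 0.8 pass threshold are assumptions for illustration.

```python
def agreement_rate(
    judge_scores: list[float],
    human_passes: list[bool],
    pass_threshold: float = 0.8,
) -> float:
    """Fraction of samples where the judge's pass/fail matches the human verdict."""
    if len(judge_scores) != len(human_passes):
        raise ValueError("judge_scores and human_passes must have the same length")
    if not judge_scores:
        raise ValueError("at least one labeled sample is required")
    matches = sum(
        (score >= pass_threshold) == human_pass
        for score, human_pass in zip(judge_scores, human_passes)
    )
    return matches / len(judge_scores)


if __name__ == "__main__":
    scores = [0.9, 0.85, 0.4, 0.95]
    reviewer = [True, True, False, False]
    print(agreement_rate(scores, reviewer))
```

A low agreement rate suggests either that the threshold is wrong for your application or that the judge's criteria do not match what your reviewers care about. Both are worth resolving before you automate actions on the scores.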

Week 5+: Operationalize with quality gates. Add more evaluation dimensions as you learn. Connect scores to automated actions and evaluator-driven quality gates: when accuracy drops below 0.7, trigger alerts; when toxicity exceeds 0.2, investigate immediately. Leverage the custom events and metrics for A/B testing and guarded releases to continuously improve your application’s performance.
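
Acting on a single low score can be noisy, so one option is to alert on a rolling average instead. The sketch below is a hedged illustration of that idea. The `RollingScore` class, the window size, and the way scores reach your code are all assumptions rather than LaunchDarkly features.

```python
from collections import deque


class RollingScore:
    """Track the mean of the most recent scores and flag when it drops too low."""

    def __init__(self, window: int, threshold: float) -> None:
        if window <= 0:
            raise ValueError("window must be a positive integer")
        self._scores: deque[float] = deque(maxlen=window)
        self._threshold = threshold

    def add(self, score: float) -> bool:
        """Record a score; return True once the window is full and its mean is below the threshold."""
        self._scores.append(score)
        if len(self._scores) < self._scores.maxlen:
            return False  # Not enough data yet to judge a trend.
        return sum(self._scores) / len(self._scores) < self._threshold


if __name__ == "__main__":
    monitor = RollingScore(window=3, threshold=0.7)
    for value in [0.9, 0.6, 0.5, 0.9, 0.9]:
        print(value, monitor.add(value))
```

A larger window smooths out one-off bad responses at the cost of reacting more slowly. The right size depends on your traffic volume and sampling rate.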

The bottom line

You don’t need online evals on day one. Start with LLM observability to understand your AI system through distributed tracing. Add evaluations when you hear yourself saying “we need to review more conversations” or “how do we know if quality is degrading?”

LaunchDarkly’s three built-in judges (accuracy, relevance, toxicity) provide LLM-as-a-judge evaluation that you can attach to any config variation with configurable sampling rates. You can also invoke a judge programmatically using the AI SDK. When judges are attached in the UI, evaluation metrics are automatically emitted as custom events and feed directly into A/B tests and guarded releases, enabling continuous AI governance and quality improvement without code changes. Start simple with one judge, learn what matters for your application, and expand from there.

LLM observability is your security camera. Online evals are your security guard.

Next steps

Ready to get started? Sign up for a free LaunchDarkly account if you haven’t already.

Build a complete quality pipeline:

  • AgentControl CI/CD Pipeline - Add automated quality gates and LLM-as-a-judge testing to your deployment process
  • Offline Evaluation of RAG-Grounded Answers - Catch generation regressions with the same dataset before they reach production
  • Evaluate LLM code generation with LLM-as-judge evaluators - Build domain-specific custom judges to complement the built-in ones
  • Combine offline evaluation (in CI/CD) with online evals (in production) for comprehensive quality coverage

Learn more about AgentControl:

  • AgentControl documentation - Understand how configs enable real-time LLM configuration
  • Online evals documentation - Deep dive into judge installation and configuration

See it in action:

  • Check LLM observability in the LaunchDarkly dashboard to track your AI application performance with distributed tracing

Industry standards: LaunchDarkly’s approach aligns with emerging AI observability standards, including OpenTelemetry’s semantic conventions for AI monitoring, ensuring your evaluation infrastructure integrates with the broader observability ecosystem.