LLM evaluation guide: When to add online evals to your AI application

Published November 13th, 2025

by Scarlett Attensil

Newer features are available with AgentControl

This tutorial was published in November 2025, when online evaluations were limited to using three built-in judges. Since then, LaunchDarkly has shipped:

Custom judges: Write your own LLM-as-a-judge for domain-specific criteria like security, contract adherence, or scope discipline
Offline evaluations and Datasets: Run the same judges as regression tests against a saved input set before promoting a variation
Manual LLM span tracing: Instrument custom spans beyond auto-tracing for richer observability data feeding into evals

To learn more, read AgentControl.

The quick decision framework

Online evals provide real-time quality monitoring for LLM applications. Using LLM-as-a-judge methodology, they run automated quality checks on a configurable percentage of your production traffic, producing structured scores and pass/fail judgments you can act on programmatically. LaunchDarkly includes three built-in judges: accuracy, relevance, and toxicity.

Skip online evals if:

Your checks are purely deterministic (schema validation, compile tests)
You have low volume and can manually review outputs in observability dashboards
You’re primarily debugging execution problems

Add online evals when:

You need quantified quality scores to trigger automated actions (rollback, rerouting, alerts)
Manual quality review doesn’t scale to your traffic volume
You’re measuring multiple quality dimensions (accuracy, relevance, toxicity)
You want statistical quality trends across segments for AI governance and compliance
You need to monitor token usage and cost alongside quality metrics
You’re running A/B tests or guarded releases and need automated quality gates

Most teams add them within 2-3 sprints when manual quality review becomes the bottleneck. Configurable sampling rates let you balance evaluation coverage with cost and latency.

Online evals vs. LLM observability

LLM observability shows you what happened. Online evals automatically assess quality and trigger actions based on those assessments.

LLM observability: your security camera

LLM observability shows you everything that happened through distributed tracing: full conversations, tool calls, token usage, latency breakdowns, and cost attribution. Perfect for debugging and understanding what went wrong. But when you’re handling 10,000 conversations daily, manually reviewing them for quality patterns doesn’t scale.

Online evals: your security guard

Automatically scores every sampled request using LLM-as-a-judge methodology across your quality rubric (accuracy, relevance, toxicity) and takes action. Instead of exporting conversations to spreadsheets for manual review, you get real-time quality monitoring with drift detection that triggers alerts, rollbacks, or rerouting.

The 3 AM difference

Without evals: “Let’s meet tomorrow to review samples and decide if we should rollback.”

With evals: “Quality dropped below threshold, automatic rollback triggered, here’s what failed…”

How online evals actually work

LaunchDarkly’s online evals use LLM-as-a-judge methodology with three built-in judges you can configure directly in the dashboard. No code changes required.

Getting started:

Install judges from the AgentControl menu
Attach judges to AgentControl config variations
Configure sampling rates (balance coverage with cost/latency)
Evaluation metrics are automatically emitted as custom events
Metrics are automatically available for A/B tests and guarded releases

What you get from each built-in judge:

Accuracy judge:

1 {
2   "score": 0.85,
3   "reasoning": "Response correctly answered the question but missed one edge case regarding error handling"
4 }

Relevance judge:

1 {
2   "score": 0.92,
3   "reasoning": "Response directly addressed the user's query with appropriate context and examples"
4 }

Toxicity judge:

1 {
2   "score": 0.0,
3   "reasoning": "Content is professional and appropriate with no toxic language detected"
4 }

Each judge returns a score from 0.0 to 1.0 plus reasoning that explains the assessment. LaunchDarkly’s built-in judges (accuracy, relevance, toxicity) have fixed evaluation criteria and are configured only by selecting the provider and model.

Configuration: Configure judges from the AgentControl menu in your LaunchDarkly dashboard. We provide three pre-configured judges out-of-the box (Accuracy, Relevance, Toxicity), and you can create your own custom judges. When configuring your config variations, select which judges to attach and set your desired sampling rate. You can also retrieve and invoke a judge programmatically using the AI SDK to evaluate input and output directly. Use different judge combinations or invocation patterns across environments to match your quality requirements and cost constraints.

Real problems online evals solve

Scale for production applications: Your SQL generator handles 50,000 queries daily. LLM observability shows you every query through distributed tracing. Online evals tell you the proportion that are semantically wrong, automatically, with hallucination detection built in.

Multi-dimensional quality monitoring: Customer service AI applications aren’t just “did it respond?” It’s accuracy, relevance, toxicity, compliance, and appropriateness. Online evals score all dimensions simultaneously, each with its own threshold and reasoning.

RAG pipeline validation: Your retrieval-augmented generation system needs continuous monitoring of both retrieval quality and generation accuracy. Online evals can assess whether retrieved context is relevant and whether the response accurately uses that context, preventing hallucinations and ensuring factual grounding.

Cost and performance optimization: Monitor token usage alongside quality metrics. If certain queries consume 10x more tokens than others, online evals help identify these patterns so you can optimize prompts or routing logic to reduce costs without sacrificing quality.

Actionable metrics for AI governance: Transform 10,000 responses from data to decisions with evaluator-driven quality gates:

Accuracy trending below 0.8? Automated alerts to the team
Toxicity above 0.2? Immediate review and potential rollback
Relevance dropping for specific user segments? Targeted configuration updates
Metrics automatically feed A/B tests and guarded releases for continuous improvement

Example implementation path

Week 1-2: Define quality dimensions and install judges. Use LLM observability alone first. Manually review samples to understand your system. Define your quality dimensions: accuracy, relevance, toxicity, or other criteria specific to your application. Install the built-in judges from the AgentControl menu in LaunchDarkly.

Week 3-4: Attach judges with sampling. Attach judges to config variations in LaunchDarkly. Start with one or two key judges (accuracy and relevance are good defaults). Configure sampling rates between 10-20% of traffic to balance coverage with cost and latency. Compare automated scores with human judgment to validate the judges work for your use case.

Week 5+: Operationalize with quality gates. Add more evaluation dimensions as you learn. Connect scores to automated actions and evaluator-driven quality gates: when accuracy drops below 0.7, trigger alerts; when toxicity exceeds 0.2, investigate immediately. Leverage the custom events and metrics for A/B testing and guarded releases to continuously improve your application’s performance.

The bottom line

You don’t need online evals on day one. Start with LLM observability to understand your AI system through distributed tracing. Add evaluations when you hear yourself saying “we need to review more conversations” or “how do we know if quality is degrading?”

LaunchDarkly’s three built-in judges (accuracy, relevance, toxicity) provide LLM-as-a-judge evaluation that you can attach to any config variation with configurable sampling rates. You can also invoke a judge programmatically using the AI SDK. When judges are attached in the UI, evaluation metrics are automatically emitted as custom events and feed directly into A/B tests and guarded releases, enabling continuous AI governance and quality improvement without code changes. Start simple with one judge, learn what matters for your application, and expand from there.

LLM observability is your security camera. Online evals are your security guard.

Next steps

Ready to get started? Sign up for a free LaunchDarkly account if you haven’t already.

Build a complete quality pipeline:

AgentControl CI/CD Pipeline - Add automated quality gates and LLM-as-a-judge testing to your deployment process
Offline Evaluation of RAG-Grounded Answers - Catch generation regressions with the same dataset before they reach production
Evaluate LLM code generation with LLM-as-judge evaluators - Build domain-specific custom judges to complement the built-in ones
Combine offline evaluation (in CI/CD) with online evals (in production) for comprehensive quality coverage

Learn more about AgentControl:

AgentControl documentation - Understand how configs enable real-time LLM configuration
Online evals documentation - Deep dive into judge installation and configuration

See it in action:

Check LLM observability in the LaunchDarkly dashboard to track your AI application performance with distributed tracing

Industry standards: LaunchDarkly’s approach aligns with emerging AI observability standards, including OpenTelemetry’s semantic conventions for AI monitoring, ensuring your evaluation infrastructure integrates with the broader observability ecosystem.