Online evals in AI Configs

Online evals are in closed beta

Online evals and AI judges are in closed beta. This feature is still under active development and may change before general availability. During the closed beta period, some functionality or provider support may be limited, and performance may vary depending on your configuration.

Online evals only work with completion mode AI Configs

AI Configs have two configuration modes: completion and agent. Online evals are currently only supported for AI Configs in completion mode. If you’re using agent-based AI Configs, judges cannot be attached to evaluate their outputs at this time. To learn more about the differences between these modes, read Agents in AI Configs.

Overview

This topic describes how to run online evals on AI Config variations, automatically scoring each output for accuracy, relevance, and toxicity using another AI Config that acts as an LLM-as-judge.

Online evals use AI judges, which are a type of AI Config that evaluates another AI Config’s output in real time. Judges apply a consistent evaluation prompt and scoring framework to measure each response on key metrics such as accuracy, relevance, and toxicity.

Scores appear automatically on the Monitoring tab for each variation, alongside latency, cost, and satisfaction metrics. This provides a continuous signal of model performance with real users and data, enabling teams to detect regressions, improve reliability, and apply guardrails within the AI Config workflow.

Online evals differ from offline or pre-deployment testing. Instead of running evaluations manually in a sandbox or against datasets, they measure AI Config quality continuously in production.

Online evals work alongside observability. Observability helps you view model responses and routing details, while online evals provide quality scores you can use to trigger alerts, manage guarded rollouts, or run experiments.

Use online evals to:

  • Continuously monitor AI quality in production
  • Detect regressions immediately after a rollout
  • Automate rollbacks or alerts based on evaluation metrics
  • Compare prompt or model variations using live performance data

How online evals work

Online evals extend AI Configs with continuous, automated quality checks. Each evaluation produces scores that LaunchDarkly records as metrics, similar to latency or cost. Scores range from 0.0 to 1.0, with higher values indicating better alignment with the evaluation criteria.

A judge is an AI Config that uses an evaluation prompt instead of a production prompt to score another AI Config’s responses. When an AI Config generates a model response, LaunchDarkly runs an attached judge in the background.

  1. The primary AI Config generates a model response.
  2. The attached judge evaluates that response using its predefined criteria.
  3. The judge returns a score and a short reason, for example "score": 0.9, "reason": "Accurate and relevant answer".
  4. LaunchDarkly records these results as metrics and displays them on the Monitoring tab.

Evaluations run asynchronously and respect your configured sampling rate, allowing you to balance cost and visibility. The following example shows a typical judge evaluation output that includes per-metric objects with numeric scores and reasoning for each evaluation result.

Example judge evaluation output
{
  "accuracy": { "score": 0.85, "reasoning": "Answered correctly with one minor omission" },
  "relevance": { "score": 0.92, "reasoning": "Directly addresses the user request" },
  "toxicity": { "score": 1.00, "reasoning": "No harmful or unsafe phrasing detected" }
}

Each score is stored as a metric inside AI Configs, so you can view, compare, and act on quality signals in real time. Metrics appear when your application or the SDK sends model responses through the AI Config. Online evals use structured output formats from supported providers, ensuring consistent metric and reasoning fields across evaluations.
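
If you consume these scores programmatically, the result shape shown in the example above can be modeled with a small type. The following TypeScript sketch is illustrative only and mirrors that example; the field names are assumptions, so confirm the structured output your judge actually returns before depending on them.

// Illustrative types that mirror the judge output example above.
// Field names are assumptions based on that example; verify them
// against the structured output your judge actually returns.
interface JudgeMetricResult {
  score: number;      // 0.0 to 1.0, higher means better alignment with the criteria
  reasoning: string;  // the judge's short explanation for the score
}

interface JudgeEvaluation {
  accuracy?: JudgeMetricResult;
  relevance?: JudgeMetricResult;
  toxicity?: JudgeMetricResult;
}

// Example use: flag a response whose accuracy score falls below a chosen threshold.
function isLowAccuracy(evaluation: JudgeEvaluation, threshold = 0.7): boolean {
  return (evaluation.accuracy?.score ?? 1) < threshold;
}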

Set up and manage judges

If no judges are installed in your project, LaunchDarkly displays an Install Judges banner on the AI Configs page.

The "Install judges" banner.

The "Install judges" banner.
Configure provider credentials

Online evals use your configured model provider credentials to run evaluations. Make sure your organization’s model providers, such as OpenAI or Anthropic, are connected before installing judges.

Install and attach judges

To install judges in your project:

  1. Click Install judges on the banner.
  2. Choose the built-in judges you want to make available. LaunchDarkly includes pre-configured judges for accuracy, relevance, and toxicity.
  3. Save your changes.

After you install judges, the banner disappears and judges become available in the Judges section of each AI Config variation.

To attach a judge to a configuration:

  1. Navigate to AI Configs and select the configuration you want to monitor.
  2. Open the Judges section for the variation.
  3. Choose a Judge from the list. The judge is on by default and targets the Default variation.
  4. (Optional) Edit the judge model to use a specific provider, such as OpenAI or Anthropic. Judge messages are fixed to ensure consistent scoring.
  5. Set the sampling percentage to control how many model responses are evaluated.
  6. Click Save.

Saved judges persist between sessions and begin evaluating automatically when the AI Config is active.

Adjust sampling or detach judges

You can change sampling or detach judges at any time. To do so:

  1. Open the Judges section for the variation.
  2. Change the sampling percentage to increase or decrease evaluation frequency.
  3. To remove a judge, click Detach.
  4. Click Save.

As a starting point, sample 10 to 20 percent of traffic to balance cost with timely signal. Changes apply immediately to new model responses.
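
To get a rough sense of how the sampling percentage translates into evaluation volume, multiply your response traffic by the sampling rate. The figures below are hypothetical and only illustrate the arithmetic:

// Hypothetical traffic figures, for illustration only.
const dailyModelResponses = 50_000; // example volume served by the AI Config
const samplingRate = 0.10;          // the 10 percent starting point suggested above

// Each sampled response triggers a background judge evaluation.
const dailyJudgeEvaluations = dailyModelResponses * samplingRate;
console.log(`~${dailyJudgeEvaluations} judge evaluations per day`); // ~5000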

Connect your SDK to begin evaluating AI Configs

If the Monitoring tab displays a message prompting you to connect your SDK, LaunchDarkly is waiting for evaluation traffic. Connect an SDK or application integration that uses your AI Config to send model responses. Once connected, online evaluation metrics begin appearing automatically.
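
The sketch below shows one way to send model responses through an AI Config with the Node.js (server-side) AI SDK so that attached judges can evaluate them. It is a minimal, illustrative example: it assumes the @launchdarkly/server-sdk-ai package alongside the Node.js server SDK and an OpenAI client, and the placeholder keys, method names, and config properties may differ by SDK version, so follow the SDK reference for exact usage.

import { init } from '@launchdarkly/node-server-sdk';
import { initAi } from '@launchdarkly/server-sdk-ai';
import OpenAI from 'openai';

async function main() {
  // Placeholder keys: substitute your own SDK key and AI Config key.
  const ldClient = init(process.env.LAUNCHDARKLY_SDK_KEY ?? '');
  await ldClient.waitForInitialization({ timeout: 10 });

  const aiClient = initAi(ldClient);
  const openai = new OpenAI();
  const context = { kind: 'user', key: 'example-user-key' };

  // Fetch the AI Config variation for this context, with a disabled fallback.
  const config = await aiClient.config('my-ai-config-key', context, { enabled: false });
  if (!config.enabled) {
    return;
  }

  // Tracking the completion through the SDK sends the model response to
  // LaunchDarkly, which lets attached judges evaluate it asynchronously.
  const completion = await config.tracker.trackOpenAIMetrics(() =>
    openai.chat.completions.create({
      model: config.model?.name ?? 'gpt-4o-mini',
      messages: (config.messages ?? []) as OpenAI.Chat.Completions.ChatCompletionMessageParam[],
    }),
  );

  console.log(completion.choices[0]?.message?.content);
  await ldClient.flush();
  ldClient.close();
}

main().catch(console.error);

Once responses flow through the tracker, scores from attached judges begin appearing on the Monitoring tab. Alternatively, you can run the official SDK example described in the next section.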

Run the SDK example (optional)

You can use the LaunchDarkly Node.js (server-side) AI SDK example to test your online evals setup and confirm that metrics appear as expected. The SDK automatically evaluates chat responses using configured judges and supports standalone judge creation for direct evaluation.

To set up the SDK example:

  1. Clone the LaunchDarkly SDK repository.
  2. Build the SDK and navigate to the judge evaluation example.
  3. Follow the README instructions to configure your environment with your LaunchDarkly project key, environment key, and model provider credentials.
  4. Start the example.

The example demonstrates two approaches: evaluating with judges attached to an AI Config and evaluating using a specific judge by key. Judges run asynchronously in the background and do not block application responses. Evaluation results appear as metrics in the Monitoring tab within one to two minutes.

View results

Open the Monitoring tab for your AI Config to view online evaluation results.

Monitoring view

The Monitoring tab displays evaluation metrics for each model response. When you install judges, LaunchDarkly automatically creates three metrics that record evaluation scores for your AI Configs:

  • Accuracy (event key: ld_autogen__ai-judge-accuracy): How factually correct and contextually grounded the model output is.
  • Relevance (event key: ld_autogen__ai-judge-relevance): How well the model output addresses the user prompt or task.
  • Toxicity (event key: ld_autogen__ai-judge-toxicity): Whether the model output contains harmful, biased, or unsafe phrasing.

Charts display average and recent scores for each metric. You can view individual scores and reasoning details for each data point. All metrics update continuously as new evaluations run. Each metric records a numeric score between 0.0 and 1.0, where higher values indicate better performance.

Together, these metrics provide a continuous signal of AI performance in production and appear automatically on the Monitoring tab and in your project’s Metrics list.

Use evaluation metrics in guardrails and experiments

Online evaluation metrics appear automatically as selectable metrics in guarded rollouts and experiments.

  • Guarded rollouts: Pause or revert a rollout if evaluation scores fall below a threshold.
  • Experiments: Use evaluation metrics as experiment goals to compare variations.

This creates a connected workflow for releasing, evaluating, and improving model behavior safely.

Privacy and data handling

Online evals run entirely within your LaunchDarkly environment using your configured model providers. LaunchDarkly does not store or share your prompts, model responses, or evaluation data with any third-party systems.

For more information, read AI Configs and information privacy.