Online evaluations in AI Configs

Online evaluations are in closed beta

Online evaluations and judges are in closed beta. This feature is under active development and may change before general availability. During the beta period, some functionality or provider support may be limited, and performance may vary based on your configuration.

Online evals work only with AI Configs in completion mode

AI Configs support two configuration modes called completion and agent. Online evaluations work only with completion mode AI Configs. Judges cannot be attached to evaluate responses from agent mode AI Configs. To learn more about configuration modes, read Agents in AI Configs.

Overview

This topic describes how to run online evaluations on AI Config variations by attaching judges that score responses for accuracy, relevance, and toxicity. A judge is an AI Config that evaluates responses from another AI Config and returns numeric scores for these criteria. Judges use an evaluation prompt and a consistent scoring framework to assess each model response.

Evaluation scores appear on the Monitoring tab for each variation. They display alongside latency, cost, and user satisfaction metrics. These scores provide a continuous view of model behavior in production. They can help you detect regressions and understand how changes to prompts or models affect performance.

Online evaluations differ from offline or pre-deployment testing. Offline evaluations run against test datasets or static examples. Online evaluations run as your application sends real traffic through an AI Config.

Online evaluations work alongside observability. Observability shows model responses and routing details. Online evaluations add quality scores that you can use in guarded rollouts and experiments.

Use online evaluations to:

  • Monitor model behavior during production use
  • Detect changes in quality after a rollout
  • Trigger alerts or rollback actions based on evaluation scores
  • Compare variations using live performance data

How online evals work

Online evaluations add automated quality checks to AI Configs. Each evaluation produces scores between 0.0 and 1.0. Higher scores indicate better alignment with the evaluation criteria.

A judge is an AI Config that uses an evaluation prompt to score responses from another AI Config. When a variation generates a model response, LaunchDarkly runs the attached judge in the background.

  1. The primary AI Config generates a model response.
  2. The judge evaluates the response using its evaluation prompt.
  3. The judge returns structured results that include numeric scores and brief explanations, such as "score": 0.9, "reasoning": "Accurate and relevant answer".
  4. LaunchDarkly records these results as metrics and displays them on the Monitoring tab.

Evaluations run asynchronously and respect your configured sampling rate. You can adjust sampling to balance cost and visibility.
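
As a conceptual illustration only, and not LaunchDarkly's implementation, the TypeScript sketch below shows what a sampling rate means in practice: at 10 percent, roughly one in ten responses is sent to a judge, and the judge call runs in the background so it never delays the response returned to your users. The runJudge function and the rate value are placeholders.

Example sampling and background judging sketch
// Conceptual illustration only -- not how LaunchDarkly implements sampling.
// At a 10% sampling rate, roughly 1 in 10 responses is scored, and the
// judge call is fire-and-forget so it never blocks the user-facing response.
const SAMPLING_RATE = 0.1; // illustrative value

async function runJudge(modelResponse: string): Promise<void> {
  // Hypothetical placeholder for sending the response to a judge.
  console.log(`Judging response: ${modelResponse.slice(0, 40)}...`);
}

function maybeJudge(modelResponse: string): void {
  if (Math.random() < SAMPLING_RATE) {
    runJudge(modelResponse).catch((err) => console.error('Judge call failed', err));
  }
}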

The following example shows a typical evaluation result with individual metric objects and reasoning:

Example judge evaluation output
{
"accuracy": { "score": 0.85, "reasoning": "Answered correctly with one minor omission" },
"relevance": { "score": 0.92, "reasoning": "Directly addresses the user request" },
"toxicity": { "score": 1.00, "reasoning": "No harmful or unsafe phrasing detected" }
}

LaunchDarkly normalizes provider output into a consistent structure and uses it to create and update evaluation metrics.
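
As a rough illustration of that consistent structure, the TypeScript interfaces below mirror the example output above. The exact fields LaunchDarkly stores internally are not documented here, so treat this shape as an assumption based on the sample.

Example normalized evaluation result types
// Shape inferred from the sample output above; LaunchDarkly's internal
// representation may differ.
interface JudgeMetricResult {
  score: number;     // 0.0 to 1.0; higher means better alignment with the criterion
  reasoning: string; // brief explanation returned by the judge
}

interface JudgeEvaluation {
  accuracy: JudgeMetricResult;
  relevance: JudgeMetricResult;
  toxicity: JudgeMetricResult;
}

// Parsing a judge's raw JSON output into the typed structure:
const raw =
  '{"accuracy":{"score":0.85,"reasoning":"Answered correctly with one minor omission"},' +
  '"relevance":{"score":0.92,"reasoning":"Directly addresses the user request"},' +
  '"toxicity":{"score":1.0,"reasoning":"No harmful or unsafe phrasing detected"}}';
const evaluation: JudgeEvaluation = JSON.parse(raw);
console.log(evaluation.accuracy.score); // 0.85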

Set up and manage judges

If your project has no installed judges, the AI Configs page displays an Install judges banner.

The "Install judges" banner.

The "Install judges" banner.
Configure provider credentials

Online evaluations use your configured model provider credentials. Make sure your organization has connected providers such as OpenAI or Anthropic before installing judges.

Install and attach judges

To install judges:

  1. Click Install judges on the banner.

After you install judges, the banner disappears and judges become available in the Judges section of each variation.

To attach a judge to a variation:

  1. In LaunchDarkly, click AI Configs.

  2. Click the name of the AI Config you want to edit.

  3. Select the Variations tab.

  4. Open a variation or create a new variation.

  5. In the Judges section, click + Attach judges.

    The "Attach judges" panel for a AI Config variation.

    The "Attach judges" panel for an example AI Config variation.
  6. Select one or more judges.

  7. (Optional) Set the sampling percentage to control how many model responses are evaluated.

  8. Click Review and save.

Attached judges remain connected to the variation until you remove them.

Adjust sampling or detach judges

You can adjust sampling or detach judges at any time from the Judges section of a variation.

The "Judges" section for an example AI Config variation.

The "Judges" section for an example AI Config variation.

From this section, you can:

  • Raise or lower the sampling percentage
  • Disable a judge by setting its sampling percentage to 0
  • Remove a judge by clicking its X icon

After you make changes, click Review and save.

Connect your SDK to begin evaluating AI Configs

If the Monitoring tab displays a message prompting you to connect your SDK, LaunchDarkly is waiting for evaluation traffic. Connect an SDK or application integration that uses your AI Config to send model responses. Evaluation metrics appear automatically after responses are received.
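
For example, a minimal Node.js setup that sends model responses through an AI Config might look like the sketch below. It assumes the @launchdarkly/node-server-sdk and @launchdarkly/server-sdk-ai packages with an OpenAI provider; method names can change between SDK versions, especially during the beta, so check the AI SDK reference for your version before copying. Attached judges score the tracked responses in the background, so no judge-specific code is needed here.

Example Node.js AI SDK setup sketch
// Sketch only: confirm package names, method signatures, and model fields
// against the AI SDK reference for your SDK version.
import { init } from '@launchdarkly/node-server-sdk';
import { initAi } from '@launchdarkly/server-sdk-ai';
import OpenAI from 'openai';

const ldClient = init(process.env.LAUNCHDARKLY_SDK_KEY ?? '');
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function main(): Promise<void> {
  await ldClient.waitForInitialization({ timeout: 10 });
  const aiClient = initAi(ldClient);

  const context = { kind: 'user', key: 'example-user-key' };
  // 'my-ai-config' is a placeholder for your AI Config key.
  const config = await aiClient.config('my-ai-config', context, { enabled: false });

  if (config.enabled) {
    // Tracking the provider call records the model response so that any
    // attached judges can score it asynchronously.
    const completion = await config.tracker.trackOpenAIMetrics(() =>
      openai.chat.completions.create({
        model: config.model?.name ?? 'gpt-4o-mini',
        // In a real application, build this from config.messages; a literal
        // prompt keeps the sketch self-contained.
        messages: [{ role: 'user', content: 'Summarize the latest release notes.' }],
      })
    );
    console.log(completion.choices[0]?.message?.content);
  }

  await ldClient.flush();
  ldClient.close();
}

main().catch((err) => console.error(err));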

Run the SDK example

You can use the LaunchDarkly Node.js (server-side) AI SDK example to confirm that evaluations run as expected. The SDK evaluates chat responses using attached judges and supports running a judge directly by key.

To set up the SDK example:

  1. Clone the LaunchDarkly SDK repository.
  2. Build the SDK and navigate to the judge evaluation example.
  3. Follow the README instructions to configure your environment with your LaunchDarkly project key, environment key, and model provider credentials.
  4. Start the example.

The example shows how to evaluate responses with attached judges or by calling a judge directly. Judges run asynchronously and do not block application responses. Evaluation results appear on the Monitoring tab within one to two minutes.
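
Conceptually, running a judge directly by key means passing the judge an input and output pair and receiving the structured scores shown earlier. The sketch below is a hypothetical approximation only: the evaluateJudge method name and its argument shape are placeholders, not the SDK's documented API, so follow the example's README for the actual calls.

Example direct judge invocation sketch (hypothetical)
// Hypothetical sketch: evaluateJudge and its arguments are illustrative
// placeholders, not the SDK's documented API. The aiClient parameter is
// typed loosely for the same reason.
async function runJudgeDirectly(aiClient: any): Promise<void> {
  const context = { kind: 'user', key: 'example-user-key' };
  const scores = await aiClient.evaluateJudge('my-judge-key', context, {
    input: 'What is our refund policy?',
    output: 'Refunds are available within 30 days of purchase.',
  });
  // The result mirrors the judge output shown earlier:
  // { accuracy: { score, reasoning }, relevance: { ... }, toxicity: { ... } }
  console.log(scores);
}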

View results from the Monitoring tab

Open the Monitoring tab for your AI Config to view evaluation results.

The Monitoring tab displays evaluation metrics for each model response. When you install judges, LaunchDarkly creates three metrics that record evaluation scores:

Metric name | Event key              | What it measures
Accuracy    | $ld:ai:judge:accuracy  | How correct and grounded the model response is.
Relevance   | $ld:ai:judge:relevance | How well the response addresses the user request or task.
Toxicity    | $ld:ai:judge:toxicity  | Whether the response includes harmful or unsafe phrasing.

Charts show recent and average scores for each metric. You can view individual results and reasoning details for each data point. Metrics update as evaluations run.

These metrics appear both on the Monitoring tab and in the Metrics list for your project.

Use evaluation metrics in guardrails and experiments

Evaluation metrics appear as selectable metrics in guarded rollouts and experiments.

  • In guarded rollouts, you can pause or revert a rollout when evaluation scores fall below a threshold.
  • In experiments, you can use evaluation metrics as experiment goals to compare variations.

This creates a connected workflow for releasing and evaluating changes to prompts and models.

Privacy and data handling

Online evaluations run within your LaunchDarkly environment using your configured model providers. LaunchDarkly does not store or share your prompts, model responses, or evaluation data with any third-party systems.

For more information, read AI Configs and information privacy.