Create and manage custom judges for online evals

Overview

This topic explains how to create and manage custom judges for online evaluations in AI Configs.

Online evaluations in AI Configs score AI Config outputs in production by evaluating model responses with a judge. A judge is a specialized AI Config that uses a large language model (LLM) to evaluate responses and return numeric scores that represent quality signals such as accuracy, relevance, or toxicity.

Custom judges let organizations define what quality means for their own products and domains. With custom judges, teams can:

Standardize and reuse approved evaluation logic across AI Configs and environments
Measure domain-specific quality signals such as accuracy, relevance, or toxicity
Apply consistent quality criteria in monitoring, guarded rollouts, and experiments to detect regressions in production

You can attach custom judges to completion-mode AI Config variations in the LaunchDarkly UI. For other variations, invoke a custom judge programmatically using the AI SDK. To learn more, read Online evaluations in AI Configs.

SDK requirement

To record evaluation results from custom judges, you must use a LaunchDarkly AI SDK version that includes online evaluation support. If your SDK does not support online evaluations, judges can be attached but no evaluation metrics will appear.

If your application uses an earlier AI SDK version, judge results may still be available to application code, but evaluation metrics are not automatically recorded in LaunchDarkly. In these cases, teams can access judge properties directly and handle evaluation results manually.

Access control and prerequisites

Custom judges use the same access control model as other AI Configs. To create or edit a judge, you must have permission to create and update AI Configs in the project.

Online evaluations must be enabled for the project before evaluation metrics can be recorded or displayed. If online evaluations are not enabled, judges can be attached but no evaluation results will appear.

Automatic recording of evaluation metrics is supported starting in:

Python AI SDK version 0.14.0
Node.js AI SDK version 0.16.1
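
As a rough illustration of the version requirement above, the following sketch compares an installed SDK version string against the listed minimums. The helper names and the `"python"` and `"node"` dictionary keys are hypothetical and are not part of any LaunchDarkly SDK. The sketch assumes plain `major.minor.patch` version strings without pre-release suffixes.

```python
from typing import Dict, Tuple

# Minimum AI SDK versions with automatic metric recording, as listed above.
# The dictionary keys are illustrative labels, not official package names.
MIN_VERSIONS: Dict[str, Tuple[int, ...]] = {
    "python": (0, 14, 0),
    "node": (0, 16, 1),
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '0.14.0' into (0, 14, 0). Pre-release suffixes are not handled."""
    return tuple(int(part) for part in version.split("."))


def supports_auto_recording(sdk: str, version: str) -> bool:
    """Return True if the given SDK version meets the listed minimum."""
    return parse_version(version) >= MIN_VERSIONS[sdk]
```

A check like this could run at application startup to warn when judge results will need to be handled manually.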

Create and manage custom judges

Create custom judges from the Create AI Config dialog. When you create a judge, LaunchDarkly provides a default evaluation configuration that you can customize for your use case.

Create a custom judge

To create a custom judge:

Navigate to AI Configs.
Click Create AI Config.
Select Judge as the AI Config type. Judge mode is a specialized configuration used only for evaluation.
Enter a name and key for the judge.
(Optional) Select a maintainer.
Click Create.

LaunchDarkly creates a new judge AI Config with a default evaluation configuration that you can update.

Manage custom judges

After you create a custom judge, you can update its configuration, manage its variations, and attach it to AI Config variations. Judges use the same editing interface as other AI Configs, with judge-specific settings and restrictions.

Configure judge settings

Judge-specific settings are defined at the AI Config level and apply to all variations of the judge.

To update judge settings:

Navigate to AI Configs and select the judge.
Open the judge details page.

From the judge details page, you can:

Change the evaluation metric key. If the key does not already exist, LaunchDarkly creates a new metric and displays a warning before you save.
Select the evaluation metric key to open the metric details page.
Configure score inversion to indicate whether a score of 0.0 represents good or bad quality. Use inversion for metrics such as toxicity, where lower scores indicate better outcomes.

Evaluation metric keys use the $ld:ai:judge: event prefix to identify how evaluation scores are stored and aggregated. Metric keys do not need to be unique across judges. Teams can reuse a metric key across multiple judges to intentionally aggregate evaluation results.
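
To make the effect of score inversion concrete, here is a minimal sketch of how application code might compare two scores for the same metric. The function name is hypothetical. The sketch assumes that an inverted metric treats lower scores as better, as in the toxicity example above, and a non-inverted metric treats higher scores as better.

```python
def is_improvement(previous: float, current: float, is_inverted: bool) -> bool:
    """Return True if `current` is a better score than `previous`.

    Assumption: for inverted metrics (for example, toxicity) lower is better;
    for all other metrics higher is better.
    """
    if is_inverted:
        return current < previous
    return current > previous
```

With this helper, a toxicity score that drops from 0.8 to 0.2 counts as an improvement when the metric is inverted, while the same change on an accuracy metric counts as a regression.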

Edit judge variations

Each judge includes one or more variations. You can edit judge variations similarly to other AI Configs, with the following restrictions:

You cannot view or edit model parameters or custom parameters.
You cannot attach tools.
You cannot attach judges to a judge.
The “Judges” section is hidden for judge variations.

These restrictions help preserve consistent evaluation behavior and prevent recursive or ambiguous evaluations.

Attach and manage judges

After creating a judge, attach it to one or more AI Config variations to evaluate model responses.

To attach a judge:

Navigate to the AI Config you want to evaluate.
Select the Variations tab.
Expand a variation.
In the “Judges” section, click Attach judges.
Select a judge.
Adjust the sampling percentage as needed.
Click Review and save.

After attaching a judge, you can:

Select the judge name to open the judge details page.
Select the evaluation metric key to open the metric details page.

Judges do not run independently. A judge evaluates responses only when it is attached to an AI Config variation and that variation receives live traffic.

You cannot attach multiple judges that produce the same evaluation metric key to a single variation. Each attached judge evaluates only the configured sampling percentage of responses.
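
The two rules above can be sketched in a few lines. This is an illustrative model only: the `AttachedJudge` class and both helper functions are hypothetical, and the random sampling shown here is an assumption about how a sampling percentage could be applied, not a description of how LaunchDarkly implements it.

```python
import random
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class AttachedJudge:
    """Hypothetical record of a judge attached to one variation."""
    judge_key: str
    metric_key: str
    sampling_percentage: float  # 0 to 100


def duplicate_metric_keys(judges: List[AttachedJudge]) -> List[str]:
    """Return metric keys produced by more than one judge on the variation."""
    counts = Counter(judge.metric_key for judge in judges)
    return sorted(key for key, count in counts.items() if count > 1)


def should_evaluate(judge: AttachedJudge, rng: random.Random) -> bool:
    """Decide whether to evaluate one response under the sampling percentage.

    rng.random() is in [0.0, 1.0), so 100 always evaluates and 0 never does.
    """
    return rng.random() * 100 < judge.sampling_percentage
```

Passing in a `random.Random` instance rather than using the module-level functions keeps the sampling decision reproducible in tests.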

Attaching a judge does not immediately produce data. Evaluation results appear only after the variation receives live traffic.

Use evaluation results

Evaluation results from custom judges appear throughout LaunchDarkly as standard AI metrics. Judges return structured results with a numeric score and brief reasoning.

You do not need to define output formatting. LaunchDarkly enforces structured output so evaluation results can be reliably recorded and displayed as metrics. Each evaluation metric produces a single score between 0.0 and 1.0.

View results in Monitoring and Metrics

To view evaluation results in Monitoring:

Navigate to the AI Config with the attached judge.
Select the Monitoring tab.
Use the metric dropdown to select the evaluation metric key.

Charts display average scores over time and update as evaluations run.

To view evaluation results from the Metrics page:

Navigate to Metrics.
Select the Judge metrics tab to filter metrics with the $ld:ai:judge: prefix.
Select a metric to view its details and trends.

Use evaluation metrics in guardrails and experiments

Evaluation metrics produced by custom judges behave like other AI metrics.

You can:

Use evaluation metrics as guardrails in guarded rollouts to pause or revert releases when quality degrades.
Select evaluation metrics as goals in experiments to compare AI Config variations.
Use judge scores in your application’s execution logic to enforce custom guardrails at runtime. This requires an evaluation sampling rate of 100 percent, so that every model response is evaluated.

This allows teams to apply quality controls through guarded rollouts and experiments, and to enforce additional safeguards directly in application code when required.
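
A runtime guardrail in application code might look like the following sketch. The `JudgeResult` class, the function names, and the threshold and fallback values are all hypothetical. How a judge result is obtained from the AI SDK is not shown here. The sketch fails closed: a response with no judge result is treated as failing, which is one reason a 100 percent sampling rate matters for this pattern.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeResult:
    """Hypothetical container mirroring the judge output: score plus reasoning."""
    score: float  # between 0.0 and 1.0
    reasoning: str


def passes_guardrail(
    result: Optional[JudgeResult], threshold: float, is_inverted: bool = False
) -> bool:
    """Return True if the score clears the threshold.

    A missing result fails closed. For inverted metrics, lower is better.
    """
    if result is None:
        return False
    if is_inverted:
        return result.score <= threshold
    return result.score >= threshold


def guarded_response(
    response: str,
    result: Optional[JudgeResult],
    threshold: float,
    fallback: str,
    is_inverted: bool = False,
) -> str:
    """Return the model response if it passes the guardrail, else the fallback."""
    if passes_guardrail(result, threshold, is_inverted):
        return response
    return fallback
```

For example, `guarded_response(text, result, threshold=0.7, fallback="Sorry, I can't help with that.")` returns the fallback whenever the score is below 0.7 or no evaluation ran.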

Reference: judge configuration and evaluation formats

Create judge payload format

When you create a judge, LaunchDarkly creates a new AI Config in judge mode using the name and key you provide.

// AIConfigPost
{
  key: "<string>",
  name: "<string>",
  maintainerId: "<string>",
  maintainerTeamKey: "<string>",

  mode: "judge",
  tags: ["ai", "judge"],

  evaluationMetricKey: "<string>",
  isInverted: false,

  defaultVariation: {
    key: "default",
    name: "Default",
    messages: [
      { role: "system", content: "..." },
      { role: "assistant", content: "..." }
    ]
  }
}
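
For teams that generate judge definitions in code, the payload shape above could be assembled with a small helper like the sketch below. The function name and parameters are hypothetical, and the helper only builds a dictionary. It does not call any LaunchDarkly API. It also assumes the two maintainer fields can be omitted when not set, which the format above does not confirm.

```python
from typing import Any, Dict, List, Optional


def build_judge_payload(
    key: str,
    name: str,
    evaluation_metric_key: str,
    messages: List[Dict[str, str]],
    is_inverted: bool = False,
    maintainer_id: Optional[str] = None,
    maintainer_team_key: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a dictionary shaped like the judge creation payload shown above."""
    payload: Dict[str, Any] = {
        "key": key,
        "name": name,
        "mode": "judge",
        "tags": ["ai", "judge"],
        "evaluationMetricKey": evaluation_metric_key,
        "isInverted": is_inverted,
        "defaultVariation": {
            "key": "default",
            "name": "Default",
            "messages": messages,
        },
    }
    # Assumption: maintainer fields are optional and left out when unset.
    if maintainer_id is not None:
        payload["maintainerId"] = maintainer_id
    if maintainer_team_key is not None:
        payload["maintainerTeamKey"] = maintainer_team_key
    return payload
```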

Judge output format

{
  "score": 0.0,
  "reasoning": "..."
}
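
Application code that handles judge results manually may want to validate this structure before using it. The following is a minimal sketch that parses a JSON string in the format above and checks that the score is a number between 0.0 and 1.0. The `JudgeOutput` class and `parse_judge_output` function are hypothetical names, and treating a missing `reasoning` field as an empty string is an assumption.

```python
import json
from dataclasses import dataclass


@dataclass
class JudgeOutput:
    """Hypothetical container for a parsed judge result."""
    score: float
    reasoning: str


def parse_judge_output(raw: str) -> JudgeOutput:
    """Parse and validate a judge result in the documented JSON shape."""
    data = json.loads(raw)
    score = data["score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    # Assumption: a missing reasoning field becomes an empty string.
    return JudgeOutput(score=float(score), reasoning=str(data.get("reasoning", "")))
```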

Evaluation event format

LaunchDarkly records evaluation results as numeric metric values.

{
  "$ld:ai:judge:<metricKey>": 0.0
}
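
As an illustration of the event shape above, this sketch builds the key-value pair for a given metric key and recovers the metric key from an event key. Both function names are hypothetical. Only the `$ld:ai:judge:` prefix and the single numeric value come from the format shown here.

```python
from typing import Dict

JUDGE_METRIC_PREFIX = "$ld:ai:judge:"


def build_evaluation_event(metric_key: str, score: float) -> Dict[str, float]:
    """Return a dictionary in the documented evaluation event shape."""
    return {f"{JUDGE_METRIC_PREFIX}{metric_key}": score}


def metric_key_from_event_key(event_key: str) -> str:
    """Strip the judge prefix from an event key, or raise if it is absent."""
    if not event_key.startswith(JUDGE_METRIC_PREFIX):
        raise ValueError("not a judge evaluation event key")
    return event_key[len(JUDGE_METRIC_PREFIX):]
```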

Privacy and data handling

Online evaluations run using your configured model provider credentials. LaunchDarkly does not store or share prompts, model responses, or evaluation data outside your project.

For more information, read AI Configs and information privacy.