This topic explains how to create and manage custom judges for online evaluations in AgentControl.
A judge is a specialized AgentControl config that uses a large language model (LLM) to evaluate responses and return numeric scores. These scores represent quality signals such as accuracy, relevance, toxicity, or custom signals you define. AgentControl’s online evaluations use judges to score config outputs in production.
AgentControl offers three built-in judges, which you can configure to evaluate different response criteria. You can also create custom judges to evaluate additional or domain-specific quality signals using the same evaluation framework.
The built-in judges are:
Custom judges let organizations define what quality means for their own products and domains. With custom judges, teams can:
You can attach custom judges to completion-mode config variations in the LaunchDarkly UI. For other variations, invoke a custom judge programmatically using the AI SDK. To learn more, read Online evaluations.
When you create a judge, LaunchDarkly provides a default evaluation configuration that you can customize for your use case. You can create a custom judge from the “Create AgentControl config” menu.
Custom judges use the same access control model as other configs. To create or edit a judge, you must have permission to create and update AgentControl configs in the project.
You must enable online evaluations for the LaunchDarkly project before evaluation metrics are recorded or displayed. If online evaluations are not enabled, you can still attach a judge to a config variation, but no evaluation results will appear.
Support for automatic evaluation metric recording starts in these SDK versions:
To record evaluation results from custom judges, you must use a LaunchDarkly AI SDK version that includes online evaluations. If your SDK does not support online evaluations, you can attach custom judges but no evaluation metrics will appear.
If your application uses an earlier AI SDK version, judge results may still be available to application code, but evaluation metrics are not automatically recorded in LaunchDarkly. In these cases, teams can access judge properties directly and handle evaluation results manually.
To create a custom judge:
After you create a judge, it appears in the Judges tab on the Configs page.
New custom judges appear with a default evaluation configuration that you can update.
After you create a custom judge, you can update its configuration, manage its variations, and attach it to config variations. Judges use the same editing interface as other configs, with judge-specific settings and restrictions.
Judges have similar configuration options to other AgentControl configs. To learn more, read Create config variations.
Judges also include additional configuration options. From the judge configuration page, you can:
$ld:ai:judge:EXAMPLE-JUDGE-KEY event prefix to identify how evaluation scores are stored and aggregated. Metric keys do not need to be unique across judges. Teams can reuse a metric key across multiple judges to aggregate evaluation results intentionally.
Judges also have the following restrictions:
These restrictions help maintain consistent evaluation behavior and prevent recursive or ambiguous evaluations.
By default, judges have one variation. To add more, click to collapse the first variation and then click Add variation.
Judges do not run independently. A judge evaluates responses only when it is attached to a config variation and that variation receives live traffic.
After you create a judge, attach it to one or more config variations to evaluate model responses.
To attach a judge to a config variation:
After attaching a judge, you can:
You cannot attach multiple judges that produce the same evaluation metric key to a single variation. Evaluations respect the configured sampling percentage.
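The two rules above can be illustrated with a short sketch. This is not LaunchDarkly's implementation: `AttachedJudge`, `validate_attachments`, and `should_evaluate` are hypothetical names, and modeling sampling as an independent random draw per request is an assumption made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class AttachedJudge:
    """Hypothetical record of a judge attached to one variation."""
    judge_key: str
    metric_key: str
    sampling_percentage: float  # 0 to 100


def validate_attachments(judges: list[AttachedJudge]) -> None:
    """Raise ValueError if two judges on one variation share a metric key."""
    seen: dict[str, str] = {}  # metric key -> judge key that already uses it
    for judge in judges:
        if judge.metric_key in seen:
            raise ValueError(
                f"{judge.judge_key} and {seen[judge.metric_key]} both produce "
                f"metric key {judge.metric_key!r}"
            )
        seen[judge.metric_key] = judge.judge_key


def should_evaluate(judge: AttachedJudge, rng: random.Random) -> bool:
    """Assumed sampling model: evaluate about sampling_percentage of requests."""
    return rng.random() * 100 < judge.sampling_percentage
```

In this model, a sampling percentage of 100 evaluates every request and a percentage of 0 evaluates none.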
Attaching a judge does not immediately produce data. Evaluation results appear only after the variation receives live traffic.
Evaluation results from custom judges appear throughout LaunchDarkly as standard AI metrics. Judges return structured results with a numeric score and brief reasoning.
You do not need to define output formatting. LaunchDarkly enforces structured output so evaluation results can be reliably recorded and displayed as metrics. Each evaluation metric produces a single score between 0.0 and 1.0.
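If your application handles judge results itself, it can help to model the result shape explicitly. The sketch below is an assumption about how you might represent a result in your own code: `JudgeResult` and its field names are hypothetical and are not SDK types. The 0.0 to 1.0 range is treated as inclusive here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeResult:
    """Hypothetical container for one evaluation metric's result."""
    metric_key: str
    score: float
    reasoning: str

    def __post_init__(self) -> None:
        # Reject scores outside the documented 0.0 to 1.0 range.
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score {self.score} is outside 0.0 to 1.0")


# Example values are illustrative only.
result = JudgeResult(
    metric_key="$ld:ai:judge:EXAMPLE-JUDGE-KEY",
    score=0.85,
    reasoning="The response answers the question directly.",
)
```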
To view evaluation results in monitoring:
Charts display average scores over time and update as evaluations run.
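To make "average scores over time" concrete, the sketch below groups scores into time buckets and averages each bucket. The hourly bucket size and the function name `average_by_hour` are assumptions for illustration; the bucketing that LaunchDarkly charts actually use is not described here.

```python
from collections import defaultdict
from collections.abc import Iterable
from datetime import datetime, timezone
from statistics import fmean


def average_by_hour(
    samples: Iterable[tuple[datetime, float]],
) -> dict[datetime, float]:
    """Group (timestamp, score) pairs by hour and average each group."""
    buckets: dict[datetime, list[float]] = defaultdict(list)
    for timestamp, score in samples:
        # Truncate the timestamp to the start of its hour.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(score)
    return {hour: fmean(scores) for hour, scores in sorted(buckets.items())}


# Illustrative data: two evaluations in one hour, one in the next.
samples = [
    (datetime(2025, 1, 1, 10, 5, tzinfo=timezone.utc), 0.5),
    (datetime(2025, 1, 1, 10, 40, tzinfo=timezone.utc), 1.0),
    (datetime(2025, 1, 1, 11, 15, tzinfo=timezone.utc), 0.25),
]
hourly = average_by_hour(samples)
```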
To view evaluation results from the Metrics page:
$ld:ai:judge: prefix.
Evaluation metrics produced by custom judges behave like other AI metrics.
You can:
This allows teams to apply quality controls through guarded rollouts and experiments, and to enforce additional safeguards directly in application code when required.
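As one example of a safeguard in application code, you might withhold a response when its score falls below a threshold. This is a minimal sketch under stated assumptions: `apply_safeguard`, the 0.7 threshold, and the fallback text are hypothetical, and the sketch assumes that a higher score means better quality for the metric in question.

```python
def apply_safeguard(
    response: str,
    score: float,
    threshold: float = 0.7,  # hypothetical cutoff; tune for your metric
    fallback: str = "Sorry, I can't help with that right now.",
) -> str:
    """Return the response only if its score meets the threshold."""
    # Assumes higher scores indicate better quality for this metric.
    return response if score >= threshold else fallback
```

A gate like this runs in your application after the judge returns, so it adds the judge's latency to the request path.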
When you create a judge, LaunchDarkly creates a new config in judge mode using the name and key you provide.
LaunchDarkly records evaluation results as numeric metric values.
In addition to attaching judges in the UI, you can evaluate arbitrary input and output pairs directly using the AI SDK and a judge key.
This approach:
The following Python example shows how to evaluate input and output directly using a judge:
For a complete, production-ready example including initialization checks and error handling, read the Python create_judge example on GitHub.
In this example:
create_judge() retrieves the judge configuration for the provided context.
evaluate() scores the input and output pair.
If you want to record evaluation scores as metrics associated with a config, explicitly track the returned evaluation scores in your application code using the config’s tracker. Programmatic judge evaluation does not automatically emit monitoring metrics.
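The pattern of evaluating a pair and then tracking the returned scores explicitly can be sketched with stand-in objects. Nothing below is the LaunchDarkly AI SDK: `StubJudge`, `RecordingTracker`, `record_score`, and `evaluate_and_track` are hypothetical, and the real `evaluate()` signature, return type, and tracker methods may differ (for example, they may be asynchronous). Consult the SDK reference for the actual calls.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Judge(Protocol):
    """Shape assumed for a judge object; not the SDK's real type."""

    def evaluate(self, input_text: str, output_text: str) -> dict[str, float]: ...


@dataclass
class StubJudge:
    """Stand-in that returns a fixed score instead of calling an LLM."""
    metric_key: str
    fixed_score: float

    def evaluate(self, input_text: str, output_text: str) -> dict[str, float]:
        return {self.metric_key: self.fixed_score}


@dataclass
class RecordingTracker:
    """Stand-in for a config's tracker; stores scores in memory."""
    recorded: list[tuple[str, float]] = field(default_factory=list)

    def record_score(self, metric_key: str, score: float) -> None:
        self.recorded.append((metric_key, score))


def evaluate_and_track(
    judge: Judge, tracker: RecordingTracker, input_text: str, output_text: str
) -> dict[str, float]:
    """Score one input/output pair, then record each score explicitly."""
    scores = judge.evaluate(input_text, output_text)
    for metric_key, score in scores.items():
        # Nothing is recorded unless the application does it here.
        tracker.record_score(metric_key, score)
    return scores
```

Keeping the tracking step in one helper makes it harder to evaluate a pair and forget to record the result.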
Programmatic judge evaluation does not attach judges to variations in the UI and does not automatically enable Monitoring tab metrics, guarded rollout integration, or experiment metric selection.
Online evaluations run using your configured model provider credentials. LaunchDarkly does not store or share prompts, model responses, or evaluation data outside your project.
For more information, read AgentControl and information privacy.