Judges

This topic explains how to create and manage custom judges for online evaluations in AgentControl.

A judge is a specialized AgentControl config that uses a large language model (LLM) to evaluate responses and return numeric scores. Each score represents a quality signal such as accuracy, relevance, toxicity, or a custom signal you define. AgentControl’s online evaluations use judges to score config outputs in production.

AgentControl uses two categories of judges to evaluate model responses: built-in judges and custom judges.

Built-in judges

AgentControl includes three built-in judges, each scoped to a specific quality signal:

Accuracy: checks that a response is correct.
Relevance: evaluates how well a response is scoped to the input.
Toxicity: assesses whether a response has undesirable elements.

By default, each built-in judge evaluates responses against the selected model’s understanding of that criterion rather than a LaunchDarkly-defined standard. For example, the model will decide what counts as accurate, relevant, or non-toxic. If you need evaluation criteria tailored to your organization, you can override the default behavior for any built-in judge with your own snippets, or use a custom judge.

Custom judges

A custom judge can evaluate any quality signal you can describe in natural language. Custom judges aren’t limited to edge cases; they extend evaluation to criteria the built-in judges don’t cover. Common use cases include:

Formatting and structure: checks that the response follows a required template or format.
Brand voice and tone: evaluates how well the response matches your organization’s tone and style guidelines.
Messaging compliance: confirms the response includes or avoids specific required language, such as taglines or disclaimers.
Domain-specific criteria: measures the response against standards unique to your product or industry.
External tool validation: verifies that the response correctly references or calls an expected tool.

You can attach custom judges to completion-mode and agent-mode config variations directly in the LaunchDarkly UI. You can also invoke a custom judge programmatically using the AI SDK. To learn more, read Online evaluations.

When you create a judge, LaunchDarkly provides a default evaluation configuration that you can customize for your use case. You can create a custom judge from the “Create AgentControl config” menu.

Access control and prerequisites

Custom judges use the same access control model as other configs. To create or edit a judge, you must have permission to create and update AgentControl configs in the project.

You must enable online evaluations for the LaunchDarkly project before evaluation metrics will record or display. If online evaluations are not enabled, you can attach a judge to a config variation but no evaluation results will appear.

Support for automatic evaluation metric recording starts in these SDK versions:

Python AI SDK version 0.14.0
Node.js AI SDK version 0.16.1

Your AI SDK must support online evaluations to use custom judges

To record evaluation results from custom judges, you must use a LaunchDarkly AI SDK version that includes online evaluations. If your SDK does not support online evaluations, you can attach custom judges but no evaluation metrics will appear.

If your application uses an earlier AI SDK version, judge results may still be available to application code, but evaluation metrics are not automatically recorded in LaunchDarkly. In these cases, teams can access judge properties directly and handle evaluation results manually.
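
As a rough illustration of the version requirement above, the following Python sketch compares an installed AI SDK version string against the minimum versions listed in this topic. The helper names are hypothetical, and the sketch assumes plain MAJOR.MINOR.PATCH version strings without pre-release suffixes.

```python
def parse_version(version: str) -> tuple:
    """Turn a plain 'MAJOR.MINOR.PATCH' string into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))

# Minimum versions quoted in this topic for automatic evaluation metric recording.
MIN_AUTO_RECORDING = {
    "python": (0, 14, 0),
    "node": (0, 16, 1),
}

def supports_auto_recording(sdk: str, installed_version: str) -> bool:
    """Return True if the installed AI SDK version meets the documented minimum."""
    return parse_version(installed_version) >= MIN_AUTO_RECORDING[sdk]

print(supports_auto_recording("python", "0.13.2"))  # False
print(supports_auto_recording("python", "0.14.0"))  # True
```

If the check returns False, judge results may still be available to your application code, but you need to handle them manually.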

Create a custom judge

To create a custom judge:

In the left sidebar, click Agents. The AgentControl menu appears.
Click Configs.
Click Create config.
Choose Judge as the config type.
Enter a name for the judge. The judge’s config and event keys generate automatically based on the name.
In the “Provider” dropdown, choose an LLM provider.
In the “Judge type” dropdown, choose “Custom”.
(Optional) Add any views from the “Views” dropdown.
Click Create. The judge’s config page appears.

After you create a judge, it appears in the Judges tab in the Configs page.

Configure custom judges

New custom judges start with a default evaluation configuration. After you create a custom judge, you can update this configuration, manage the judge’s variations, and attach the judge to config variations.

Judges use the same editing interface as other AgentControl configs and have similar configuration options, with judge-specific settings and restrictions. To learn more, read Create config variations.

Judges also include additional configuration options. From the judge configuration page, you can:

Change or copy the judge’s event key. Evaluation metric keys take the form $ld:ai:judge:EXAMPLE-JUDGE-KEY and determine how evaluation scores are stored and aggregated. Metric keys do not need to be unique across judges. You can intentionally reuse a metric key across multiple judges to aggregate their evaluation results.
Set the desired direction for evaluation results. This defines which direction evaluation results should trend for a passing result. Configure this setting in the right sidebar of the judge’s configuration page.
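
To illustrate what a desired direction means in practice, the following Python sketch compares two scores under either direction. It is not LaunchDarkly’s implementation: the function name and the higher_is_better flag are hypothetical, and how this setting maps to fields in the judge configuration is not something this sketch assumes.

```python
def is_improvement(previous: float, current: float, higher_is_better: bool = True) -> bool:
    """Return True if the score moved in the desired direction."""
    if higher_is_better:
        return current > previous
    # For a metric where lower scores are better, a drop is the improvement.
    return current < previous

# A hypothetical quality metric where higher scores are better.
print(is_improvement(0.6, 0.8))  # True

# A hypothetical metric where lower scores are better.
print(is_improvement(0.3, 0.1, higher_is_better=False))  # True
```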

Judges also have the following restrictions:

You cannot attach tools to a judge.
You cannot attach judges to a judge.

These restrictions help maintain consistent evaluation behavior and prevent recursive or ambiguous evaluations.

By default, judges have one variation. To add more, click to collapse the first variation and then click Add variation.

Attach judges to config variations

Judges do not run independently. A judge evaluates responses only when it is attached to a config variation and that variation receives live traffic.

After you create a judge, attach it to one or more config variations to evaluate model responses.

To attach a judge to a config variation:

Navigate to the config you want to evaluate.
Select the Variations tab.
Expand a variation.
At the bottom of the variation, click Add judges.
Select a judge, then click Add 1 judge.
Adjust the sampling percentage as needed.
Click Review and save.
In the “Review changes” dialog, click Save changes.

After attaching a judge, you can:

Select the judge name to open the judge details page.
Select the evaluation metric key to open the metric details page.

You cannot attach multiple judges that produce the same evaluation metric key to a single variation. Evaluations respect the configured sampling percentage.
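
The following Python sketch shows one common way a sampling percentage can be applied: each response is evaluated with a probability equal to the configured percentage. This is an illustration only. The function name is hypothetical, and LaunchDarkly’s actual sampling logic may differ.

```python
import random

def should_evaluate(sampling_percentage: float, rng: random.Random) -> bool:
    """Decide whether to run the judge for a single response."""
    # rng.random() returns a float in [0.0, 1.0), so 0 never samples
    # and 100 always samples.
    return rng.random() * 100 < sampling_percentage

rng = random.Random(42)  # seeded only to make the demonstration repeatable
evaluated = sum(should_evaluate(25, rng) for _ in range(10_000))
print(f"{evaluated} of 10000 responses evaluated")
```

With a sampling percentage below 100, some responses receive no score, which matters if your application logic depends on every response being evaluated.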

Attaching a judge does not immediately produce data. Evaluation results appear only after the variation receives live traffic.

Use evaluation results

Evaluation results from custom judges appear throughout LaunchDarkly as standard AI metrics. Judges return structured results with a numeric score and brief reasoning.

You do not need to define output formatting. LaunchDarkly enforces structured output so evaluation results can be reliably recorded and displayed as metrics. Each evaluation metric produces a single score between 0.0 and 1.0.

View results

To view evaluation results in monitoring:

Navigate to the config with the attached judge.
Select the Monitoring tab.
Use the metric dropdown to select the evaluation metric key.

Charts display average scores over time and update as evaluations run.

To view evaluation results from the Metrics page:

Navigate to Metrics.
Select the Judge metrics tab to filter metrics with the $ld:ai:judge: prefix.
Select a metric to view its details and trends.

Use evaluation metrics in guardrails and experiments

Evaluation metrics produced by custom judges behave like other AI metrics.

You can:

Use evaluation metrics as guardrails in guarded rollouts to pause or revert releases when quality degrades.
Select evaluation metrics as goals in experiments to compare config variations.
Use judge scores in your application’s execution logic to enforce custom guardrails at runtime. This requires an evaluation sampling rate of 100%, so that every model response is evaluated.

This allows teams to apply quality controls through guarded rollouts and experiments, and to enforce additional safeguards directly in application code when required.
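
As a minimal sketch of a runtime guardrail in application code, the following Python example returns a model response only if its judge score meets a threshold. The function name, the threshold of 0.7, and the fallback message are all hypothetical. The judge result is assumed to be a dictionary with the score and reasoning fields described in this topic.

```python
from typing import Optional

FALLBACK_MESSAGE = "Sorry, I can't help with that right now."  # hypothetical

def apply_guardrail(response: str, judge_result: Optional[dict], min_score: float = 0.7) -> str:
    """Return the response only if the judge scored it at or above min_score."""
    if judge_result is None:
        # No evaluation is available, for example because the response
        # was not sampled. This sketch fails closed.
        return FALLBACK_MESSAGE
    if judge_result["score"] >= min_score:
        return response
    return FALLBACK_MESSAGE

print(apply_guardrail("Here is your answer.", {"score": 0.9, "reasoning": "On topic."}))
print(apply_guardrail("Here is your answer.", {"score": 0.2, "reasoning": "Off topic."}))
```

Failing closed when no score is available is a design choice. It is also why a guardrail like this depends on a 100% sampling rate: at lower rates, unsampled responses would always receive the fallback.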

Reference: judge configuration and evaluation formats

Create judge payload format

When you create a judge, LaunchDarkly creates a new config in judge mode using the name and key you provide.

JSON: Judge configuration payload

{
  "key": "<string>",
  "name": "<string>",
  "maintainerId": "<string>",
  "maintainerTeamKey": "<string>",

  "mode": "judge",
  "tags": ["ai", "judge"],

  "evaluationMetricKey": "<string>",
  "isInverted": false,

  "defaultVariation": {
    "key": "default",
    "name": "Default",
    "messages": [
      { "role": "system", "content": "..." },
      { "role": "assistant", "content": "..." }
    ]
  }
}
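
The following Python sketch assembles a dictionary shaped like the payload above and prints it as JSON. The function name and all example values are hypothetical. The sketch leaves out the maintainer fields and the assistant message for brevity, and it does not send the payload anywhere.

```python
import json

def build_judge_payload(key, name, metric_key, system_prompt, is_inverted=False):
    """Assemble a judge-mode config payload shaped like the format above."""
    return {
        "key": key,
        "name": name,
        "mode": "judge",
        "tags": ["ai", "judge"],
        "evaluationMetricKey": metric_key,
        "isInverted": is_inverted,
        "defaultVariation": {
            "key": "default",
            "name": "Default",
            "messages": [{"role": "system", "content": system_prompt}],
        },
    }

# Hypothetical example values.
payload = build_judge_payload(
    key="brand-voice-judge",
    name="Brand voice judge",
    metric_key="brand-voice",
    system_prompt="Score how well the response matches our tone guidelines.",
)
print(json.dumps(payload, indent=2))
```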

Judge output format

JSON: Judge output format

{
  "score": 0.0,
  "reasoning": "..."
}
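
If you handle judge output manually, it can help to validate it before use. The following Python sketch parses a JSON string in the format above and checks that the score falls between 0.0 and 1.0. The JudgeOutput class and parse_judge_output function are hypothetical helpers, not part of the AI SDK.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class JudgeOutput:
    score: float
    reasoning: str

def parse_judge_output(raw: str) -> JudgeOutput:
    """Parse judge output JSON and check that the score is within range."""
    data = json.loads(raw)
    score = float(data["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score {score} is outside the range 0.0 to 1.0")
    return JudgeOutput(score=score, reasoning=str(data["reasoning"]))

result = parse_judge_output('{"score": 0.85, "reasoning": "Follows the template."}')
print(result.score, result.reasoning)
```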

Evaluation event format

LaunchDarkly records evaluation results as numeric metric values.

JSON: Judge evaluation event format

{
  "$ld:ai:judge:<metricKey>": 0.0
}
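
The following Python sketch shows how a metric key and a score combine into the event shape above, and how to recognize judge metrics by their prefix. The helper names and the example metric key are hypothetical.

```python
JUDGE_METRIC_PREFIX = "$ld:ai:judge:"

def to_evaluation_event(metric_key: str, score: float) -> dict:
    """Build a dictionary in the evaluation event format shown above."""
    return {f"{JUDGE_METRIC_PREFIX}{metric_key}": score}

def is_judge_metric(event_key: str) -> bool:
    """Return True if the event key uses the judge metric prefix."""
    return event_key.startswith(JUDGE_METRIC_PREFIX)

event = to_evaluation_event("brand-voice", 0.85)
print(event)  # {'$ld:ai:judge:brand-voice': 0.85}
```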

Run judge evaluations programmatically

In addition to attaching judges in the UI, you can evaluate arbitrary input and output pairs directly using the AI SDK and a judge key.

This approach:

Does not require attaching a judge to a completion-mode variation.
Can be used to evaluate outputs from agent-based workflows.
Lets you evaluate responses from custom pipelines, external systems, or application workflows outside the standard config setup.

The following Python example shows how to evaluate input and output directly using a judge:

Python: direct judge evaluation

import asyncio
import ldclient
from ldclient.config import Config
from ldclient import Context
from ldai import LDAIClient, AICompletionConfigDefault

async def main():
    # Initialize LaunchDarkly
    ldclient.set_config(Config("YOUR_SDK_KEY"))
    ai_client = LDAIClient(ldclient.get())

    # Create evaluation context
    context = Context.builder("example-user-key").kind("user").build()

    # Fallback if judge is unavailable
    fallback = AICompletionConfigDefault(enabled=False)

    # Retrieve judge configuration
    judge = await ai_client.create_judge(
        "your-judge-key",
        context,
        fallback
    )

    if judge and judge.enabled:
        input_text = "User question"
        output_text = "Model response"

        result = await judge.evaluate(input_text, output_text)

        if result:
            print(result.to_dict())

asyncio.run(main())

For a complete, production-ready example including initialization checks and error handling, read the Python create_judge example on GitHub.

In this example:

create_judge() retrieves the judge configuration for the provided context.
evaluate() scores the input and output pair.
The returned result includes structured evaluation data such as scores and reasoning.

Programmatic judge evaluation does not attach judges to variations in the UI. It also does not automatically emit monitoring metrics or enable Monitoring tab metrics, guarded rollout integration, or experiment metric selection.

If you want to record evaluation scores as metrics associated with a config, explicitly track the returned evaluation scores in your application code using the config’s tracker.

Privacy and data handling

Online evaluations run using your configured model provider credentials. LaunchDarkly does not store or share prompts, model responses, or evaluation data outside your project.

For more information, read AgentControl and information privacy.