Judges

Overview

This topic explains how to create and manage custom judges for online evaluations in AgentControl.

A judge is a specialized AgentControl config that uses a large language model (LLM) to evaluate responses and return numeric scores. These scores represent quality signals such as accuracy, relevance, toxicity, or custom signals you define. AgentControl’s online evaluations use judges to score config outputs in production.

Judge types

AgentControl offers three built-in judges, which you can configure to evaluate different response criteria. You can also create custom judges to evaluate additional or domain-specific quality signals using the same evaluation framework.

Built-in judges

The built-in judges are:

  • Accuracy, which assesses if a response is correct,
  • Relevance, which assesses if a response is scoped appropriately to the input, and
  • Toxicity, which assesses if a response has undesirable elements.

Custom judges

Custom judges let organizations define what quality means for their own products and domains. With custom judges, teams can:

  • Standardize and reuse evaluation logic across configs and environments
  • Measure domain-specific quality signals such as accuracy, relevance, or toxicity
  • Apply consistent quality criteria in monitoring, guarded rollouts, and experiments to detect regressions in production

You can attach custom judges to completion-mode config variations in the LaunchDarkly UI. For other variations, invoke a custom judge programmatically using the AI SDK. To learn more, read Online evaluations.

When you create a judge, LaunchDarkly provides a default evaluation configuration that you can customize for your use case. You can create a custom judge from the “Create AgentControl config” menu.

Access control and prerequisites

Custom judges use the same access control model as other configs. To create or edit a judge, you must have permission to create and update AgentControl configs in the project.

You must enable online evaluations for the LaunchDarkly project before evaluation metrics will record or display. If online evaluations are not enabled, you can attach a judge to a config variation but no evaluation results will appear.

Support for automatic evaluation metric recording starts in these SDK versions:

  • Python AI SDK version 0.14.0
  • Node.js AI SDK version 0.16.1

Your AI SDK must support online evaluations to use custom judges

To record evaluation results from custom judges, you must use a LaunchDarkly AI SDK version that includes online evaluations. If your SDK does not support online evaluations, you can attach custom judges but no evaluation metrics will appear.

If your application uses an earlier AI SDK version, judge results may still be available to application code, but evaluation metrics are not automatically recorded in LaunchDarkly. In these cases, teams can access judge properties directly and handle evaluation results manually.
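
As a rough illustration of the version requirement, the following sketch compares a version string against the minimum Python AI SDK version listed in this topic. The helper names `parse_version` and `supports_automatic_recording` and the simple `major.minor.patch` parsing are assumptions for illustration. They are not part of any LaunchDarkly SDK.

```python
# Minimum Python AI SDK version with automatic evaluation metric recording,
# as listed in this topic.
MIN_PYTHON_AI_SDK = "0.14.0"


def parse_version(version: str) -> tuple:
    """Parse a plain 'major.minor.patch' string into a tuple of ints.

    Assumption: the string has no pre-release or build suffix.
    """
    return tuple(int(part) for part in version.split("."))


def supports_automatic_recording(installed: str, minimum: str = MIN_PYTHON_AI_SDK) -> bool:
    """Return True if the installed version is at or above the minimum."""
    return parse_version(installed) >= parse_version(minimum)


# Hypothetical version strings, used only to exercise the helper.
for candidate in ("0.13.2", "0.14.0", "1.0.0"):
    print(candidate, supports_automatic_recording(candidate))
```

If the check fails, application code would need to handle evaluation results manually, as described above.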

Create a custom judge

To create a custom judge:

  1. Navigate to AgentControl.
  2. Click Create config.
  3. Choose Judge as the config type.
  4. Enter a name for the judge. The judge’s config and event keys generate automatically based on the name.
  5. In the “Provider” dropdown, choose an LLM provider.
  6. In the “Judge type” dropdown, choose “Custom”.
  7. (Optional) Add any views from the “Views” dropdown.
  8. Click Create. The judge’s config page appears.

After you create a judge, it appears in the Judges tab on the Configs page.

Configure custom judges

New custom judges start with a default evaluation configuration. After you create a custom judge, you can update this configuration, manage the judge’s variations, and attach the judge to config variations.

Judges use the same editing interface and most of the same configuration options as other AgentControl configs, with judge-specific settings and restrictions. To learn more, read Create config variations.

Judges also include additional configuration options. From the judge configuration page, you can:

  • Change or copy the judge’s event key. Evaluation metric keys use the $ld:ai:judge:EXAMPLE-JUDGE-KEY event prefix to identify how evaluation scores are stored and aggregated. Metric keys do not need to be unique across judges. Teams can reuse a metric key across multiple judges to aggregate evaluation results intentionally.
  • Set the desired direction for evaluation results. This defines which direction evaluation results should trend for a passing result. Configure this setting in the right navigation of the judge’s configuration page.

Judges also have the following restrictions:

  • You cannot attach tools to a judge.
  • You cannot attach judges to a judge.

These restrictions help maintain consistent evaluation behavior and prevent recursive or ambiguous evaluations.

By default, judges have one variation. To add more, click to collapse the first variation and then click Add variation.

Attach judges to config variations

Judges do not run independently. A judge evaluates responses only when it is attached to a config variation and that variation receives live traffic.

After you create a judge, attach it to one or more config variations to evaluate model responses.

To attach a judge to a config variation:

  1. Navigate to the config you want to evaluate.
  2. Select the Variations tab.
  3. Expand a variation.
  4. In the “Judges” section, click Attach judges.
  5. Select a judge.
  6. Adjust the sampling percentage as needed.
  7. Click Review and save.

After attaching a judge, you can:

  • Select the judge name to open the judge details page.
  • Select the evaluation metric key to open the metric details page.

You cannot attach multiple judges that produce the same evaluation metric key to a single variation. Evaluations respect the configured sampling percentage.

Attaching a judge does not immediately produce data. Evaluation results appear only after the variation receives live traffic.
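
To make the sampling percentage concrete, the sketch below shows one way a percentage-based sampling decision could work. It is an illustration only, not LaunchDarkly's implementation, and `should_evaluate` and its parameters are hypothetical names.

```python
import random


def should_evaluate(sampling_percentage: float, rng: random.Random) -> bool:
    """Decide whether to evaluate one response.

    Hypothetical helper: returns True for roughly `sampling_percentage`
    percent of calls.
    """
    if not 0.0 <= sampling_percentage <= 100.0:
        raise ValueError("sampling_percentage must be between 0 and 100")
    # rng.random() returns a float in [0.0, 1.0), so 0% never samples
    # and 100% always samples.
    return rng.random() * 100.0 < sampling_percentage


rng = random.Random(42)  # seeded only to make the sketch repeatable
evaluated = sum(should_evaluate(25.0, rng) for _ in range(10_000))
print(f"Evaluated {evaluated} of 10000 responses")
```

At a 25% sampling rate, roughly one in four responses is scored, so low-traffic variations can take longer to accumulate evaluation results.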

Use evaluation results

Evaluation results from custom judges appear throughout LaunchDarkly as standard AI metrics. Judges return structured results with a numeric score and brief reasoning.

You do not need to define output formatting. LaunchDarkly enforces structured output so evaluation results can be reliably recorded and displayed as metrics. Each evaluation metric produces a single score between 0.0 and 1.0.

View results

To view evaluation results in monitoring:

  1. Navigate to the config with the attached judge.
  2. Select the Monitoring tab.
  3. Use the metric dropdown to select the evaluation metric key.

Charts display average scores over time and update as evaluations run.

To view evaluation results from the Metrics page:

  1. Navigate to Metrics.
  2. Select the Judge metrics tab to filter metrics with the $ld:ai:judge: prefix.
  3. Select a metric to view its details and trends.

Use evaluation metrics in guardrails and experiments

Evaluation metrics produced by custom judges behave like other AI metrics.

You can:

  • Use evaluation metrics as guardrails in guarded rollouts to pause or revert releases when quality degrades.
  • Select evaluation metrics as goals in experiments to compare config variations.
  • Use judge scores in your application’s execution logic to enforce custom guardrails at runtime. This requires an evaluation sampling rate of 100%, so that every model response is evaluated.

This allows teams to apply quality controls through guarded rollouts and experiments, and to enforce additional safeguards directly in application code when required.
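
A runtime guardrail in application code might look like the following sketch. The names `passes_guardrail` and `guard_response`, the 0.7 threshold, and the fallback text are all hypothetical. The `higher_is_better` parameter is an assumed stand-in for the judge's desired direction setting.

```python
FALLBACK_RESPONSE = "Sorry, I can't help with that right now."  # hypothetical fallback text


def passes_guardrail(score: float, threshold: float, higher_is_better: bool = True) -> bool:
    """Return True if a judge score meets the threshold.

    `higher_is_better` is an assumption of this sketch: set it to False
    for signals where lower scores are the desired direction.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("judge scores are expected to be between 0.0 and 1.0")
    return score >= threshold if higher_is_better else score <= threshold


def guard_response(response: str, score: float, threshold: float = 0.7) -> str:
    """Return the model response, or the fallback if the score fails the guardrail."""
    return response if passes_guardrail(score, threshold) else FALLBACK_RESPONSE
```

A guardrail like this only covers every response when each response has a score, which is why the 100% sampling rate matters.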

Create judge payload format

When you create a judge, LaunchDarkly creates a new config in judge mode using the name and key you provide.

JSON: Judge configuration payload
{
  "key": "<string>",
  "name": "<string>",
  "maintainerId": "<string>",
  "maintainerTeamKey": "<string>",

  "mode": "judge",
  "tags": ["ai", "judge"],

  "evaluationMetricKey": "<string>",
  "isInverted": false,

  "defaultVariation": {
    "key": "default",
    "name": "Default",
    "messages": [
      { "role": "system", "content": "..." },
      { "role": "assistant", "content": "..." }
    ]
  }
}
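
As a sketch, a payload with this shape can be assembled and serialized in Python with the standard `json` module. The function name `build_judge_payload` and every argument value below are hypothetical placeholders. Only the field names come from the payload format above.

```python
import json


def build_judge_payload(key, name, maintainer_id, maintainer_team_key, metric_key):
    """Assemble a dict that mirrors the judge configuration payload shape."""
    return {
        "key": key,
        "name": name,
        "maintainerId": maintainer_id,
        "maintainerTeamKey": maintainer_team_key,
        "mode": "judge",
        "tags": ["ai", "judge"],
        "evaluationMetricKey": metric_key,
        "isInverted": False,
        "defaultVariation": {
            "key": "default",
            "name": "Default",
            # Message content is elided in the payload format, so it stays elided here.
            "messages": [
                {"role": "system", "content": "..."},
                {"role": "assistant", "content": "..."},
            ],
        },
    }


# All argument values below are hypothetical placeholders.
payload = build_judge_payload(
    key="example-judge",
    name="Example judge",
    maintainer_id="example-maintainer-id",
    maintainer_team_key="example-team",
    metric_key="example-metric-key",
)
print(json.dumps(payload, indent=2))
```

Serializing through `json.dumps` guarantees that every key and string is quoted, which hand-written payloads can easily get wrong.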

Judge output format

JSON: Judge output format
{
  "score": 0.0,
  "reasoning": "..."
}
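
If application code handles judge output directly, it can be useful to validate it before acting on it. The following sketch parses a JSON string with the shape above and checks that the score falls between 0.0 and 1.0. The function name `parse_judge_output` is hypothetical, and treating a missing `reasoning` field as an empty string is an assumption of this sketch.

```python
import json


def parse_judge_output(raw: str) -> tuple:
    """Parse a judge output JSON string into a (score, reasoning) tuple.

    Raises ValueError if the JSON is invalid, or if the score is missing,
    not numeric, or outside the 0.0 to 1.0 range.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    reasoning = data.get("reasoning", "")
    if not isinstance(reasoning, str):
        raise ValueError("reasoning must be a string")
    return float(score), reasoning
```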

Evaluation event format

LaunchDarkly records evaluation results as numeric metric values.

JSON: Judge evaluation event format
{
  "$ld:ai:judge:<metricKey>": 0.0
}
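
The following sketch shows how an event with this shape could be built from a metric key and a score, and how metric names could be filtered by the judge prefix. The helper names `judge_event` and `is_judge_metric` and the example metric key are hypothetical.

```python
JUDGE_METRIC_PREFIX = "$ld:ai:judge:"


def judge_event(metric_key: str, score: float) -> dict:
    """Build a dict shaped like the evaluation event format above."""
    if not metric_key:
        raise ValueError("metric_key must not be empty")
    return {f"{JUDGE_METRIC_PREFIX}{metric_key}": score}


def is_judge_metric(name: str) -> bool:
    """Return True if a metric name carries the judge prefix."""
    return name.startswith(JUDGE_METRIC_PREFIX)


# "example-metric" is a hypothetical metric key.
event = judge_event("example-metric", 0.5)
print(event)
```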

Run judge evaluations programmatically

In addition to attaching judges in the UI, you can evaluate arbitrary input and output pairs directly using the AI SDK and a judge key.

This approach:

  • Does not require attaching a judge to a completion-mode variation
  • Can be used to evaluate outputs from agent-based workflows
  • Lets you evaluate responses from external pipelines or custom application workflows

The following Python example shows how to evaluate input and output directly using a judge:

Python: direct judge evaluation
import asyncio
import ldclient
from ldclient.config import Config
from ldclient import Context
from ldai import LDAIClient, AICompletionConfigDefault

async def main():
    # Initialize LaunchDarkly
    ldclient.set_config(Config("YOUR_SDK_KEY"))
    ai_client = LDAIClient(ldclient.get())

    # Create evaluation context
    context = Context.builder("example-user-key").kind("user").build()

    # Fallback if judge is unavailable
    fallback = AICompletionConfigDefault(enabled=False)

    # Retrieve judge configuration
    judge = await ai_client.create_judge(
        "your-judge-key",
        context,
        fallback
    )

    if judge and judge.enabled:
        input_text = "User question"
        output_text = "Model response"

        result = await judge.evaluate(input_text, output_text)

        if result:
            print(result.to_dict())

asyncio.run(main())

For a complete, production-ready example including initialization checks and error handling, read the Python direct_judge_example.py on GitHub.

In this example:

  • create_judge() retrieves the judge configuration for the provided context.
  • evaluate() scores the input and output pair.
  • The returned result includes structured evaluation data such as scores and reasoning.

If you want to record evaluation scores as metrics associated with a config, explicitly track the returned evaluation scores in your application code using the config’s tracker. Programmatic judge evaluation does not automatically emit monitoring metrics.

Programmatic judge evaluation does not attach judges to variations in the UI and does not automatically enable Monitoring tab metrics, guarded rollout integration, or experiment metric selection.

Privacy and data handling

Online evaluations run using your configured model provider credentials. LaunchDarkly does not store or share prompts, model responses, or evaluation data outside your project.

For more information, read AgentControl and information privacy.