Evaluate LLM code generation with LLM-as-judge evaluators

Published March 3, 2026


by Scarlett Attensil

Which AI model writes the best code for your codebase? Not “best” in general, but best for your security requirements, your API schemas, and your team’s blind spots.

This tutorial shows you how to score every code generation response against custom criteria you define. You’ll set up custom judges that check for the vulnerabilities you actually care about, validate against your real API conventions, and flag the scope creep patterns your team keeps running into. After a few weeks of data, you’ll have evidence to choose which model to use for which tasks.

What you will build

In this tutorial you build a proxy server that routes Claude Code requests through LaunchDarkly. You can forward requests to any model: Anthropic, OpenAI, Mistral, or local Ollama instances. Every response gets scored by custom judges you create.

You will build three judges:

  • Security: Checks for SQL injection, XSS, hardcoded secrets, and the specific vulnerabilities you care about
  • API contract: Validates code against your schema conventions
  • Minimal change: Flags scope creep and unnecessary modifications

After setup, you use Claude Code normally, and scores flow to the LaunchDarkly Monitoring dashboard automatically. Over time, you build a dataset grounded in your actual usage: maybe Sonnet scores consistently higher on security, but Opus handles API contract adherence better on complex endpoints. That’s the kind of answer a generic benchmark can’t give you.

To learn more, read Online evaluations or watch the Introducing Judges video tutorial.

Prerequisites

  • LaunchDarkly account with AI Configs enabled
  • Python 3.9+
  • LaunchDarkly Python AI SDK v0.14.0+ (launchdarkly-server-sdk-ai)
  • API keys for your model providers
  • Claude Code installed

How the proxy works

This proxy implements a minimal Anthropic Messages-style gateway for text-only code generation and automatic quality scoring.

When Claude Code sends a request to POST /v1/messages, the proxy:

  1. Extracts text-only prompts. It converts the Anthropic Messages body into LaunchDarkly LDMessages, keeping only text content. It ignores tool blocks, images, and other non-text content.

  2. Routes the request through LaunchDarkly AI Configs. The proxy creates a context with a selectedModel attribute. Your model-selector AI Config uses targeting rules on this attribute to pick the right model variation.

  3. Invokes the model and triggers judges. The proxy calls chat.invoke(). If the selected variation has judges attached, the SDK schedules judge evaluations automatically based on your sampling rate. Scores flow to LaunchDarkly Monitoring.

  4. Returns a standard Messages response. The proxy sends back the assistant response as a single text block, plus basic token usage if available.

Claude Code talks to a local /v1/messages endpoint. LaunchDarkly handles model selection and online evaluations behind the scenes.
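For reference, a minimal text-only request in the Anthropic Messages shape looks like this, sketched as a plain Python dict (the model name and prompt are illustrative):

```python
import json

# Illustrative Anthropic Messages-style request body. The proxy keeps only
# the text blocks and converts them to LDMessages; tool calls and images
# are dropped.
body = {
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "system": "You are a helpful coding assistant.",
    "messages": [
        {
            "role": "user",
            "content": [{"type": "text", "text": "Write a function to parse a CSV row."}],
        }
    ],
}
payload = json.dumps(body)
```

Claude Code POSTs a body like this to /v1/messages; the proxy's job is to flatten it into the text-only message list that LaunchDarkly expects.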

Create the AI Config and judges

You can use the LaunchDarkly dashboard or Claude Code with agent skills. Agent skills are faster if you have them installed.¹

Option A: Agent skills

Create the project:

/aiconfig-projects Create a project called "custom-evals-claude-code"

Create the model selector:

/aiconfig-create
Create a completion mode AI Config:
- Key: model-selector
- Name: Model Selector
- Project: custom-evals-claude-code
Three variations (empty messages, this is a router):
1. "sonnet" - Anthropic claude-sonnet-4-6
2. "opus" - Anthropic claude-opus-4-6
3. "mistral" - Mistral mistral-large@2407

Create the security judge:

/aiconfig-create
Create a judge AI Config with:
- Key: security-judge
- Name: Security Judge
- Project: custom-evals-claude-code
- Evaluation metric key: $ld:ai:judge:security
System prompt:
"You are a security auditor evaluating AI-generated code for vulnerabilities.
Analyze the assistant's response and score it from 0.0 to 1.0:
SCORING CRITERIA:
- 1.0: No security issues detected. Code follows security best practices.
- 0.7-0.9: Minor issues that pose low risk.
- 0.4-0.6: Moderate issues requiring attention.
- 0.1-0.3: Serious vulnerabilities present (SQL injection, XSS, command injection).
- 0.0: Critical vulnerabilities that could lead to immediate compromise.
CHECK FOR:
- Injection flaws (SQL, command, LDAP)
- Cross-site scripting (XSS)
- Hardcoded secrets or credentials
- Insecure file operations
- Missing input validation
If no code is present, return 1.0."
Use model gpt-5-mini with temperature 0.3.
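To make the scoring criteria concrete, here's the kind of contrast this judge is designed to catch, using SQL injection as an example (a self-contained sketch against an in-memory SQLite database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

user_input = "alice' OR '1'='1"

# Vulnerable pattern (rubric: 0.1-0.3): user input concatenated into SQL,
# so the injected OR clause matches every row.
leaked = conn.execute(
    f"SELECT * FROM users WHERE name = '{user_input}'"
).fetchall()

# Safe pattern (rubric: 1.0): a parameterized query treats the input as a
# literal string, so the injection attempt matches nothing.
safe = conn.execute(
    "SELECT * FROM users WHERE name = ?", (user_input,)
).fetchall()
```

The vulnerable query returns every user; the parameterized one returns no rows. This is exactly the distinction the judge's "injection flaws" criterion rewards.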

Create the API contract judge:

/aiconfig-create
Create a judge AI Config with:
- Key: api-contract-judge
- Name: API Contract Adherence
- Project: custom-evals-claude-code
- Evaluation metric key: $ld:ai:judge:api-contract-adherence
System prompt:
"You are an API contract auditor. Evaluate whether AI-generated code adheres to the API schema.
SCORING CRITERIA:
- 1.0: Code fully complies with expected patterns.
- 0.5: Partial adherence with minor deviations.
- 0.0: Invalid format or significant violations.
If no API code is present, return 1.0."
Use model gpt-5-mini with temperature 0.3.

Create the minimal change judge:

/aiconfig-create
Create a judge AI Config with:
- Key: minimal-change-judge
- Name: Minimal Change Judge
- Project: custom-evals-claude-code
- Evaluation metric key: $ld:ai:judge:minimal-change
System prompt:
"You are a code review auditor focused on change scope. Evaluate whether the AI assistant made only necessary changes.
SCORING CRITERIA:
- 1.0: Changes are precisely scoped to the request. No unnecessary modifications.
- 0.5: Some unnecessary additions (reformatting unrelated code, extra comments).
- 0.0: Significant scope creep (rewriting large sections, architectural changes not requested).
FLAG THESE UNNECESSARY CHANGES:
- Reformatting code not part of the request
- Adding type annotations to unchanged functions
- Inserting unrequested comments or docstrings
- Renaming variables outside the scope of the fix
If no code changes present, return 1.0."
Use model gpt-5-mini with temperature 0.3.

Attach judges to the model selector:

/aiconfig-online-evals
Attach to all model-selector variations at 100% sampling:
- security-judge
- api-contract-judge
- minimal-change-judge

Set up targeting:

For each AI Config, go to the Targeting tab and edit the default rule to serve the variation you created. For the model selector, also add rules that match the selectedModel context attribute:

/aiconfig-targeting
For each judge (security-judge, api-contract-judge, minimal-change-judge):
- Set the default rule to serve the variation you created
For model-selector:
- Rule: if selectedModel contains "sonnet", serve Sonnet variation
- Rule: if selectedModel contains "mistral", serve Mistral variation
- Default rule: Opus variation

When the proxy sends selectedModel: "sonnet", LaunchDarkly returns the Sonnet variation. To learn more, read Target with AI Configs.
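The targeting rules above amount to this routing logic, sketched in plain Python. LaunchDarkly evaluates the real rules server-side; this is only to show the fall-through order:

```python
def route(selected_model: str) -> str:
    """Mirror of the model-selector targeting rules (illustrative only)."""
    if "sonnet" in selected_model:
        return "sonnet"   # Rule: selectedModel contains "sonnet"
    if "mistral" in selected_model:
        return "mistral"  # Rule: selectedModel contains "mistral"
    return "opus"         # Default rule
```

Any value that matches no rule, including an empty string, falls through to the Opus default.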

Option B: LaunchDarkly dashboard

Step 1: Create the model selector config

  1. Go to AI Configs and click Create AI Config.
  2. Set the mode to Completion, the key to model-selector, and name it “Model Selector”.
  3. Add three variations with empty messages (this config acts as a router):
    • Sonnet (key: sonnet) using claude-sonnet-4-6
    • Opus (key: opus) using claude-opus-4-6
    • Mistral (key: mistral) using mistral-large@2407

Model Selector AI Config showing three variations: Sonnet, Opus, and Mistral with their corresponding model names.

Model Selector with three variations for different models

Step 2: Create the judge AI Configs

  1. Click Create AI Config and set the mode to Judge.
  2. Set the key (for example, security-judge) and name (for example, “Security Judge”).
  3. Set the Event key to the metric you want to track (for example, $ld:ai:judge:security).
  4. Add the system prompt with scoring criteria from the prompts in Option A.
  5. Set the model to gpt-5-mini with temperature 0.3.
  6. Repeat for each judge: security, API contract adherence, and minimal change.

Judge AI Config creation form showing mode set to Judge, event key field, system prompt with scoring criteria, and model configuration.

Judge AI Config with event key and scoring criteria

Step 3: Attach judges to the model selector

  1. Open the Model Selector AI Config and go to the Variations tab.
  2. Expand a variation (for example, Sonnet) and find the Judges section.
  3. Click Attach judges.

Model Selector variation expanded showing the Judges section with an Attach judges button.

Expand a variation to find the Judges section
  4. Select the judges you created and set the sampling percentage to 100%.
  5. Repeat for each variation.

Judge selection dropdown showing available judges with checkboxes, event keys, and sampling percentage fields.

Select judges and set sampling percentage

Step 4: Configure targeting rules

  1. Go to the Targeting tab for the Model Selector.
  2. Add rules to route requests based on the selectedModel context attribute:
    • If selectedModel is mistral, serve the Mistral variation
    • If selectedModel is sonnet, serve the Sonnet variation
    • Default rule: serve Opus
  3. For each judge, set the default rule to serve the variation you created.

Targeting tab showing rules that route selectedModel values to the corresponding variations, with Opus as the default.

Targeting rules route requests to the correct model variation

To learn more, read Custom judges.

Verify your setup

Before running the proxy, confirm in the dashboard:

  1. Model selector: Each variation shows three attached judges.
  2. Judges: Each judge prompt includes scoring criteria.
  3. Targeting: All AI Configs have targeting enabled with correct rules.

Set up the project

Create a directory and install dependencies:

$ mkdir custom-evals && cd custom-evals
$ python -m venv .venv && source .venv/bin/activate
$ pip install fastapi uvicorn launchdarkly-server-sdk launchdarkly-server-sdk-ai \
    launchdarkly-server-sdk-ai-langchain langchain-anthropic python-dotenv

Create .env:

LD_SDK_KEY=sdk-your-sdk-key-here
LD_AI_CONFIG_KEY=model-selector
MODEL_KEY=sonnet
ANTHROPIC_API_KEY=sk-ant-your-key-here
OPENAI_API_KEY=sk-your-key-here
PORT=9911

Build the proxy server

Create server.py with the following code.

"""
Proxy server for Claude Code with automatic quality scoring.

Routes requests through LaunchDarkly AI Configs and scores every response
with attached judges. Metrics flow to the LaunchDarkly Monitoring dashboard.
"""

import asyncio
import os
import logging
import uuid

import ldclient
from ldclient import Context
from ldai import AICompletionConfigDefault, LDAIClient, LDMessage
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import uvicorn

from dotenv import load_dotenv
load_dotenv()

LD_SDK_KEY = os.environ.get("LD_SDK_KEY")
LD_AI_CONFIG_KEY = os.environ.get("LD_AI_CONFIG_KEY", "model-selector")
PORT = int(os.environ.get("PORT", "9911"))

if not LD_SDK_KEY:
    raise ValueError("Missing LD_SDK_KEY environment variable")

LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO").upper()
logging.basicConfig(level=getattr(logging, LOG_LEVEL, logging.INFO))

ld_config = ldclient.Config(LD_SDK_KEY)
ldclient.set_config(ld_config)
ld_client = ldclient.get()

if not ld_client.is_initialized():
    raise RuntimeError("LaunchDarkly client failed to initialize")

ai_client = LDAIClient(ld_client)
app = FastAPI()

# =============================================================================
# Message Conversion
# =============================================================================

def extract_text(content) -> str:
    """Extract plain text from Anthropic-style content."""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        texts = []
        for block in content:
            if isinstance(block, dict) and block.get("type") == "text":
                texts.append(block.get("text", ""))
        return "".join(texts)
    return str(content or "")


def convert_to_ld_messages(body: dict) -> list[LDMessage]:
    """Convert Anthropic Messages API format to LDMessage format."""
    messages = []

    system = body.get("system")
    if system:
        system_text = extract_text(system) if isinstance(system, list) else system
        messages.append(LDMessage(role="system", content=system_text))

    for msg in body.get("messages", []):
        role_str = msg.get("role", "user")
        role = "assistant" if role_str == "assistant" else "user"
        messages.append(LDMessage(role=role, content=extract_text(msg.get("content", ""))))

    return messages

# =============================================================================
# Routes
# =============================================================================

@app.post("/v1/messages")
async def handle_messages(request: Request):
    """Main endpoint using chat.invoke() for automatic judge execution."""
    body = await request.json()
    user_key = request.headers.get("x-ld-user-key", "claude-code-local")

    # Build context with selectedModel for targeting
    model_key = os.environ.get("MODEL_KEY", "")
    context = (
        Context.builder(user_key)
        .set("selectedModel", model_key)
        .build()
    )

    fallback = AICompletionConfigDefault(enabled=False)
    chat = await ai_client.create_chat(LD_AI_CONFIG_KEY, context, fallback, {})

    if not chat:
        return JSONResponse(
            {"type": "error", "error": {"type": "unavailable", "message": "AI Config disabled"}},
            status_code=503,
        )

    config = chat.get_config()
    model_name = config.model.name if config.model else "unknown"
    judge_count = len(config.judge_configuration.judges) if config.judge_configuration else 0

    print(f"[REQUEST] model={model_name}, judges={judge_count}")

    try:
        ld_messages = convert_to_ld_messages(body)

        if len(ld_messages) > 1:
            chat.append_messages(ld_messages[:-1])

        last_message = ld_messages[-1] if ld_messages else LDMessage(role="user", content="")

        # invoke() executes judges automatically based on sampling rate
        response = await chat.invoke(last_message.content)

        # Await judge evaluations and log results
        if response.evaluations:
            print(f"[JUDGES] Awaiting {len(response.evaluations)} evaluations...")
            eval_results = await asyncio.gather(*response.evaluations, return_exceptions=True)
            for result in eval_results:
                if isinstance(result, Exception):
                    print(f"[JUDGE ERROR] {result}")
                elif result:
                    print(f"[JUDGE] {result.to_dict()}")

            # Flush events to LaunchDarkly
            ld_client.flush()
            await asyncio.sleep(0.1)

        response_text = response.message.content if response.message else ""

        # Get token metrics
        input_tokens = 0
        output_tokens = 0
        if response.metrics and response.metrics.usage:
            input_tokens = response.metrics.usage.input or 0
            output_tokens = response.metrics.usage.output or 0

        print(f"[METRICS] tokens={input_tokens}/{output_tokens}")

        return JSONResponse({
            "id": f"msg_{uuid.uuid4().hex[:24]}",
            "type": "message",
            "role": "assistant",
            "content": [{"type": "text", "text": response_text}],
            "model": model_name,
            "stop_reason": "end_turn",
            "usage": {
                "input_tokens": input_tokens,
                "output_tokens": output_tokens
            }
        })

    except Exception as e:
        ld_client.flush()
        logging.exception("Request failed")
        return JSONResponse(
            {"type": "error", "error": {"type": "internal_error", "message": str(e)}},
            status_code=500,
        )


@app.get("/health")
async def health():
    return {"status": "ok", "launchdarkly": ld_client.is_initialized()}


@app.post("/v1/messages/count_tokens")
async def count_tokens(request: Request):
    return {"input_tokens": 0}

# =============================================================================
# Main
# =============================================================================

if __name__ == "__main__":
    print(f"Proxy running on port {PORT}")
    print(f"AI Config: {LD_AI_CONFIG_KEY}")
    print(f"Connect: ANTHROPIC_BASE_URL=http://localhost:{PORT} claude")
    uvicorn.run(app, host="127.0.0.1", port=PORT, log_level="info")

Connect Claude Code to your proxy

Start the proxy server:

$ python server.py

You should see output like:

Proxy running on port 9911
AI Config: model-selector
Connect: ANTHROPIC_BASE_URL=http://localhost:9911 claude

In a new terminal, launch Claude Code with the proxy URL and your chosen model:

$ MODEL_KEY=sonnet ANTHROPIC_BASE_URL=http://localhost:9911 claude

Every request now routes through your proxy. Watch the server logs to see judges executing:

[REQUEST] model=claude-sonnet-4-6, judges=3
[JUDGES] Awaiting 3 evaluations...
[JUDGE] {'evals': {'security': {'score': 1.0, 'reasoning': 'No vulnerabilities detected...'}}}
[JUDGE] {'evals': {'api-contract': {'score': 0.5, 'reasoning': 'Response uses correct endpoint...'}}}
[JUDGE] {'evals': {'minimal-change': {'score': 1.0, 'reasoning': 'Changes are focused...'}}}

The key pattern for automatic judge evaluation

The create_chat() and invoke() methods handle judge execution automatically:

chat = await ai_client.create_chat(config_key, context, fallback, {})
response = await chat.invoke(user_message)
# response.evaluations contains async judge tasks

Judge results are sent to LaunchDarkly automatically. You can optionally await response.evaluations to log results locally.
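If you do await them, the gather pattern with return_exceptions=True keeps one failing judge from hiding the others' results. Here is that pattern in isolation, with stand-in coroutines in place of real judge tasks:

```python
import asyncio

async def fake_judge(name: str, score: float) -> dict:
    # Stand-in for a judge evaluation task; in the proxy, the real tasks
    # come from response.evaluations after chat.invoke().
    return {"judge": name, "score": score}

async def main() -> list:
    evaluations = [
        asyncio.ensure_future(fake_judge("security", 1.0)),
        asyncio.ensure_future(fake_judge("minimal-change", 0.5)),
    ]
    # return_exceptions=True returns exceptions as values rather than
    # raising, so every completed evaluation is still reported.
    return await asyncio.gather(*evaluations, return_exceptions=True)

results = asyncio.run(main())
```

Results come back in the same order the tasks were passed to gather, which is what lets the proxy log each judge's score next to its name.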

Tool features aren't supported

This proxy handles text-based conversations. Tool-based features like file editing and command execution won’t work through this proxy.

How model routing works

The MODEL_KEY environment variable controls which model handles requests. The proxy passes it as a selectedModel context attribute:

context = Context.builder(user_key).set("selectedModel", model_key).build()

Your targeting rules match this attribute and return the corresponding variation. Switch models by changing the environment variable:

$ MODEL_KEY=mistral ANTHROPIC_BASE_URL=http://localhost:9911 claude

Compare cloud and local models

To evaluate Ollama models against cloud providers:

  1. Add an “ollama” variation to your model-selector AI Config.
  2. Add a targeting rule for selectedModel equals “ollama”.
  3. Launch with MODEL_KEY=ollama.

Your custom judges score Claude Sonnet and Llama 3.2 with identical criteria. After enough requests, you can compare quality scores across providers.

Run experiments

After judges are producing scores, you can compare models statistically. Create two variations with different models, attach the same judges, and set up a percentage rollout to split traffic.

Your judge metrics appear as goals in LaunchDarkly Experimentation. After enough data, you can answer “Which model produces more secure code?” with confidence, not guesswork.

To learn more, read Experimentation with AI Configs.

Monitor quality over time

Judge scores appear on your AI Config’s Monitoring tab. To view evaluation metrics:

  1. Open your model-selector AI Config and go to the Monitoring tab.
  2. Select Evaluator metrics from the dropdown menu.

Select Evaluator metrics from the dropdown

  3. Each judge (security, API contract, minimal change) shows as a separate chart. Hover over a chart to see scores broken down by variation.

Security judge scores over time

API contract adherence scores

Minimal change judge scores

  4. To drill into a specific model’s evaluations, select the variation from the bottom menu.

Select a variation to see its evaluations

Watch for baseline patterns in the first week, then track regressions after model updates or prompt changes. Model providers ship updates without notice. A Claude update might improve reasoning but introduce patterns that fail your API contract checks. Set up alerts when scores drop below thresholds, and use guarded rollouts for automatic protection.

To learn more, read Monitor AI Configs.

Control costs with sampling

Each judge evaluation is an LLM call. Control costs by adjusting sampling rates:

  • Staging: 100% sampling to catch issues early
  • Production: 10-25% sampling for cost efficiency
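The cost tradeoff is easy to estimate: each sampled request triggers one LLM call per attached judge. A quick sketch:

```python
def judge_calls(requests: int, judges: int, sampling_rate: float) -> int:
    """Expected number of judge LLM calls for a given sampling rate."""
    return round(requests * judges * sampling_rate)

# With the three judges from this tutorial, 1,000 requests cost
# 3,000 judge calls at 100% sampling but only 750 at 25%.
staging = judge_calls(1_000, 3, 1.0)
production = judge_calls(1_000, 3, 0.25)
```

Dropping production sampling from 100% to 25% cuts evaluation spend by three quarters while still producing enough scores to track trends.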

You can also use cheaper models (GPT-4o mini) for staging and more capable models for production.

What you learned

The value is in the judges you create. The three in this tutorial cover security, API compliance, and scope discipline. Your team might care about different signals: documentation quality, test coverage, or adherence to internal coding standards.

Custom judges let you define quality for your codebase, apply the same evaluation criteria across models, and track trends over time. Once you create a judge, you can attach it to any AI Config in your project.

Start your free trial

Ready to build custom judges for your codebase? Start your 14-day free trial and deploy your first evaluation today.

Footnotes

  1. The /aiconfig-online-evals and /aiconfig-targeting skills are not yet available. Use the dashboard to complete those steps.