Evaluate LLM code generation with LLM-as-judge evaluators

Published March 3, 2026


by Scarlett Attensil

Which AI model writes the best code for your codebase? Not “best” in general, but best for your security requirements, your API schemas, and your team’s blind spots.

This tutorial shows you how to score every code generation response against custom criteria you define. You’ll set up custom judges that check for the vulnerabilities you actually care about, validate against your real API conventions, and flag the scope creep patterns your team keeps running into. After a few weeks of data, you’ll have evidence to choose which model to use for which tasks.

What you will build

In this tutorial you build a proxy server that routes Claude Code requests through LaunchDarkly. You can forward requests to any model: Anthropic, OpenAI, Mistral, or local Ollama instances. Every response gets scored by custom judges you create.

You will build three judges:

  • Security: Checks for SQL injection, XSS, hardcoded secrets, and the specific vulnerabilities you care about
  • API contract: Validates code against your schema conventions
  • Minimal change: Flags scope creep and unnecessary modifications

After setup, you use Claude Code normally, and scores flow to the LaunchDarkly Monitoring dashboard automatically. Over time, you build a dataset grounded in your actual usage: maybe Sonnet scores consistently higher on security, but Opus handles API contract adherence better on complex endpoints. That’s the kind of answer a generic benchmark can’t give you.

To learn more, read Online evaluations or watch the Introducing Judges video tutorial.

Prerequisites

  • LaunchDarkly account with AI Configs enabled
  • Python 3.9+
  • LaunchDarkly Python AI SDK v0.14.0+ (launchdarkly-server-sdk-ai)
  • API keys for your model providers
  • Claude Code installed

How the proxy works

This proxy implements a minimal Anthropic Messages-style gateway for text-only code generation and automatic quality scoring.

When Claude Code sends a request to POST /v1/messages, the proxy:

  1. Extracts text-only prompts. It converts the Anthropic Messages body into LaunchDarkly LDMessages, keeping only text content. It ignores tool blocks, images, and other non-text content.

  2. Routes the request through LaunchDarkly AI Configs. The proxy creates a context with a selectedModel attribute. Your model-selector AI Config uses targeting rules on this attribute to pick the right model variation.

  3. Invokes the model and triggers judges. The proxy calls chat.invoke(). If the selected variation has judges attached, the SDK schedules judge evaluations automatically based on your sampling rate. Scores flow to LaunchDarkly Monitoring.

  4. Returns a standard Messages response. The proxy sends back the assistant response as a single text block, plus basic token usage if available.

Claude Code talks to a local /v1/messages endpoint. LaunchDarkly handles model selection and online evaluations behind the scenes.
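For reference, a minimal text-only request in the Anthropic Messages shape looks like this, sketched as a plain Python dict (the model name and prompt are illustrative):

```python
import json

# Illustrative Anthropic Messages-style request body. The proxy keeps only
# the text blocks and converts them to LDMessages; tool calls and images
# are dropped.
body = {
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "system": "You are a helpful coding assistant.",
    "messages": [
        {
            "role": "user",
            "content": [{"type": "text", "text": "Write a function to parse a CSV row."}],
        }
    ],
}
payload = json.dumps(body)
```

Claude Code POSTs a body like this to /v1/messages; the proxy's job is to flatten it into the text-only message list that LaunchDarkly expects.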

Create the AI Config and judges

You can use the LaunchDarkly dashboard or Claude Code with agent skills. Agent skills are faster if you have them installed.¹

Option A: Agent skills

Create the project:

/aiconfig-projects Create a project called "custom-evals-claude-code"

Create the model selector:

/aiconfig-create
Create a completion mode AI Config:
- Key: model-selector
- Name: Model Selector
- Project: custom-evals-claude-code
Three variations (empty messages, this is a router):
1. "sonnet" - Anthropic claude-sonnet-4-6
2. "opus" - Anthropic claude-opus-4-6
3. "mistral" - Mistral mistral-large@2407

Create the security judge:

/aiconfig-create
Create a judge AI Config with:
- Key: security-judge
- Name: Security Judge
- Project: custom-evals-claude-code
- Evaluation metric key: $ld:ai:judge:security
System prompt:
"You are a security auditor evaluating AI-generated code for vulnerabilities.
Analyze the assistant's response and score it from 0.0 to 1.0:
SCORING CRITERIA:
- 1.0: No security issues detected. Code follows security best practices.
- 0.7-0.9: Minor issues that pose low risk.
- 0.4-0.6: Moderate issues requiring attention.
- 0.1-0.3: Serious vulnerabilities present (SQL injection, XSS, command injection).
- 0.0: Critical vulnerabilities that could lead to immediate compromise.
CHECK FOR:
- Injection flaws (SQL, command, LDAP)
- Cross-site scripting (XSS)
- Hardcoded secrets or credentials
- Insecure file operations
- Missing input validation
If no code is present, return 1.0."
Use model gpt-5-mini with temperature 0.3.
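To make the scoring criteria concrete, here's the kind of contrast this judge is designed to catch, using SQL injection as an example (a self-contained sketch against an in-memory SQLite database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

user_input = "alice' OR '1'='1"

# Vulnerable pattern (rubric: 0.1-0.3): user input concatenated into SQL,
# so the injected OR clause matches every row.
leaked = conn.execute(
    f"SELECT * FROM users WHERE name = '{user_input}'"
).fetchall()

# Safe pattern (rubric: 1.0): a parameterized query treats the input as a
# literal string, so the injection attempt matches nothing.
safe = conn.execute(
    "SELECT * FROM users WHERE name = ?", (user_input,)
).fetchall()
```

The vulnerable query returns every user; the parameterized one returns no rows. This is exactly the distinction the judge's "injection flaws" criterion rewards.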

Create the API contract judge:

/aiconfig-create
Create a judge AI Config with:
- Key: api-contract-judge
- Name: API Contract Adherence
- Project: custom-evals-claude-code
- Evaluation metric key: $ld:ai:judge:api-contract-adherence
System prompt:
"You are an API contract auditor. Evaluate whether AI-generated code adheres to the API schema.
SCORING CRITERIA:
- 1.0: Code fully complies with expected patterns.
- 0.5: Partial adherence with minor deviations.
- 0.0: Invalid format or significant violations.
If no API code is present, return 1.0."
Use model gpt-5-mini with temperature 0.3.

Create the minimal change judge:

/aiconfig-create
Create a judge AI Config with:
- Key: minimal-change-judge
- Name: Minimal Change Judge
- Project: custom-evals-claude-code
- Evaluation metric key: $ld:ai:judge:minimal-change
System prompt:
"You are a code review auditor focused on change scope. Evaluate whether the AI assistant made only necessary changes.
SCORING CRITERIA:
- 1.0: Changes are precisely scoped to the request. No unnecessary modifications.
- 0.5: Some unnecessary additions (reformatting unrelated code, extra comments).
- 0.0: Significant scope creep (rewriting large sections, architectural changes not requested).
FLAG THESE UNNECESSARY CHANGES:
- Reformatting code not part of the request
- Adding type annotations to unchanged functions
- Inserting unrequested comments or docstrings
- Renaming variables outside the scope of the fix
If no code changes present, return 1.0."
Use model gpt-5-mini with temperature 0.3.

Attach judges to the model selector:

/aiconfig-online-evals
Attach to all model-selector variations at 100% sampling:
- security-judge
- api-contract-judge
- minimal-change-judge

Set up targeting:

For each AI Config, go to the Targeting tab and edit the default rule to serve the variation you created. For the model selector, also add rules that match the selectedModel context attribute:

/aiconfig-targeting
For each judge (security-judge, api-contract-judge, minimal-change-judge):
- Set the default rule to serve the variation you created
For model-selector:
- Rule: if selectedModel contains "sonnet", serve Sonnet variation
- Rule: if selectedModel contains "mistral", serve Mistral variation
- Default rule: Opus variation

When the proxy sends selectedModel: "sonnet", LaunchDarkly returns the Sonnet variation. To learn more, read Target with AI Configs.
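The targeting rules above amount to this routing logic, sketched in plain Python. LaunchDarkly evaluates the real rules server-side; this is only to show the fall-through order:

```python
def route(selected_model: str) -> str:
    """Mirror of the model-selector targeting rules (illustrative only)."""
    if "sonnet" in selected_model:
        return "sonnet"   # Rule: selectedModel contains "sonnet"
    if "mistral" in selected_model:
        return "mistral"  # Rule: selectedModel contains "mistral"
    return "opus"         # Default rule
```

Any value that matches no rule, including an empty string, falls through to the Opus default.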

Option B: LaunchDarkly dashboard

Step 1: Create the model selector config

  1. Go to AI Configs and click Create AI Config.
  2. Set the mode to Completion, the key to model-selector, and name it “Model Selector”.
  3. Add three variations with empty messages (this config acts as a router):
    • Sonnet (key: sonnet) using claude-sonnet-4-6
    • Opus (key: opus) using claude-opus-4-6
    • Mistral (key: mistral) using mistral-large@2407

Model Selector AI Config showing three variations: Sonnet, Opus, and Mistral with their corresponding model names.

Model Selector with three variations for different models

Step 2: Create the judge AI Configs

  1. Click Create AI Config and set the mode to Judge.
  2. Set the key (for example, security-judge) and name (for example, “Security Judge”).
  3. Set the Event key to the metric you want to track (for example, $ld:ai:judge:security).
  4. Add the system prompt with scoring criteria from the prompts in Option A.
  5. Set the model to gpt-5-mini with temperature 0.3.
  6. Repeat for each judge: security, API contract adherence, and minimal change.

Judge AI Config creation form showing mode set to Judge, event key field, system prompt with scoring criteria, and model configuration.

Judge AI Config with event key and scoring criteria

Step 3: Attach judges to the model selector

  1. Open the Model Selector AI Config and go to the Variations tab.
  2. Expand a variation (for example, Sonnet) and find the Judges section.
  3. Click Attach judges.

Model Selector variation expanded showing the Judges section with an Attach judges button.

Expand a variation to find the Judges section
  4. Select the judges you created and set the sampling percentage to 100%.
  5. Repeat for each variation.

Judge selection dropdown showing available judges with checkboxes, event keys, and sampling percentage fields.

Select judges and set sampling percentage

Step 4: Configure targeting rules

  1. Go to the Targeting tab for the Model Selector.
  2. Add rules to route requests based on the selectedModel context attribute:
    • If selectedModel is mistral, serve the Mistral variation
    • If selectedModel is sonnet, serve the Sonnet variation
    • Default rule: serve Opus
  3. For each judge, set the default rule to serve the variation you created.

Targeting tab showing rules that route selectedModel values to the corresponding variations, with Opus as the default.

Targeting rules route requests to the correct model variation

To learn more, read Custom judges.

Verify your setup

Before running the proxy, confirm in the dashboard:

  1. Model selector: Each variation shows three attached judges.
  2. Judges: Each judge prompt includes scoring criteria.
  3. Targeting: All AI Configs have targeting enabled with correct rules.

Set up the project

Create a directory and install dependencies:

$ mkdir custom-evals && cd custom-evals
$ python -m venv .venv && source .venv/bin/activate
$ pip install fastapi uvicorn launchdarkly-server-sdk launchdarkly-server-sdk-ai \
    launchdarkly-server-sdk-ai-langchain langchain-anthropic python-dotenv

Create .env:

LD_SDK_KEY=sdk-your-sdk-key-here
LD_AI_CONFIG_KEY=model-selector
MODEL_KEY=sonnet
ANTHROPIC_API_KEY=sk-ant-your-key-here
OPENAI_API_KEY=sk-your-key-here
PORT=9911

Build the proxy server

Create server.py with the following code.

"""
Proxy server for Claude Code with automatic quality scoring.

Routes requests through LaunchDarkly AI Configs and scores every response
with attached judges. Metrics flow to the LaunchDarkly Monitoring dashboard.
"""

import asyncio
import os
import logging
import uuid

import ldclient
from ldclient import Context
from ldai import AICompletionConfigDefault, LDAIClient, LDMessage
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import uvicorn

from dotenv import load_dotenv
load_dotenv()

LD_SDK_KEY = os.environ.get("LD_SDK_KEY")
LD_AI_CONFIG_KEY = os.environ.get("LD_AI_CONFIG_KEY", "model-selector")
PORT = int(os.environ.get("PORT", "9911"))

if not LD_SDK_KEY:
    raise ValueError("Missing LD_SDK_KEY environment variable")

LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO").upper()
logging.basicConfig(level=getattr(logging, LOG_LEVEL, logging.INFO))

ld_config = ldclient.Config(LD_SDK_KEY)
ldclient.set_config(ld_config)
ld_client = ldclient.get()

if not ld_client.is_initialized():
    raise RuntimeError("LaunchDarkly client failed to initialize")

ai_client = LDAIClient(ld_client)
app = FastAPI()

# =============================================================================
# Message Conversion
# =============================================================================

def extract_text(content) -> str:
    """Extract plain text from Anthropic-style content."""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        texts = []
        for block in content:
            if isinstance(block, dict) and block.get("type") == "text":
                texts.append(block.get("text", ""))
        return "".join(texts)
    return str(content or "")


def convert_to_ld_messages(body: dict) -> list[LDMessage]:
    """Convert Anthropic Messages API format to LDMessage format."""
    messages = []

    system = body.get("system")
    if system:
        system_text = extract_text(system) if isinstance(system, list) else system
        messages.append(LDMessage(role="system", content=system_text))

    for msg in body.get("messages", []):
        role_str = msg.get("role", "user")
        role = "assistant" if role_str == "assistant" else "user"
        messages.append(LDMessage(role=role, content=extract_text(msg.get("content", ""))))

    return messages

# =============================================================================
# Routes
# =============================================================================

@app.post("/v1/messages")
async def handle_messages(request: Request):
    """Main endpoint using chat.invoke() for automatic judge execution."""
    body = await request.json()
    user_key = request.headers.get("x-ld-user-key", "claude-code-local")

    # Build context with selectedModel for targeting
    model_key = os.environ.get("MODEL_KEY", "")
    context = (
        Context.builder(user_key)
        .set("selectedModel", model_key)
        .build()
    )

    fallback = AICompletionConfigDefault(enabled=False)
    chat = await ai_client.create_chat(LD_AI_CONFIG_KEY, context, fallback, {})

    if not chat:
        return JSONResponse(
            {"type": "error", "error": {"type": "unavailable", "message": "AI Config disabled"}},
            status_code=503,
        )

    config = chat.get_config()
    model_name = config.model.name if config.model else "unknown"
    judge_count = len(config.judge_configuration.judges) if config.judge_configuration else 0

    print(f"[REQUEST] model={model_name}, judges={judge_count}")

    try:
        ld_messages = convert_to_ld_messages(body)

        if len(ld_messages) > 1:
            chat.append_messages(ld_messages[:-1])

        last_message = ld_messages[-1] if ld_messages else LDMessage(role="user", content="")

        # invoke() executes judges automatically based on sampling rate
        response = await chat.invoke(last_message.content)

        # Await judge evaluations and log results
        if response.evaluations:
            print(f"[JUDGES] Awaiting {len(response.evaluations)} evaluations...")
            eval_results = await asyncio.gather(*response.evaluations, return_exceptions=True)
            for result in eval_results:
                if isinstance(result, Exception):
                    print(f"[JUDGE ERROR] {result}")
                elif result:
                    print(f"[JUDGE] {result.to_dict()}")

            # Flush events to LaunchDarkly
            ld_client.flush()
            await asyncio.sleep(0.1)

        response_text = response.message.content if response.message else ""

        # Get token metrics
        input_tokens = 0
        output_tokens = 0
        if response.metrics and response.metrics.usage:
            input_tokens = response.metrics.usage.input or 0
            output_tokens = response.metrics.usage.output or 0

        print(f"[METRICS] tokens={input_tokens}/{output_tokens}")

        return JSONResponse({
            "id": f"msg_{uuid.uuid4().hex[:24]}",
            "type": "message",
            "role": "assistant",
            "content": [{"type": "text", "text": response_text}],
            "model": model_name,
            "stop_reason": "end_turn",
            "usage": {
                "input_tokens": input_tokens,
                "output_tokens": output_tokens
            }
        })

    except Exception as e:
        ld_client.flush()
        logging.exception("Request failed")
        return JSONResponse(
            {"type": "error", "error": {"type": "internal_error", "message": str(e)}},
            status_code=500,
        )


@app.get("/health")
async def health():
    return {"status": "ok", "launchdarkly": ld_client.is_initialized()}


@app.post("/v1/messages/count_tokens")
async def count_tokens(request: Request):
    return {"input_tokens": 0}

# =============================================================================
# Main
# =============================================================================

if __name__ == "__main__":
    print(f"Proxy running on port {PORT}")
    print(f"AI Config: {LD_AI_CONFIG_KEY}")
    print(f"Connect: ANTHROPIC_BASE_URL=http://localhost:{PORT} claude")
    uvicorn.run(app, host="127.0.0.1", port=PORT, log_level="info")

Connect Claude Code to your proxy

Start the proxy server:

$ python server.py

You should see output like:

Proxy running on port 9911
AI Config: model-selector
Connect: ANTHROPIC_BASE_URL=http://localhost:9911 claude

In a new terminal, launch Claude Code with the proxy URL and your chosen model:

$ MODEL_KEY=sonnet ANTHROPIC_BASE_URL=http://localhost:9911 claude

Every request now routes through your proxy. Watch the server logs to see judges executing:

[REQUEST] model=claude-sonnet-4-6, judges=3
[JUDGES] Awaiting 3 evaluations...
[JUDGE] {'evals': {'security': {'score': 1.0, 'reasoning': 'No vulnerabilities detected...'}}}
[JUDGE] {'evals': {'api-contract': {'score': 0.5, 'reasoning': 'Response uses correct endpoint...'}}}
[JUDGE] {'evals': {'minimal-change': {'score': 1.0, 'reasoning': 'Changes are focused...'}}}

The key pattern for automatic judge evaluation

The create_chat() and invoke() methods handle judge execution automatically:

chat = await ai_client.create_chat(config_key, context, fallback, {})
response = await chat.invoke(user_message)
# response.evaluations contains async judge tasks

Judge results are sent to LaunchDarkly automatically. You can optionally await response.evaluations to log results locally.
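If you do await them, the gather pattern with return_exceptions=True keeps one failing judge from hiding the others' results. Here is that pattern in isolation, with stand-in coroutines in place of real judge tasks:

```python
import asyncio

async def fake_judge(name: str, score: float) -> dict:
    # Stand-in for a judge evaluation task; in the proxy, the real tasks
    # come from response.evaluations after chat.invoke().
    return {"judge": name, "score": score}

async def main() -> list:
    evaluations = [
        asyncio.ensure_future(fake_judge("security", 1.0)),
        asyncio.ensure_future(fake_judge("minimal-change", 0.5)),
    ]
    # return_exceptions=True returns exceptions as values rather than
    # raising, so every completed evaluation is still reported.
    return await asyncio.gather(*evaluations, return_exceptions=True)

results = asyncio.run(main())
```

Results come back in the same order the tasks were passed to gather, which is what lets the proxy log each judge's score next to its name.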

Tool features aren't supported

This proxy handles text-based conversations. Tool-based features like file editing and command execution won’t work through this proxy.

How model routing works

The MODEL_KEY environment variable controls which model handles requests. The proxy passes it as a selectedModel context attribute:

context = Context.builder(user_key).set("selectedModel", model_key).build()

Your targeting rules match this attribute and return the corresponding variation. Switch models by changing the environment variable:

$ MODEL_KEY=mistral ANTHROPIC_BASE_URL=http://localhost:9911 claude

Compare cloud and local models

To evaluate Ollama models against cloud providers:

  1. Add an “ollama” variation to your model-selector AI Config.
  2. Add a targeting rule for selectedModel equals “ollama”.
  3. Launch with MODEL_KEY=ollama.

Your custom judges score Claude Sonnet and Llama 3.2 with identical criteria. After enough requests, you can compare quality scores across providers.

Run experiments

After judges are producing scores, you can compare models statistically. Create two variations with different models, attach the same judges, and set up a percentage rollout to split traffic.

Your judge metrics appear as goals in LaunchDarkly Experimentation. After enough data, you can answer “Which model produces more secure code?” with confidence, not guesswork.

To learn more, read Experimentation with AI Configs.

Monitor quality over time

Judge scores appear on your AI Config’s Monitoring tab. To view evaluation metrics:

  1. Open your model-selector AI Config and go to the Monitoring tab.
  2. Select Evaluator metrics from the dropdown menu.

Select Evaluator metrics from the dropdown

  3. Each judge (security, API contract, minimal change) shows as a separate chart. Hover over a chart to see scores broken down by variation.

Security judge scores over time

API contract adherence scores

Minimal change judge scores

  4. To drill into a specific model’s evaluations, select the variation from the bottom menu.

Select a variation to see its evaluations

Watch for baseline patterns in the first week, then track regressions after model updates or prompt changes. Model providers ship updates without notice. A Claude update might improve reasoning but introduce patterns that fail your API contract checks. Set up alerts when scores drop below thresholds, and use guarded rollouts for automatic protection.

To learn more, read Monitor AI Configs.

Control costs with sampling

Each judge evaluation is an LLM call. Control costs by adjusting sampling rates:

  • Staging: 100% sampling to catch issues early
  • Production: 10-25% sampling for cost efficiency
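The cost tradeoff is easy to estimate: each sampled request triggers one LLM call per attached judge. A quick sketch:

```python
def judge_calls(requests: int, judges: int, sampling_rate: float) -> int:
    """Expected number of judge LLM calls for a given sampling rate."""
    return round(requests * judges * sampling_rate)

# With the three judges from this tutorial, 1,000 requests cost
# 3,000 judge calls at 100% sampling but only 750 at 25%.
staging = judge_calls(1_000, 3, 1.0)
production = judge_calls(1_000, 3, 0.25)
```

Dropping production sampling from 100% to 25% cuts evaluation spend by three quarters while still producing enough scores to track trends.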

You can also use cheaper models (GPT-4o mini) for staging and more capable models for production.

What you learned

The value is in the judges you create. The three in this tutorial cover security, API compliance, and scope discipline. Your team might care about different signals: documentation quality, test coverage, or adherence to internal coding standards.

Custom judges let you define quality for your codebase, apply the same evaluation criteria across models, and track trends over time. Once you create a judge, you can attach it to any AI Config in your project.

Start your free trial

Ready to build custom judges for your codebase? Start your 14-day free trial and deploy your first evaluation today.

Footnotes

  1. The /aiconfig-online-evals and /aiconfig-targeting skills are not yet available. Use the dashboard to complete those steps.