Evaluate LLM code generation with LLM-as-judge evaluators
Published March 3, 2026
Which AI model writes the best code for your codebase? Not “best” in general, but best for your security requirements, your API schemas, and your team’s blind spots.
This tutorial shows you how to score every code generation response against custom criteria you define. You’ll set up custom judges that check for the vulnerabilities you actually care about, validate against your real API conventions, and flag the scope creep patterns your team keeps running into. After a few weeks of data, you’ll have evidence to choose which model to use for which tasks.
What you will build
In this tutorial you build a proxy server that routes Claude Code requests through LaunchDarkly. You can forward requests to any model: Anthropic, OpenAI, Mistral, or local Ollama instances. Every response gets scored by custom judges you create.
You will build three judges:
- Security: Checks for SQL injection, XSS, hardcoded secrets, and the specific vulnerabilities you care about
- API contract: Validates code against your schema conventions
- Minimal change: Flags scope creep and unnecessary modifications
After setup, you use Claude Code normally, and scores flow to the LaunchDarkly Monitoring dashboard automatically. Over time, you build a dataset grounded in your actual usage: maybe Sonnet scores consistently higher on security, but Opus handles API contract adherence better on complex endpoints. That’s the kind of answer a generic benchmark can’t give you.
To learn more, read Online evaluations or watch the Introducing Judges video tutorial.
Prerequisites
- LaunchDarkly account with AI Configs enabled
- Python 3.9+
- LaunchDarkly Python AI SDK v0.14.0+ (`launchdarkly-server-sdk-ai`)
- API keys for your model providers
- Claude Code installed
How the proxy works
This proxy implements a minimal Anthropic Messages-style gateway for text-only code generation and automatic quality scoring.
When Claude Code sends a request to POST /v1/messages, the proxy:
1. Extracts text-only prompts. It converts the Anthropic Messages body into LaunchDarkly `LDMessages`, keeping only text content. It ignores tool blocks, images, and other non-text content.
2. Routes the request through LaunchDarkly AI Configs. The proxy creates a context with a `selectedModel` attribute. Your model-selector AI Config uses targeting rules on this attribute to pick the right model variation.
3. Invokes the model and triggers judges. The proxy calls `chat.invoke()`. If the selected variation has judges attached, the SDK schedules judge evaluations automatically based on your sampling rate. Scores flow to LaunchDarkly Monitoring.
4. Returns a standard Messages response. The proxy sends back the assistant response as a single text block, plus basic token usage if available.
Claude Code talks to a local /v1/messages endpoint. LaunchDarkly handles model selection and online evaluations behind the scenes.
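Step 1 of that flow, keeping only text content, can be sketched as a pure helper. This is a simplified illustration: the real proxy would map these dicts into the SDK's `LDMessages` type rather than plain `{"role", "content"}` dicts.

```python
def extract_text_messages(body: dict) -> list[dict]:
    """Keep only text content from an Anthropic Messages request body.

    Content may be a plain string or a list of typed blocks; tool_use,
    tool_result, and image blocks are dropped. Returns simple
    {"role", "content"} dicts as a stand-in for the SDK's message type.
    """
    messages = []
    for msg in body.get("messages", []):
        content = msg.get("content")
        if isinstance(content, str):
            text = content
        else:
            # Join only the text blocks; ignore tools, images, etc.
            text = "\n".join(
                block.get("text", "")
                for block in content or []
                if block.get("type") == "text"
            )
        if text:
            messages.append({"role": msg.get("role", "user"), "content": text})
    return messages
```

A request that mixes text and tool blocks comes out text-only, which matches the proxy's text-only contract.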
Create the AI Config and judges
You can use the LaunchDarkly dashboard or Claude Code with agent skills. Agent skills are faster if you have them installed.1
Option A: Agent skills
Create the project:
Create the model selector:
Create the security judge:
Create the API contract judge:
Create the minimal change judge:
Attach judges to the model selector:
Set up targeting:
For each AI Config, go to the Targeting tab and edit the default rule to serve the variation you created. For the model selector, also add rules that match the `selectedModel` context attribute.
When the proxy sends `selectedModel: "sonnet"`, LaunchDarkly returns the Sonnet variation. To learn more, read Target with AI Configs.
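Conceptually, the rules behave like a lookup with a fallback. The toy sketch below illustrates the behavior only, not LaunchDarkly's rule evaluation engine, and assumes Opus as the default variation as in the dashboard steps later in this tutorial:

```python
def select_variation(selected_model: str) -> str:
    """Toy stand-in for the model-selector targeting rules: known keys map
    to their variation; anything else falls through to the default rule."""
    rules = {"sonnet": "sonnet", "opus": "opus", "mistral": "mistral"}
    return rules.get(selected_model, "opus")  # default rule serves Opus
```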
Option B: LaunchDarkly dashboard
Step 1: Create the model selector config
- Go to AI Configs and click Create AI Config.
- Set the mode to Completion, the key to `model-selector`, and name it “Model Selector”.
- Add three variations with empty messages (this config acts as a router):
  - Sonnet (key: `sonnet`) using `claude-sonnet-4-6`
  - Opus (key: `opus`) using `claude-opus-4-6`
  - Mistral (key: `mistral`) using `mistral-large@2407`
Step 2: Create the judge AI Configs
- Click Create AI Config and set the mode to Judge.
- Set the key (for example, `security-judge`) and name (for example, “Security Judge”).
- Set the Event key to the metric you want to track (for example, `$ld:ai:judge:security`).
- Add the system prompt with scoring criteria from the prompts in Option A.
- Set the model to `gpt-5-mini` with temperature `0.3`.
- Repeat for each judge: security, API contract adherence, and minimal change.
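For reference, the security judge's system prompt might look like the sketch below. This is illustrative only: adapt the rubric to your stack, and match the output format to whatever score format your judge metric expects.

```text
You are a security reviewer for generated code. Evaluate the assistant's
response for:
- SQL injection (string-built queries, missing parameterization)
- XSS (unescaped user input rendered into HTML or JavaScript)
- Hardcoded secrets (API keys, passwords, tokens in source)
- Unsafe handling of user-supplied paths or shell commands

Score from 0.0 (serious vulnerability present) to 1.0 (no issues found),
with a one-sentence rationale citing the specific line or pattern.
```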

Step 3: Attach judges to the model selector
- Open the Model Selector AI Config and go to the Variations tab.
- Expand a variation (for example, Sonnet) and find the Judges section.
- Click Attach judges.

- Select the judges you created and set the sampling percentage to 100%.
- Repeat for each variation.

Step 4: Configure targeting rules
- Go to the Targeting tab for the Model Selector.
- Add rules to route requests based on the `selectedModel` context attribute:
  - If `selectedModel` is `mistral`, serve the Mistral variation
  - If `selectedModel` is `sonnet`, serve the Sonnet variation
  - Default rule: serve Opus
- For each judge, set the default rule to serve the variation you created.

To learn more, read Custom judges.
Verify your setup
Before running the proxy, confirm in the dashboard:
- Model selector: Each variation shows three attached judges.
- Judges: Each judge prompt includes scoring criteria.
- Targeting: All AI Configs have targeting enabled with correct rules.
Set up the project
Create a directory and install dependencies:
Create .env:
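For example (the variable names below are illustrative; use whatever names your server code reads, and keep real keys out of version control):

```bash
# .env — illustrative names and placeholder values
LAUNCHDARKLY_SDK_KEY=sdk-...
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...        # only if you route to OpenAI models
MISTRAL_API_KEY=...          # only if you route to Mistral
MODEL_KEY=sonnet             # which model-selector variation to target
```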
Build the proxy server
Create server.py with the following code.
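The full server is too long to reproduce here, but its core transformations can be sketched with the standard library alone. The handler outline at the bottom is comments only: the web framework and the exact `launchdarkly-server-sdk-ai` calls are assumptions to verify against the SDK documentation.

```python
import time
from typing import Optional

def to_messages_response(text: str, model: str, usage: Optional[dict] = None) -> dict:
    """Shape the assistant reply as an Anthropic Messages-style response
    with a single text content block (step 4 of the proxy flow)."""
    response = {
        "id": f"msg_{int(time.time() * 1000)}",
        "type": "message",
        "role": "assistant",
        "model": model,
        "content": [{"type": "text", "text": text}],
        "stop_reason": "end_turn",
    }
    if usage is not None:
        response["usage"] = usage
    return response

# Outline of the POST /v1/messages handler (comments only — framework and
# SDK call names are assumptions, not the SDK's confirmed API):
#
#   @app.post("/v1/messages")
#   async def messages(body: dict):
#       msgs = ...       # text-only extraction of body["messages"]
#       context = ...    # LD context carrying the selectedModel attribute
#       chat = ai_client.create_chat("model-selector", context, default_config)
#       response = chat.invoke(msgs)   # judges sampled automatically
#       return to_messages_response(response_text, model_name, usage)
```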
Connect Claude Code to your proxy
Start the proxy server:
You should see output like:
In a new terminal, launch Claude Code with the proxy URL and your chosen model:
Every request now routes through your proxy. Watch the server logs to see judges executing:
The key pattern for automatic judge evaluation
The create_chat() and invoke() methods handle judge execution automatically:
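Conceptually, judge scheduling behaves like the toy sketch below: each attached judge fires with probability equal to the sampling rate. This illustrates the behavior, not the SDK's internals, and the `create_chat()` / `invoke()` calls appear only as hedged comments because their exact signatures should be checked against the SDK docs.

```python
import random

# Hedged outline of the SDK pattern (exact signatures may differ):
#   chat = ai_client.create_chat("model-selector", context, default_config)
#   response = chat.invoke(messages)            # judges scheduled automatically
#   evaluations = await response.evaluations    # optional: inspect judge scores

def sample_judges(judges: list, sampling_rate: float, rng=random.random) -> list:
    """Return the judges that would run for this invocation.

    With sampling_rate=1.0 every judge runs; with 0.25 each judge runs on
    roughly a quarter of requests.
    """
    return [judge for judge in judges if rng() < sampling_rate]
```

The `rng` parameter exists only so the behavior is easy to test deterministically.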
Judge results are sent to LaunchDarkly automatically. You can optionally await response.evaluations to log results locally.
Tool features aren’t supported
This proxy handles text-based conversations. Tool-based features like file editing and command execution won’t work through this proxy.
How model routing works
The `MODEL_KEY` environment variable controls which model handles requests. The proxy passes it as a `selectedModel` context attribute:
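A sketch of that hand-off, using a plain dict in place of the SDK's context builder (the builder API itself is not shown; attribute names follow the targeting rules above):

```python
import os

def build_context_attributes(default_model: str = "opus") -> dict:
    """Read MODEL_KEY and expose it as the selectedModel context attribute.

    A plain-dict stand-in: the real proxy would feed these values into the
    LaunchDarkly SDK's context builder.
    """
    return {
        "kind": "user",
        "key": "claude-code-proxy",  # any stable context key works
        "selectedModel": os.environ.get("MODEL_KEY", default_model),
    }
```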
Your targeting rules match this attribute and return the corresponding variation. Switch models by changing the environment variable:
Compare cloud and local models
To evaluate Ollama models against cloud providers:
- Add an “ollama” variation to your model-selector AI Config.
- Add a targeting rule for `selectedModel` equals “ollama”.
- Launch with `MODEL_KEY=ollama`.
Your custom judges score Claude Sonnet and Llama 3.2 with identical criteria. After enough requests, you can compare quality scores across providers.
Run experiments
After judges are producing scores, you can compare models statistically. Create two variations with different models, attach the same judges, and set up a percentage rollout to split traffic.
Your judge metrics appear as goals in LaunchDarkly Experimentation. After enough data, you can answer “Which model produces more secure code?” with confidence, not guesswork.
To learn more, read Experimentation with AI Configs.
Monitor quality over time
Judge scores appear on your AI Config’s Monitoring tab. To view evaluation metrics:
- Open your model-selector AI Config and go to the Monitoring tab.
- Select Evaluator metrics from the dropdown menu.

- Each judge (security, API contract, minimal change) shows as a separate chart. Hover over a chart to see scores broken down by variation.



- To drill into a specific model’s evaluations, select the variation from the bottom menu.

Watch for baseline patterns in the first week, then track regressions after model updates or prompt changes. Model providers ship updates without notice. A Claude update might improve reasoning but introduce patterns that fail your API contract checks. Set up alerts when scores drop below thresholds, and use guarded rollouts for automatic protection.
To learn more, read Monitor AI Configs.
Control costs with sampling
Each judge evaluation is an LLM call. Control costs by adjusting sampling rates:
- Staging: 100% sampling to catch issues early
- Production: 10-25% sampling for cost efficiency
You can also use cheaper models (GPT-4o mini) for staging and more capable models for production.
What you learned
The value is in the judges you create. The three in this tutorial cover security, API compliance, and scope discipline. Your team might care about different signals: documentation quality, test coverage, or adherence to internal coding standards.
Custom judges let you define quality for your codebase, apply the same evaluation criteria across models, and track trends over time. Once you create a judge, you can attach it to any AI Config in your project.
Start your free trial
Ready to build custom judges for your codebase? Start your 14-day free trial and deploy your first evaluation today.
Next steps
- hello-python-ai examples for more judge patterns
- AI Configs best practices for production patterns
Footnotes
1. The `/aiconfig-online-evals` and `/aiconfig-targeting` skills are not yet available. Use the dashboard to complete those steps.
