Offline Evaluation of RAG-Grounded Answers in AgentControl

Published April 15, 2026

by Scarlett Attensil

Overview

This tutorial shows you how to run an offline LLM evaluation on the RAG-grounded support agent you built in the Agent Graphs tutorial, using AgentControl configs, the Datasets feature, and built-in LLM-as-a-judge scoring. You’ll build a RAG-grounded test dataset, run it through the Playground with a cross-family judge, and learn how to read each failing row as a dataset issue, an agent issue, or judge calibration noise.

Here’s how it works. The LaunchDarkly Playground evaluates a single model call against a prompt and dataset you configure. By pre-computing your RAG retrieval offline and baking the chunks directly into each dataset row, you turn that call into a high-value generation test: the model in the Playground receives the same documentation context it would in production, so the eval measures how well your agent reasons over real grounded input.

What You’ll Learn

  • Structure a RAG-grounded test dataset by pre-computing retrieval offline and bundling chunks into each row
  • Pick the right LLM judge for your agent’s output shape (Accuracy for natural-language answers, Likeness for structured labels)
  • Avoid same-model bias by running the judge on a different model family than the agent
  • Diagnose failing rows as dataset issues, agent issues, or judge calibration noise
What this tutorial covers, and what it doesn't

Covers:

  • Generation quality over RAG context: does the model produce a correct answer when the right documentation is in the prompt?
  • Regression detection: catching unexpected score drops when you change a prompt or model
  • Variation selection: comparing candidate prompts and models before committing to a new config variation

Does not cover:

  • Retrieval correctness. Whether your vector store is returning the best chunks is tested by your own RAG pipeline, outside LaunchDarkly.
  • End-to-end agent graph behavior. Tool execution, multi-turn conversations, handoffs, and multi-step routing require online evals against real production traffic.

Prerequisites

  • You’ve completed the Agent Graphs tutorial or have equivalent familiarity with AgentControl
  • You have the devrel-agents-tutorial repo cloned
  • You have API keys for two model providers, one for the agent under test and one for the judge (the examples use OpenAI and Anthropic)

Step 1: Get the Branch Running

About the branch and the Umbra knowledge base

The feature/offline-evals branch builds on the same Agent Graphs tutorial codebase and keeps the routing, tool, and graph work done in earlier branches. What this branch adds is a more realistic RAG assessment target: Umbra, a fictional serverless-functions product with an invented knowledge base (refund windows, deployment regions, function timeout limits, rate-limit tiers, and so on). Because Umbra doesn’t exist outside this tutorial, the model under test has no pre-training knowledge to fall back on. A correct answer has to come from the retrieved chunks, which is the only way to honestly measure whether your RAG pipeline is doing its job. The branch also ships a pre-built RAG-grounded test dataset (datasets/answer-tests.csv) and a helper script (tools/build_rag_dataset.py) that regenerates it from your vector store.

$ cd devrel-agents-tutorial
$ git checkout feature/offline-evals
$ cp .env.example .env
# Add LD_SDK_KEY, LD_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY to .env

$ uv sync
$ uv run python bootstrap/create_configs.py
$ uv run python initialize_embeddings.py

Start the API and UI in two terminals:

# Terminal 1
$ uv run uvicorn api.main:app --reload

# Terminal 2
$ uv run streamlit run ui/chat_interface.py

Open http://localhost:8501 and ask a question grounded in the Umbra docs (refund policy, deployment regions, function timeout). The agent pulls answers from the knowledge base.

The Umbra support chat UI answering a question grounded in the Umbra knowledge base.

Step 2: Understand the Test Dataset

Open datasets/answer-tests.csv. Every row has three fields:

input,expected_output,original_question
"Documentation context: --- We offer a 30-day refund policy for first-time subscribers... --- Annual subscriptions receive a prorated refund within... --- Question: What is the refund policy?","30-day refund policy for first-time subscribers who haven't deployed production traffic. Usage charges are non-refundable.","What is the refund policy?"

  • input bundles documentation chunks and the question into a single structured prompt, separated by --- dividers. The chunks were retrieved from your production vector store ahead of time by tools/build_rag_dataset.py, so the model in the Playground sees the same grounding the production agent would, even though the Playground never executes your retrieval tools.
  • expected_output is the correct answer, written by a human who read the source docs.
  • original_question is a plain-text copy of the question so you can scan the dataset without parsing the bundled prompt. No judge uses this field.
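
As a rough illustration of the row layout described above, the sketch below bundles pre-retrieved chunks and a question into one input string and writes the three columns to CSV. It is not the code in tools/build_rag_dataset.py: the function names, the `chunks`/`question` keys, and the exact newline placement around the --- dividers are assumptions made for this example.

```python
import csv

FIELDNAMES = ["input", "expected_output", "original_question"]


def build_input(chunks, question):
    """Bundle retrieved chunks and the question into one prompt string.

    Assumed layout: a header line, each chunk preceded by a '---' divider,
    then a final divider and the question.
    """
    parts = ["Documentation context:"]
    for chunk in chunks:
        parts.append("---")
        parts.append(chunk.strip())
    parts.append("---")
    parts.append(f"Question: {question}")
    return "\n".join(parts)


def write_dataset(rows, out):
    """Write rows to an open text file object as a three-column CSV.

    Each row is a dict with hypothetical keys: 'chunks' (list of str),
    'question' (str), and 'expected_output' (str, written by a human).
    """
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        writer.writerow(
            {
                "input": build_input(row["chunks"], row["question"]),
                "expected_output": row["expected_output"],
                "original_question": row["question"],
            }
        )
```

In this sketch retrieval happens before `write_dataset` is called, so the CSV carries the grounding with it and nothing at evaluation time needs access to the vector store.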

Regenerate the dataset when your knowledge base changes:

$ uv run python tools/build_rag_dataset.py

For the full reference on dataset format and limits, see Datasets for offline evaluations.

Step 3: Upload the Dataset

Use synthetic data only

Never upload real customer tickets, PII, secrets, or credentials. Replace anything sensitive with synthetic placeholders before upload. See the Playground privacy section for what gets forwarded to model providers.
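
One way to back up this rule is a quick local scan of the CSV before upload. The sketch below is a minimal example, not a LaunchDarkly feature: the three regular expressions are illustrative assumptions and will miss many kinds of sensitive data, so treat a clean result as a sanity check rather than proof.

```python
import csv
import re

# Illustrative patterns only (assumptions for this sketch). Tune them to
# the kinds of identifiers and credentials that appear in your own data.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "api_key_like": re.compile(r"\b(?:sk|pk|api)[-_][A-Za-z0-9_-]{16,}\b"),
    "us_ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_rows(rows):
    """Return (row_number, column, pattern_name) for each suspicious cell."""
    findings = []
    for row_number, row in enumerate(rows, start=1):
        for column, value in row.items():
            for name, pattern in PATTERNS.items():
                if pattern.search(value or ""):
                    findings.append((row_number, column, name))
    return findings


def scan_csv(path):
    """Scan every cell of a CSV file that has a header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return scan_rows(csv.DictReader(handle))
```

Calling `scan_csv("datasets/answer-tests.csv")` returns an empty list when no pattern matches. Any finding is worth a manual look before the file leaves your machine.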

Navigate to AI > Library in LaunchDarkly, select the Datasets tab, and click Upload dataset. Upload datasets/answer-tests.csv and name it answer-tests.

The LaunchDarkly Datasets tab showing the answer-tests dataset uploaded.

Step 4: Add Your Model API Keys

The Playground calls model providers directly, so it needs API keys for both the model running your agent and the model running your judge. These keys live in LaunchDarkly’s “AgentControl Test Run” integration, not in your config.

  1. In the Playground, click Manage API keys in the upper-right corner.
  2. Click Add integration, pick a provider (e.g. OpenAI), paste your API key, accept the terms, and save.
  3. Repeat for the second provider (Anthropic) so you can run a cross-family judge in Step 5.
The Manage API keys page in LaunchDarkly for adding a model provider credential to the Playground.

See the Playground reference doc for the canonical instructions. API keys are stored per-session, so you may need to re-paste them when you return.

Step 5: Run the Evaluation

From the Datasets list, click into answer-tests to open it in a Playground bound to that dataset.

Configure the test

  • System prompt: paste your support-agent instructions verbatim from the config. Do not edit or simplify them.
  • Agent model: pick the model your support-agent variation uses (or a candidate you’re considering swapping to). To compare two candidates, run the eval twice with different agent models and compare scores.
  • Acceptance criteria: attach an Accuracy judge with threshold 0.85. Accuracy scores whether the response correctly addresses the input question, which fits grounded natural-language answers.
  • Evaluation model: uncheck Use same model for evaluation and set the judge to a different model family from the agent. Same-family judging tends to reward output patterns the judge itself produces. A cross-family judge gives you an independent read.
The Playground configured with the support-agent prompt, OpenAI as the agent, Anthropic as the evaluation model, and an Accuracy judge at 0.85 threshold.
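
If you script your eval setup notes or review checklists, the cross-family rule can be expressed as a small guard. The sketch below is an assumption-heavy illustration: the prefix-to-family map and the model names in the usage note are examples chosen for this tutorial's OpenAI and Anthropic pairing, and nothing here calls LaunchDarkly.

```python
# Hypothetical prefix-to-family map; extend it for the models you use.
FAMILY_PREFIXES = {
    "gpt-": "openai",
    "claude-": "anthropic",
    "gemini-": "google",
}


def model_family(model_name):
    """Map a model name to a provider family, or 'unknown' if unrecognized."""
    name = model_name.lower()
    for prefix, family in FAMILY_PREFIXES.items():
        if name.startswith(prefix):
            return family
    return "unknown"


def is_cross_family(agent_model, judge_model):
    """True only when both families are known and they differ."""
    agent = model_family(agent_model)
    judge = model_family(judge_model)
    return "unknown" not in (agent, judge) and agent != judge
```

For example, `is_cross_family("gpt-4o", "claude-3-5-sonnet")` is true, while pairing two models with the same prefix is not. Unrecognized names return false on purpose, so an unmapped model prompts a manual check instead of passing silently.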

Run the eval.

Reading the results

The evaluation results showing 18 of 20 rows passing the Accuracy judge.

The example run above had 18 passes and 2 failures. When a row fails, the failure comes from one of three places, and each one sends you in a different direction:

  • The dataset’s chunks don’t contain the answer. This is a retrieval problem, not a generation problem. Rebuild the dataset with higher top_k, a reranker, or a different chunker, or verify the answer is indexed at all.
  • The chunks contain the answer but the model ignored them. This is the agent-side failure offline evals are designed to catch. Tighten the system prompt to insist on grounding, or switch to a more obedient model.
  • The chunks and the model are both fine but the judge disagreed. This is judge calibration noise. Lower the threshold, try a different judge, or accept it as noise. Don’t change your agent based on it.

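Sort by score. For each failing row, open the bundled chunks in the input field and ask: was the right answer in there? No → rebuild the dataset. Yes, and the model’s response is wrong → fix the prompt or model. Yes, and the response is actually right → treat it as judge calibration noise.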
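
The triage above can be written down as a small decision function, which is handy if you record a verdict for each failing row in a review spreadsheet or script. This is a minimal sketch: both inputs are judgments a human reviewer makes after reading the row, and the names are hypothetical.

```python
from enum import Enum


class Cause(Enum):
    DATASET = "chunks do not contain the answer: rebuild the dataset"
    AGENT = "model ignored the chunks: fix the prompt or model"
    JUDGE = "judge calibration noise: do not change the agent"


def triage(answer_in_chunks, response_is_correct):
    """Classify a failing row from two reviewer judgments.

    answer_in_chunks: did the bundled chunks contain the expected answer?
    response_is_correct: was the model's response right given those chunks?
    """
    if not answer_in_chunks:
        return Cause.DATASET
    if not response_is_correct:
        return Cause.AGENT
    return Cause.JUDGE
```

The order of the checks matters: a row whose chunks lack the answer is a dataset problem regardless of what the model said, so that question is asked first.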

What failed in this run

Row 11: “What integrations are available?” (chunks missed the answer). The expected output mentioned monitoring integrations (Datadog, Sentry, LogRocket), but the retrieved chunks only covered databases, storage, and billing. The model correctly listed what it had and said “the documentation does not provide additional information regarding more integrations”, which is the correct behavior when the context doesn’t support a claim. Fix: use a higher top_k or add a reranker in build_rag_dataset.py.

Row 12: “Can I get a refund on bandwidth overages?” (judge calibration). The model correctly said bandwidth overages are non-refundable, citing the docs, but omitted a secondary “Review your Usage Dashboard” recommendation from the expected output. Semantically right, lexically short one clause. Fix: lower the threshold or trim the expected output.

Two failures, two different fixes. Without reading the per-row results you’d conflate them and spend time tuning the prompt or model when the actual problem lives in the retriever or the dataset.

Where to Go From a Single Run

This tutorial walked you through one run. In practice, a single eval isn’t where offline evaluation earns its keep. The real payoff comes from re-running the same dataset against a new prompt, a new model, or a fresh RAG chunker and comparing scores to your last known-good run. A small prompt edit that quietly drops your Accuracy from 0.83 to 0.71 is exactly the kind of regression this pattern is meant to catch, but only if you save the run and compare against it next time.

A reasonable next loop:

  1. Save the run from Step 5 as your reference.
  2. When you change something (prompt, model, chunker, top_k), re-run the same dataset and compare scores.
  3. Add new rows to the dataset as you find failure modes in staging or production.
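
The comparison in step 2 can be sketched as a function over per-row scores. This example assumes you have each run's scores available as a mapping from row identifier to judge score (for instance, copied from the results view); that input format, the function name, and the 0.05 tolerance are assumptions for illustration, while 0.85 matches the threshold used in Step 5.

```python
def compare_runs(baseline, candidate, threshold=0.85, tolerance=0.05):
    """Compare two runs of the same dataset.

    baseline, candidate: dicts mapping row id -> judge score (0.0 to 1.0).
    Only rows present in both runs are compared.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs have no rows in common")

    baseline_mean = sum(baseline[row] for row in shared) / len(shared)
    candidate_mean = sum(candidate[row] for row in shared) / len(shared)

    # Rows that passed in the reference run but fail in the new one.
    newly_failing = [
        row
        for row in shared
        if baseline[row] >= threshold and candidate[row] < threshold
    ]

    return {
        "baseline_mean": baseline_mean,
        "candidate_mean": candidate_mean,
        "regressed": (baseline_mean - candidate_mean) > tolerance,
        "newly_failing": newly_failing,
    }
```

Looking at `newly_failing` alongside the mean is deliberate: an unchanged average can hide one row that got much worse while another improved.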

For end-to-end behavior that offline tests can’t capture (tool execution, multi-turn conversations, the tail of real production inputs), see online evaluations and the When to add online evals tutorial. Online evaluations are not currently supported for agent-based configs; for agent workflows, the documented path is programmatic judge evaluation via the AI SDK.

Step 6: Track Evaluation History

View saved runs at AI > Evaluations. Toggle Group by dataset to collapse runs under each dataset name so you can see the history for answer-tests alongside any other datasets in the project. Compare pass and fail counts across runs, and distinguish saved runs (indefinite retention) from one-off runs (60-day expiry). For metric definitions, see Monitor config performance.

The Evaluations page with Group by dataset enabled, showing saved runs collapsed under each dataset name with their prompts, models, and pass/fail status.

What’s Next

  • Progressive rollouts: release your winning variation to 5% of traffic, then 25%, then 100%, watching production metrics before expanding.
  • When to add online evals: decide what to score on live production traffic once you have an offline baseline.
  • Evaluate LLM code generation with LLM-as-judge evaluators: apply the same offline-judge pattern to code-generation outputs with custom judges.
  • Proving ROI with data-driven AI agent experiments: once you have a reliable offline baseline, A/B test variations on live traffic to prove which one wins.

For a deeper look at the multi-agent RAG system this tutorial builds on, see the Agent Graphs tutorial. For the upstream system that produces the conversations being evaluated, see Build a LangGraph Multi-Agent System in 20 Minutes with LaunchDarkly AgentControl.