Offline Evaluation of RAG-Grounded Answers in AgentControl
Published April 15, 2026
This tutorial shows you how to run an offline LLM evaluation on the RAG-grounded support agent you built in the Agent Graphs tutorial, using AgentControl configs, the Datasets feature, and built-in LLM-as-a-judge scoring. You’ll build a RAG-grounded test dataset, run it through the Playground with a cross-family judge, and learn how to read each failing row as a dataset issue, an agent issue, or judge calibration noise.
Here’s how it works. The LaunchDarkly Playground evaluates a single model call against a prompt and dataset you configure. By pre-computing your RAG retrieval offline and baking the chunks directly into each dataset row, you turn that call into a high-value generation test: the model in the Playground receives the same documentation context it would in production, so the eval measures how well your agent reasons over real grounded input.
Covers:
Does not cover:
About the branch and the Umbra knowledge base. The feature/offline-evals branch builds on the same Agent Graphs tutorial codebase and the routing, tool, and graph work done in earlier branches — none of that goes away. What this branch adds is a more realistic RAG assessment target: Umbra, a fictional serverless-functions product with an invented knowledge base (refund windows, deployment regions, function timeout limits, rate-limit tiers, and so on). Because Umbra doesn’t exist outside this tutorial, the model under test has no pre-training knowledge to fall back on — a correct answer has to come from the retrieved chunks, which is the only way to honestly measure whether your RAG pipeline is doing its job. The branch also ships a pre-built RAG-grounded test dataset (datasets/answer-tests.csv) and a helper script that regenerates it from your vector store.
Start the API and UI in two terminals:
Open http://localhost:8501 and ask a question grounded in the Umbra docs (refund policy, deployment regions, function timeout). The agent pulls answers from the knowledge base.
Open datasets/answer-tests.csv. Every row has three fields:
input bundles documentation chunks and the question into a single structured prompt, separated by --- dividers. The chunks were retrieved from your production vector store ahead of time by tools/build_rag_dataset.py, so the model in the Playground sees the same grounding the production agent would, even though the Playground never executes your retrieval tools.
expected_output is the correct answer, written by a human who read the source docs.
original_question is a plain-text copy of the question so you can scan the dataset without parsing the bundled prompt. No judge uses this field.
Regenerate the dataset when your knowledge base changes:
For the full reference on dataset format and limits, see Datasets for offline evaluations.
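To make the three-field row format concrete, here is a minimal Python sketch of how one row could be assembled. It is not the contents of tools/build_rag_dataset.py: the function names (bundle_input, build_row, rows_to_csv), the "Question:" label, and the exact placement of the --- dividers are assumptions for illustration. Only the three column names come from the dataset described above.

```python
import csv
import io

# Assumed layout: each chunk and the question are separated by a --- line.
DIVIDER = "\n---\n"


def bundle_input(chunks, question):
    """Join retrieved chunks and the question into one prompt string."""
    return DIVIDER.join([*chunks, f"Question: {question}"])


def build_row(chunks, question, expected_output):
    """Return one dataset row with the three fields the tutorial describes."""
    return {
        "input": bundle_input(chunks, question),
        "expected_output": expected_output,
        "original_question": question,
    }


def rows_to_csv(rows):
    """Serialize rows to CSV text; the csv module quotes embedded newlines."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["input", "expected_output", "original_question"]
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder text only; these are not real Umbra documentation chunks.
example_row = build_row(
    chunks=["<chunk 1 text>", "<chunk 2 text>"],
    question="Can I get a refund on bandwidth overages?",
    expected_output="<human-written answer>",
)
csv_text = rows_to_csv([example_row])
```

Keeping original_question as its own column means you can scan or filter the file without splitting the bundled prompt back apart.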
Never upload real customer tickets, PII, secrets, or credentials. Replace anything sensitive with synthetic placeholders before upload. See the Playground privacy section for what gets forwarded to model providers.
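As a rough illustration of swapping sensitive values for synthetic placeholders, the sketch below runs a few regular expressions over a text field before it goes into a dataset row. The patterns, the placeholder tokens, and the assumption that API keys start with sk- are all hypothetical and deliberately narrow. Treat this as a starting point for a reviewed redaction step, not as complete PII detection.

```python
import re

# Illustrative patterns only; a real redaction pass needs a broader, reviewed set.
PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "<EMAIL>"),
    # Assumes keys look like "sk-..." (hypothetical format).
    (re.compile(r"\bsk-[A-Za-z0-9_-]{8,}"), "<API_KEY>"),
    # Matches phone numbers written as 3-3-4 digit groups.
    (re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"), "<PHONE>"),
]


def redact(text):
    """Replace every match of each pattern with its synthetic placeholder."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


sample = "Contact jane.doe@example.com, key sk-abc123DEF456ghi, phone 555-123-4567"
redacted = redact(sample)
```

Running redact over the input and expected_output columns before upload catches the obvious cases; a manual read of the final file is still the safer check.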
Navigate to AI > Library in LaunchDarkly, select the Datasets tab, and click Upload dataset. Upload datasets/answer-tests.csv and name it answer-tests.
The Playground calls model providers directly, so it needs API keys for both the model running your agent and the model running your judge. These keys live in LaunchDarkly’s “AgentControl Test Run” integration, not in your config.
See the Playground reference doc for the canonical instructions. API keys are stored per-session, so you may need to re-paste them when you return.
From the Datasets list, click into answer-tests to open it in a Playground bound to that dataset.
support-agent instructions verbatim from the config. Do not edit or simplify them.
0.85. Accuracy scores whether the response correctly addresses the input question, which fits grounded natural-language answers.
Run the eval.
The example run above had 18 passes and 2 failures. When a row fails, the failure comes from one of three places, and each one sends you in a different direction:
top_k, a reranker, or a different chunker, or verify the answer is indexed at all.
Sort by score. For each failing row, open the bundled chunks in the input field and ask: was the right answer in there? Yes → fix the prompt or model. No → rebuild the dataset.
Row 11: “What integrations are available?” (chunks missed the answer). The expected output mentioned monitoring integrations (Datadog, Sentry, LogRocket), but the retrieved chunks only covered databases, storage, and billing. The model correctly listed what it had and said “the documentation does not provide additional information regarding more integrations”, which is the correct behavior for an ungrounded claim. Fix: higher top_k or a reranker in build_rag_dataset.py.
Row 12: “Can I get a refund on bandwidth overages?” (judge calibration). The model correctly said bandwidth overages are non-refundable, citing the docs, but omitted a secondary “Review your Usage Dashboard” recommendation from the expected output. Semantically right, lexically short one clause. Fix: lower the threshold or trim the expected output.
Two failures, two different fixes. Without reading the per-row results you’d conflate them and spend time tightening the model when the actual problem lives in the retriever or the dataset.
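The "was the right answer in the chunks?" check can be partly automated as a first pass. The sketch below is one possible approach, not a feature of the Playground: you hand-pick a few key terms from the expected output and test whether they appear in the chunk portion of the bundled input. The function names, the assumption that the question is the last --- separated segment, and the placeholder chunk text are all hypothetical.

```python
def context_part(bundled_input, divider="---"):
    """Return the chunk portion of a bundled prompt.

    Assumes the question is the final divider-separated segment.
    """
    segments = [segment.strip() for segment in bundled_input.split(divider)]
    return "\n".join(segments[:-1])


def triage(bundled_input, key_terms):
    """Classify a failing row by whether hand-picked answer terms are in the chunks."""
    context = context_part(bundled_input).lower()
    missing = [term for term in key_terms if term.lower() not in context]
    if missing:
        # The chunks never contained the answer: a retrieval or dataset issue.
        return "dataset", missing
    # The answer was available: look at the prompt, the model, or the judge.
    return "agent_or_judge", []


# Placeholder chunks standing in for a row like Row 11; not real Umbra docs.
row_input = "\n---\n".join([
    "Placeholder chunk about database integrations.",
    "Placeholder chunk about storage and billing.",
    "Question: What integrations are available?",
])
verdict, missing = triage(row_input, ["Datadog", "Sentry", "LogRocket"])
```

A substring check cannot separate an agent issue from judge calibration noise, so rows that come back as agent_or_judge still need a human read of the response against the expected output.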
This tutorial walked you through one run. In practice, a single eval isn’t where offline evaluation earns its keep. The real payoff comes from re-running the same dataset against a new prompt, a new model, or a fresh RAG chunker and comparing scores to your last known-good run. A small prompt edit that quietly drops your Accuracy from 0.83 to 0.71 is exactly the kind of regression this pattern is meant to catch, but only if you save the run and compare against it next time.
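If you have per-row scores from two runs in hand, the comparison can be sketched in a few lines of Python. This assumes you have already collected the scores into dictionaries keyed by a row identifier; how you obtain them is outside this sketch. The compare_runs name, the tolerance value, and the sample scores are invented for illustration.

```python
from statistics import mean


def compare_runs(baseline, candidate, tolerance=0.05):
    """Compare two runs given as {row_id: score} dicts with scores in [0, 1].

    A row counts as a regression when its score drops by more than `tolerance`.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs have no rows in common")
    regressions = {
        row_id: (baseline[row_id], candidate[row_id])
        for row_id in shared
        if baseline[row_id] - candidate[row_id] > tolerance
    }
    return {
        "baseline_mean": mean(baseline[row_id] for row_id in shared),
        "candidate_mean": mean(candidate[row_id] for row_id in shared),
        "regressions": regressions,
    }


# Invented scores for illustration only.
last_good = {"row-1": 0.9, "row-2": 0.8}
new_run = {"row-1": 0.9, "row-2": 0.5}
report = compare_runs(last_good, new_run)
```

Reporting the regressed rows alongside the means matters because an unchanged average can hide one row that improved and another that broke.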
A reasonable next loop:
top_k), re-run the same dataset and compare scores.
For end-to-end behavior that offline tests can’t capture (tool execution, multi-turn conversations, the tail of real production inputs), see online evaluations and the When to add online evals tutorial. Online evaluations are not currently supported for agent-based configs; for agent workflows, the documented path is programmatic judge evaluation via the AI SDK.
View saved runs at AI > Evaluations. Toggle Group by dataset to collapse runs under each dataset name so you can see the history for answer-tests alongside any other datasets in the project. Compare pass and fail counts across runs, and distinguish saved runs (indefinite retention) from one-off runs (60-day expiry). For metric definitions, see Monitor config performance.
For a deeper look at the multi-agent RAG system this tutorial builds on, see the Agent Graphs tutorial. For the upstream system that produces the conversations being evaluated, see Build a LangGraph Multi-Agent system in 20 Minutes.