Offline evaluations in AI Configs
Overview
This topic explains how to use offline evaluations to validate AI Config variations before releasing them to production.
Offline evaluations provide a repeatable workflow for testing prompt and model changes using datasets with known inputs and expected outputs. These evaluations help you make informed decisions before changes impact end users.
By running variations against the same dataset with the same evaluation criteria, you can compare performance, detect regressions, and validate improvements with confidence.
Offline evaluations help you:
- Compare prompt and model variations using consistent inputs
- Identify regressions before changes reach production
- Measure quality using standardized criteria and judges
- Review aggregate scores and row-level results to understand performance
- Decide whether a variation is ready for rollout
Offline evaluations run on an uploaded dataset. Each row in the dataset represents a single evaluation task, and the system generates and scores one output per row. Datasets can include expected outputs, variables, and metadata, which let you test specific inputs against known values.
Using the same dataset across runs ensures consistent comparisons. This allows you to identify meaningful improvements and avoid introducing regressions.
By reusing datasets, evaluation criteria, and custom judges, teams can apply consistent quality standards across projects. Offline evaluations focus on pre-production validation and complement online evaluations, which measure model performance on live user traffic after release.
Offline and online evaluations
Offline and online evaluations serve different purposes.
Offline evaluations:
- Run before deployment
- Use datasets with known inputs and expected outputs
- Evaluate variations in a controlled environment
- Help validate changes before rollout
Online evaluations:
- Run in production on live user traffic
- Evaluate responses using attached judges
- Score responses continuously
- Help monitor performance after release
For more information about online evaluations, read Online evaluations in AI Configs.
How offline evaluations work
Offline evaluations use AI Config variations to generate and evaluate outputs for a dataset. Each row in the dataset represents a single evaluation task. This approach lets you evaluate many inputs at once and understand how a variation performs across a consistent set of scenarios.
For each input, LaunchDarkly:
- Generates a model output
- Evaluates the output using selected criteria or judges
- Records structured results
As rows are processed, LaunchDarkly aggregates metrics across the dataset, providing a complete view of performance that includes both overall metrics and individual results.
Example evaluation result
The following example shows the result for a single dataset row, including a score and reasoning returned by the evaluation criteria or judge.
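A result for a single row might resemble the following. The field names and values here are illustrative, not an exact LaunchDarkly response shape:

```json
{
  "row_id": 42,
  "input": "Summarize the refund policy in one sentence.",
  "output": "Customers can request a full refund within 30 days of purchase.",
  "expected_output": "Refunds are available within 30 days.",
  "score": 0.9,
  "reasoning": "The output matches the expected answer and adds no unsupported claims."
}
```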
In this example, the score indicates how well the output meets the selected criterion, and the reasoning explains why the score was assigned. You can use these results to compare variations, identify low-scoring outputs, and understand where a model may need improvement.
Configure an offline evaluation
Configure an offline evaluation to define the dataset, AI Config variation, model, evaluation criteria, and execution settings such as sampling and runtime limits.
Use this configuration to control what inputs are tested and how outputs are evaluated. For example, you can use sampling to test a subset of inputs, run a preview to validate your setup, and define thresholds that reflect your quality standards.
To access and configure an offline evaluation:
- Navigate to your project.
- In the left navigation, expand AI, then select Playground.
- Create or open an evaluation in the Playground. For detailed steps, read Create and manage evaluations.
- Configure the evaluation, including selecting a dataset and AI Config variation.
- Configure evaluation criteria and thresholds in the Acceptance criteria panel.
- Choose how dataset rows are sampled using the row selection controls.
- (Optional) Run a preview on a subset of rows.
After configuration, start the evaluation run. The run captures your setup so you can repeat the evaluation and compare results over time.
Dataset requirements
Offline evaluations use uploaded datasets as input for AI Config variations. To learn how to prepare and upload datasets, read Datasets in AI Configs.
Datasets must be in CSV or JSONL format and can include expected output, variables, and metadata. Use datasets to validate outputs by comparing them to expected results.
The dataset schema includes the following fields:
- input: Prompt or request used to generate a model response. Accepts a string or JSON object.
- expected_output: Optional. Ideal output used for comparison or scoring. Accepts a string or JSON object.
- variables: Optional. Named placeholders used in prompt templates. These are substituted into message templates, for example {{variable_name}}.
- metadata: Optional. JSON object of arbitrary key-value data attached to a row for tracking, filtering, or reporting. Metadata is stored alongside results but is not used in generation or evaluation. Example: {"source": "production", "category": "factual"}.
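For example, rows in a JSONL dataset using these fields might look like the following. The values are illustrative:

```json
{"input": "What is the capital of France?", "expected_output": "Paris", "metadata": {"source": "production", "category": "factual"}}
{"input": "Summarize this ticket for a {{audience}} reader.", "expected_output": "A short, plain-language summary.", "variables": {"audience": "non-technical"}}
```

Each line is one evaluation task; the second row shows a variable that would be substituted into the prompt template at generation time.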
When a dataset is uploaded, LaunchDarkly validates the file format, detects the schema, and enforces row and size limits. It also computes a dataset hash for deduplication and stores dataset metadata.
If validation fails, LaunchDarkly returns errors so you can correct issues before running the evaluation.
Run and review evaluations
Run offline evaluations to understand how your AI Config variations perform across a dataset before deciding whether to release changes.
When you start an evaluation run, LaunchDarkly processes dataset rows asynchronously. Each row is processed independently to generate outputs, apply evaluation criteria, and record results.
Results are available during and after execution. As rows are processed, progress updates continuously so you can monitor evaluation status.
Complete evaluation results include:
- Status counts for rows
- Aggregate scores per criterion
- Latency and token usage metrics
- Row-level outputs and scores
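Taken together, a completed run's summary might resemble the following. This is an illustrative shape only, with hypothetical criterion names, not an exact LaunchDarkly response:

```json
{
  "status": {"completed": 198, "failed": 2},
  "scores": {"relevance": 0.87, "accuracy": 0.91},
  "latency_ms": {"p50": 820, "p95": 2100},
  "tokens": {"input_total": 45210, "output_total": 38644}
}
```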
Use these results to inspect outputs, understand how scores are assigned, and compare variations. Because the criteria are consistent across runs, you can measure quality reliably, identify regressions, and decide whether a variation is ready for rollout.
Results can be exported as CSV or JSONL for further analysis.