Offline evaluations in AI Configs
Overview
This topic explains how to use offline evaluations to validate AI Config variations before releasing them to production.
Offline evaluations provide a repeatable workflow for testing prompt and model changes using datasets with known inputs and expected outputs. These evaluations help you make informed decisions before changes impact end users.
By running variations against the same dataset with the same evaluation criteria, you can compare performance, detect regressions, and validate improvements with confidence.
Offline evaluations help you:
- Compare prompt and model variations using consistent inputs
- Identify regressions before changes reach production
- Measure quality using standardized criteria and judges
- Review aggregate scores and row-level results to understand performance
- Decide whether a variation is ready for rollout
Offline evaluations run on an uploaded dataset. Each row in the dataset represents a single evaluation task, and the system generates and scores one output per row. Datasets can include expected outputs, variables, and metadata, which let you test specific inputs against known values.
Using the same dataset across runs ensures consistent comparisons. This allows you to identify meaningful improvements and avoid introducing regressions.
By reusing datasets, evaluation criteria, and custom judges, teams can apply consistent quality standards across projects. Offline evaluations focus on pre-production validation and complement online evaluations, which measure model performance on live user traffic after release.
Offline and online evaluations
Offline and online evaluations serve different purposes.
Offline evaluations:
- Run before deployment
- Use datasets with known inputs and expected outputs
- Evaluate variations in a controlled environment
- Help validate changes before rollout
Online evaluations:
- Run in production on live user traffic
- Evaluate responses using attached judges
- Score responses continuously
- Help monitor performance after release
For more information about online evaluations, read Online evaluations in AI Configs.
How offline evaluations work
Offline evaluations use AI Config variations to generate and evaluate outputs for a dataset. Each row in the dataset represents a single evaluation task. This approach lets you evaluate many inputs at once and understand how a variation performs across a consistent set of scenarios.
For each input, LaunchDarkly:
- Generates a model output
- Evaluates the output using selected criteria or judges
- Records structured results
As rows are processed, LaunchDarkly aggregates metrics across the dataset, providing a complete view of performance that includes both overall metrics and individual results.
Example evaluation result
The following example shows the result for a single dataset row, including a score and reasoning returned by the evaluation criteria or judge.
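A result for a single row might resemble the following. The field names and values here are illustrative, not an exact LaunchDarkly response shape:

```json
{
  "row_id": 42,
  "input": "Summarize the refund policy in one sentence.",
  "output": "Customers can request a full refund within 30 days of purchase.",
  "expected_output": "Refunds are available within 30 days.",
  "score": 0.9,
  "reasoning": "The output matches the expected answer and adds no unsupported claims."
}
```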
In this example, the score indicates how well the output meets the selected criterion, and the reasoning explains why the score was assigned. You can use these results to compare variations, identify low-scoring outputs, and understand where a model may need improvement.
Configure an offline evaluation
Configure an offline evaluation to define the dataset, AI Config variation, model, evaluation criteria, and execution settings such as sampling and runtime limits.
Use this configuration to control what inputs are tested and how outputs are evaluated. For example, you can use sampling to test a subset of inputs, run a preview to validate your setup, and define thresholds that reflect your quality standards.
To access and configure an offline evaluation:
- Navigate to your project.
- In the left navigation, expand AI, then select Playground.
- Create or open an evaluation in the Playground. For detailed steps, read Create and manage evaluations.
- Configure the evaluation, including selecting a dataset and AI Config variation.
- Configure evaluation criteria and thresholds in the Acceptance criteria panel.
- Choose how dataset rows are sampled using the row selection controls.
- (Optional) Run a preview on a subset of rows.
After configuration, start the evaluation run. The run captures your setup so you can repeat the evaluation and compare results over time.
Dataset requirements
Offline evaluations use uploaded datasets as input for AI Config variations. To learn how to prepare and upload datasets, read Datasets in AI Configs.
Datasets must be in CSV or JSONL format and can include expected output, variables, and metadata. Use datasets to validate outputs by comparing them to expected results.
The dataset schema includes the following fields:
- input: Prompt or request used to generate a model response. Accepts a string or JSON object.
- expected_output: Optional. Ideal output used for comparison or scoring. Accepts a string or JSON object.
- variables: Optional. Named placeholders used in prompt templates. These are substituted into message templates, for example {{variable_name}}.
- metadata: Optional. JSON object of arbitrary key-value data attached to a row for tracking, filtering, or reporting. Metadata is stored alongside results but is not used in generation or evaluation. Example: {"source": "production", "category": "factual"}.
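For example, rows in a JSONL dataset using these fields might look like the following. The values are illustrative:

```json
{"input": "What is the capital of France?", "expected_output": "Paris", "metadata": {"source": "production", "category": "factual"}}
{"input": "Summarize this ticket for a {{audience}} reader.", "expected_output": "A short, plain-language summary.", "variables": {"audience": "non-technical"}}
```

Each line is one evaluation task; the second row shows a variable that would be substituted into the prompt template at generation time.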
When a dataset is uploaded, LaunchDarkly validates the file format, detects the schema, and enforces row and size limits. It also computes a dataset hash for deduplication and stores dataset metadata.
If validation fails, LaunchDarkly returns errors so you can correct issues before running the evaluation.
Run and review evaluations
Run offline evaluations to understand how your AI Config variations perform across a dataset before deciding whether to release changes.
When you start an evaluation run, LaunchDarkly processes dataset rows asynchronously. Each row is processed independently to generate outputs, apply evaluation criteria, and record results.
Results are available during and after execution. As rows are processed, progress updates continuously so you can monitor evaluation status.
Complete evaluation results include:
- Status counts for rows
- Aggregate scores per criterion
- Latency and token usage metrics
- Row-level outputs and scores
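Taken together, a completed run's summary might resemble the following. This is an illustrative shape only, with hypothetical criterion names, not an exact LaunchDarkly response:

```json
{
  "status": {"completed": 198, "failed": 2},
  "scores": {"relevance": 0.87, "accuracy": 0.91},
  "latency_ms": {"p50": 820, "p95": 2100},
  "tokens": {"input_total": 45210, "output_total": 38644}
}
```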
Use these results to inspect outputs, understand how scores are assigned, and compare variations. Because the criteria are consistent across runs, you can measure quality reliably, identify regressions, and decide whether a variation is ready for rollout.
Results can be exported as CSV or JSONL for further analysis.