Datasets for offline evaluations

Overview

This topic explains how to create, manage, and use datasets for offline evaluations. Datasets define the inputs used to evaluate model behavior before release.

Offline evaluations run AI Config variations or LLM inputs against uploaded datasets and score outputs using evaluation criteria such as built-in scorers or judges defined as AI Configs. Each dataset row represents a single evaluation task.

Datasets support repeatable evaluation workflows. You can reuse the same dataset across multiple evaluation runs to compare variations and validate changes before rollout.

Prepare your dataset

To run an offline evaluation, you must provide a dataset. Offline evaluations run AI Config variations against dataset rows and score each generated output.

A dataset is a file in CSV or JSONL format. Each row represents a single evaluation task that LaunchDarkly evaluates independently during a run.

Each row can include the following fields:

  • input: The prompt or request sent to the model.
  • expected_output: The ideal or target output for the given input. Use this field to compare generated outputs against a known result.
  • context: Additional information provided alongside the input, such as retrieved documents or tool responses.
  • variables: Named values that populate placeholders in your AI Config prompt templates at runtime.

LaunchDarkly generates one model output per row and evaluates it using the configured criteria.
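The four fields above can appear together in a single JSONL row. The sketch below builds one such row; the specific values (a support-bot prompt and a `plan` variable) are illustrative, not required names.

```python
import json

# Illustrative row using all four supported fields. The "variables" values
# fill placeholders (here, {{plan}}) in the AI Config prompt template.
row = {
    "input": "Summarize the return policy for {{plan}} customers.",
    "expected_output": "Returns are accepted within 30 days.",
    "context": "Policy doc: items may be returned within 30 days of purchase.",
    "variables": {"plan": "premium"},
}

# Each JSONL line is one JSON object: one independently evaluated row.
line = json.dumps(row)
print(line)
```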

Example dataset row

{ "input": "What is the price of the iPhone 15?", "expected_output": "$799" }

Use this structure to compare generated outputs against expected results for known scenarios.
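A dataset file is just one of these objects per line. A minimal sketch of writing an uploadable JSONL file, with hypothetical rows and an assumed file name:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical rows: each pairs an input with a known expected output.
rows = [
    {"input": "What is the price of the iPhone 15?", "expected_output": "$799"},
    {"input": "What colors are available?", "expected_output": "Black, blue, and pink."},
]

# Write one JSON object per line -- the JSONL shape the upload accepts.
path = Path(tempfile.gettempdir()) / "pricing-eval.jsonl"
with path.open("w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

print(path.read_text())
```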

Upload datasets

To use a dataset in an offline evaluation, upload a CSV or JSONL file.

  1. Navigate to the Library page.
  2. Select the Datasets tab.
  3. Click Upload dataset.
  4. Drag and drop or select your dataset file.
  5. Click Save dataset.

By default, the dataset name matches the file name. You can change the dataset name before saving.

After you upload a dataset, LaunchDarkly validates and processes the file so it can be used in evaluation runs. This includes validating the file format, detecting the dataset schema, and enforcing row and size limits. LaunchDarkly also computes a dataset hash for deduplication and stores dataset metadata.

When validation completes, the dataset status updates to ready and the dataset becomes available for evaluation runs.

If validation fails, LaunchDarkly displays errors so you can correct the dataset before using it.
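You can catch many of these errors before uploading. The sketch below is a local pre-upload check, not LaunchDarkly's actual validation logic: the field names mirror the list above, but the row limit and the treatment of input as required are assumptions for illustration.

```python
import json

# Field names from the dataset schema described above.
ALLOWED_FIELDS = {"input", "expected_output", "context", "variables"}

def validate_jsonl(text: str, max_rows: int = 10_000) -> list[str]:
    """Return a list of problems found; empty means the file looks uploadable.
    The max_rows value is an assumed cap, not LaunchDarkly's actual limit."""
    errors = []
    lines = [l for l in text.splitlines() if l.strip()]
    if len(lines) > max_rows:
        errors.append(f"too many rows: {len(lines)} > {max_rows}")
    for i, line in enumerate(lines, start=1):
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"row {i}: not valid JSON")
            continue
        if "input" not in row:
            errors.append(f"row {i}: missing 'input' field")
        unknown = set(row) - ALLOWED_FIELDS
        if unknown:
            errors.append(f"row {i}: unknown fields {sorted(unknown)}")
    return errors
```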

Manage datasets

After you upload a dataset, LaunchDarkly stores it and makes it available for use in evaluation runs.

You can reuse the same dataset across multiple evaluations. Running different evaluations against one dataset lets you compare variations on the same inputs.

Datasets are stored with metadata such as file name, size, and row count. LaunchDarkly also processes datasets in a way that supports deduplication and reproducible evaluation runs.
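Content hashing is a common way to support deduplication: two uploads with identical bytes produce the same hash. The docs only say a hash is computed, so the sketch below is illustrative; the choice of SHA-256 is an assumption.

```python
import hashlib

def dataset_hash(content: bytes) -> str:
    """Hash of the raw file bytes; identical files always hash identically.
    SHA-256 here is an assumption about the algorithm used."""
    return hashlib.sha256(content).hexdigest()

a = dataset_hash(b'{"input": "hello"}\n')
b = dataset_hash(b'{"input": "hello"}\n')
c = dataset_hash(b'{"input": "goodbye"}\n')
print(a == b, a == c)  # identical content matches; different content does not
```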

Use datasets in evaluations

When you configure an offline evaluation, you select a dataset to use as input.

You can:

  • Run evaluations against all rows in a dataset
  • Select a subset of rows using sampling
  • Reuse datasets across multiple evaluation runs
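Row sampling can be pictured as picking a subset without replacement. This is a conceptual sketch, not LaunchDarkly's sampling implementation; the fixed seed simply makes the illustration reproducible.

```python
import random

def sample_rows(rows: list[dict], n: int, seed: int = 42) -> list[dict]:
    """Pick n rows without replacement. A fixed seed keeps the subset
    reproducible across calls; the seed value is arbitrary."""
    rng = random.Random(seed)
    if n >= len(rows):
        return list(rows)
    return rng.sample(rows, n)

rows = [{"input": f"question {i}"} for i in range(100)]
subset = sample_rows(rows, 10)
print(len(subset))  # 10
```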

LaunchDarkly generates one model output per row and scores it against the configured criteria. Reusing a dataset across runs lets you compare variations on identical inputs.