Create and test AI Config evals in the playground

Overview

This topic explains how to use the playground to create and run evaluations that measure the quality of AI Config variations before deploying them to production. The playground provides a secure sandbox where you can experiment with prompts, models, and parameters. You can view model outputs, attach evaluation criteria to assess model quality, and even use a separate Large Language Model (LLM) to automatically score or analyze completions according to a rubric you define.

The playground helps AI and ML teams validate quality before deploying to production. It supports fast, controlled testing of variations so you can refine prompts, models, and parameters early in the development cycle. The playground also establishes the foundation for offline evaluations, creating a clear path from experimentation to production deployment within LaunchDarkly.

The playground complements online evals. Online evals measure model performance in production using attached judges, while the playground provides a sandbox environment for pre-production testing and refinement of AI Config variations.

Who uses the playground

The playground is designed for AI developers, ML engineers, product engineers, and PMs building and shipping AI-powered products. It provides a unified environment for evaluating models, comparing configurations, and promoting the best-performing variations.

Use the playground to:

  • Create and run evaluations to test AI Config variations with real model responses.
  • Measure model quality based on evaluation criteria such as factuality, groundedness, or relevance.
  • Attach an evaluator LLM to automatically score or analyze each completion using your own rubric.
  • Adjust prompts, parameters, or variables to improve performance before release.
  • Manage and secure provider credentials in the Manage API keys section.

Each evaluation can generate multiple runs. When you change an evaluation and create a new run, previous runs remain available with their original data.

How the playground works

The playground uses the same evaluation framework as online evals but runs evaluations in a controlled sandbox. Each evaluation contains the prompt configuration, variables, model parameters, and optional evaluation criteria. When you run an evaluation, the playground records the model response, token usage, latency, and criterion scores.

Teams can define reusable evaluations that combine prompts, models, parameters, and variables or context attributes. You can run each evaluation to generate completions and view structured results, and you can attach a secondary LLM to automatically score or analyze each response.
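
For illustration, you can think of an evaluation as a single structured definition like the sketch below. The field names and values are only an example of what an evaluation bundles together; they are not LaunchDarkly's exact schema.

{
  "name": "Support answer quality",
  "provider": "openai",
  "model": "gpt-4o",
  "parameters": { "temperature": 0.2, "maxTokens": 512 },
  "messages": [
    { "role": "system", "content": "You are a concise support assistant for {{productName}}." },
    { "role": "user", "content": "How do I reset my password?" }
  ],
  "variables": { "productName": "Acme Cloud" },
  "criteria": [
    { "name": "factuality", "threshold": 0.8 },
    { "name": "relevance", "threshold": 0.7 }
  ]
}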

Data in the playground is temporary. Test data is automatically deleted after 60 days unless you save the evaluation. LaunchDarkly integrations securely store provider credentials and remove them at the end of each session.

Each playground session includes the following:

  • Evaluation setup: Prompts, parameters, variables, and provider details.
  • Run results: Model outputs, token counts, latency, and evaluation scores.
  • Isolation: Evaluations cannot modify or affect production configurations.
  • Retention: Data expires automatically after 60 days unless you save the evaluation.

When you click Save and run, LaunchDarkly securely sends your configuration to the model provider, executes the request, and returns the model output and evaluation results as a new run.

Example structured run output
{
  "accuracy": { "score": 0.9, "reason": "Accurate and complete answer." },
  "groundedness": { "score": 0.85, "reason": "Mostly supported by source context." },
  "latencyMs": 1200,
  "inputTokens": 420,
  "outputTokens": 610
}

Create and manage evaluations

You can use the playground to create, edit, and delete evaluations. Each evaluation can include messages, model parameters, criteria, and variables.

Create an evaluation

  1. Navigate to your project.
  2. In the left navigation, click Playground.
  3. Click New evaluation. The Input tab of a new evaluation form appears.

The new evaluation form.
  4. Click Untitled and enter a name for the evaluation.
  5. Select a model provider and model.
  6. Add or edit messages for the System, User, and Assistant roles. These messages define how the model interacts in a conversation:
    • System provides context or instructions that set the model’s behavior and tone.
    • User represents the input prompt or question from an end user.
    • Assistant represents the model’s response. You can include an example or leave it blank to view generated results.
  7. Attach one or more evaluation criteria. Each criterion defines a measurement, such as factuality or relevance, and includes configurable options such as threshold or control prompt.
  8. (Optional) Add variables to reuse dynamic values, such as {{productName}} or context attributes like {{ldContext.city}}. An example of variable substitution appears after these steps.
  9. (Optional) Attach a scoring LLM to automatically evaluate each output.
  10. Click Save and run. The playground creates a new run and adds an output row with model response and evaluation scores.
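
For example, with a variable and a context attribute like those in step 8, a user message template and its substituted form might look like the following. The variable names and values are illustrative only.

User message template:
  Recommend a plan for {{productName}} customers in {{ldContext.city}}.

With productName set to "Acme Cloud" and the context attribute ldContext.city set to "Oakland", the model receives:
  Recommend a plan for Acme Cloud customers in Oakland.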

Edit an evaluation

You can edit an existing evaluation at any time. Updates to prompts, parameters, or criteria apply to new runs only. Previous runs remain valid and retain their original data.

To edit an evaluation:

  1. In the Playground list, click the evaluation you want to edit.
  2. Make your changes to messages, model, parameters, or criteria.
  3. Click Save and run to generate a new run with updated evaluation data.

Delete an evaluation

To delete an evaluation:

  1. In the Playground list, find the evaluation you want to remove.
  2. Click the three-dot menu.
  3. Select Delete evaluation.
  4. Confirm deletion.

Deleting an evaluation removes its configuration and associated runs from the playground.

View evaluation runs

The Output tab of an evaluation displays all its runs.

Evaluation run results.

Each run includes:

  • Evaluation summary
  • Scores for each attached criterion
  • Input, output, and total tokens used
  • Latency

Click a run to view its details. The detail view includes the following (see the sketch after this list):

  • Raw output: The exact text or JSON object returned by the model.
  • Evaluation results: Scores and reasoning for each evaluation criterion.
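
For illustration, a run's detail view pairs the raw model output with per-criterion scores and reasoning, conceptually like the sketch below. The field names and values are illustrative, not the exact response format.

Raw output:
  "You can reset your password from Settings > Security > Reset password."

Evaluation results:
{
  "factuality": { "score": 0.9, "reason": "Steps match the expected flow." },
  "relevance": { "score": 0.95, "reason": "Directly answers the question." }
}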

Runs update automatically when new results are available. You can edit parameters, prompts, or criteria and rerun evaluations immediately.

Manage API keys

The playground uses the provider credentials stored in your LaunchDarkly project to run evaluations. You can add or update these credentials from the Manage API keys section to ensure your evaluations use the correct model access.

To manage provider API keys:

  1. In the upper right of the “Playground” page, click Manage API keys to open the Integrations page with the “AI Config Test Run” integration selected.

The Manage API keys integration page.
  2. Click Add integration.
  3. Enter a name.
  4. Select a model provider.
  5. Enter the API key for your selected provider.
  6. Read the Integration Terms and Conditions and check the box to confirm.
  7. Click Save configuration.

Only one active credential per provider is supported in each project. LaunchDarkly never stores API keys beyond the session.

Privacy

The playground may send prompts and variables to your configured model provider for evaluation. LaunchDarkly does not store or share your inputs, credentials, or outputs beyond your project.

If your organization restricts sharing personally identifiable information (PII) with external providers, ensure that your evaluation prompts and variables exclude sensitive data.

To learn more, read AI Configs and information privacy.