Introduction
What evaluations are, the end-to-end workflow, and when to use them.
Evaluations measure how well your LLM application performs on a set of test cases. You take a dataset of input/output pairs, run each input through a model configuration, and score the results against expected outputs.
This is how you answer questions like:
- "Did switching from gpt-4o to gpt-4o-mini make our answers worse?"
- "Does the new system prompt produce better responses?"
- "What percentage of our test cases does this model get right?"
The workflow
The evaluation workflow in Traceway follows five steps:
1. Collect production data
Your application sends traces to Traceway. Every LLM call, tool invocation, and custom step is recorded as a span with its full input and output.
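For illustration, recording a span directly over HTTP might look like the sketch below. The /api/traces path and payload shape here are assumptions for illustration only; in practice you would typically use a Traceway SDK, so check the tracing docs for the real ingestion API.

```bash
# Hypothetical sketch: the /api/traces endpoint and span payload shape
# are assumptions, not the documented ingestion API.
curl -X POST https://api.traceway.ai/api/traces \
  -H "Authorization: Bearer tw_sk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "spans": [{
      "name": "chat_completion",
      "type": "llm",
      "input": {"messages": [{"role": "user", "content": "What is our refund policy?"}]},
      "output": {"content": "Refunds are available within 30 days of purchase."}
    }]
  }'
```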
2. Build a dataset
Browse your traces in the dashboard. When you find an interesting span — a good response, a bad one, an edge case — export it to a dataset. The span's input becomes the datapoint's input, and the output becomes the expected output.
You can also import data from files (JSON, JSONL, CSV) or create datapoints manually via the API.
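For file imports, each datapoint pairs an input with an expected output. A JSONL file might look like the following; the exact field names are an assumption, so adjust them to the import format documented for your version:

```jsonl
{"input": "What is the capital of France?", "expected_output": "Paris"}
{"input": "What is 2 + 2?", "expected_output": "4"}
{"input": "Who wrote Moby-Dick?", "expected_output": "Herman Melville"}
```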
3. Set up a provider connection
Before running an eval, you need credentials for the model you want to test. Create a provider connection in the dashboard (Settings > Provider Connections) or via the API:
```bash
curl -X POST https://api.traceway.ai/api/provider-connections \
  -H "Authorization: Bearer tw_sk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "openai-prod",
    "provider": "openai",
    "api_key": "sk-...",
    "default_model": "gpt-4o-mini"
  }'
```

Provider connections support OpenAI, Anthropic, and any OpenAI-compatible endpoint.
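For an OpenAI-compatible endpoint, such as a local vLLM or Ollama server, you would point the connection at your own base URL. The base_url field in this sketch is an assumption modeled on the payload above; check the connection schema for your version:

```bash
# Hypothetical sketch: base_url is an assumed field name; the other
# fields mirror the provider connection payload shown above.
curl -X POST https://api.traceway.ai/api/provider-connections \
  -H "Authorization: Bearer tw_sk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "local-vllm",
    "provider": "openai",
    "api_key": "not-needed",
    "base_url": "http://localhost:8000/v1",
    "default_model": "llama-3.1-8b-instruct"
  }'
```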
4. Run an evaluation
Start an eval run on your dataset. You specify the provider connection, model, optional system prompt, temperature, and scoring strategy.
curl -X POST "https://api.traceway.ai/api/datasets/${DATASET_ID}/eval" \
-H "Authorization: Bearer tw_sk_..." \
-H "Content-Type: application/json" \
-d '{
"name": "gpt-4o-mini baseline",
"config": {
"connection_id": "CONNECTION_ID",
"model": "gpt-4o-mini",
"system_prompt": "Answer concisely.",
"temperature": 0.0
},
"scoring": "ExactMatch"
}'The eval runs asynchronously in the background. Traceway iterates through each datapoint, calls the model, scores the result, and emits progress events via SSE.
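To follow progress from the command line, you can subscribe to the event stream with curl. The events path below is an assumption; the run ID would come from the response to the eval request:

```bash
# Hypothetical sketch: the /events path is an assumption.
# -N disables output buffering so events print as they arrive.
curl -N "https://api.traceway.ai/api/eval-runs/${RUN_ID}/events" \
  -H "Authorization: Bearer tw_sk_..." \
  -H "Accept: text/event-stream"
```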
5. Compare results
Run the same dataset with different configurations — different models, prompts, or temperatures — and compare the results side by side. Traceway shows you the output and score for each datapoint across all runs.
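You can also pull run results programmatically and diff the scores yourself. The listing endpoint and response fields in this sketch are assumptions:

```bash
# Hypothetical sketch: the eval-runs listing path and the name/score
# fields are assumptions; this pulls per-run scores for comparison.
curl -s "https://api.traceway.ai/api/datasets/${DATASET_ID}/eval-runs" \
  -H "Authorization: Bearer tw_sk_..." | jq '.runs[] | {name, score}'
```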
When to use evaluations
Regression testing — Before deploying a prompt change, run your golden dataset against the new prompt and compare with the previous run. If scores drop, the change isn't safe to ship.
Model comparison — Evaluate whether a cheaper model (e.g., gpt-4o-mini vs. gpt-4o) produces acceptable results for your use case.
Prompt iteration — Test multiple system prompt variations against the same dataset to find the one that scores best.
Quality monitoring — Run periodic evals against a growing dataset built from production capture rules to track quality over time.
Eval run lifecycle
Each eval run has a status:
| Status | Description |
|---|---|
| pending | Created but not yet started |
| running | Actively processing datapoints |
| completed | All datapoints processed |
| failed | An unrecoverable error occurred |
| cancelled | Cancelled by the user |
While a run is in running status, the progress and total fields tell you how many datapoints have been processed. The dashboard shows a live progress bar.
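If you prefer polling over SSE, a small loop against the run endpoint works too. The GET path in this sketch is an assumption; the status, progress, and total fields match the lifecycle described above:

```bash
# Hypothetical polling sketch: the GET /api/eval-runs/{id} path is an
# assumption; status, progress, and total are the fields described above.
while true; do
  run=$(curl -s "https://api.traceway.ai/api/eval-runs/${RUN_ID}" \
    -H "Authorization: Bearer tw_sk_...")
  status=$(echo "$run" | jq -r '.status')
  progress=$(echo "$run" | jq -r '.progress')
  total=$(echo "$run" | jq -r '.total')
  echo "${status}: ${progress}/${total}"
  case "$status" in
    completed|failed|cancelled) break ;;
  esac
  sleep 5
done
```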