Running Evaluations

How to start an eval run, configure the model, and check progress.

Prerequisites

Before starting an eval, you need:

  1. A dataset with at least one datapoint. See Datasets for how to create one.
  2. A provider connection — credentials for the model you want to evaluate.

Creating a provider connection

Provider connections store the API credentials and default configuration for a model provider. Create one in the dashboard under Settings > Provider Connections, or via the API:

curl -X POST https://api.traceway.ai/api/provider-connections \
  -H "Authorization: Bearer tw_sk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "openai-prod",
    "provider": "openai",
    "api_key": "sk-proj-...",
    "base_url": "https://api.openai.com/v1",
    "default_model": "gpt-4o-mini"
  }'
| Field | Required | Description |
| --- | --- | --- |
| name | Yes | A human-readable label |
| provider | Yes | openai, anthropic, or custom |
| api_key | Yes | The API key for this provider |
| base_url | No | Override the API base URL (for custom OpenAI-compatible endpoints) |
| default_model | No | Default model to use when not specified in the eval config |

Provider connections are stored encrypted. The API key is never returned in API responses after creation.

Starting a run

Start an eval run against a dataset:

curl -X POST "https://api.traceway.ai/api/datasets/${DATASET_ID}/eval" \
  -H "Authorization: Bearer tw_sk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "gpt-4o-mini v2 prompt",
    "config": {
      "connection_id": "01J...",
      "model": "gpt-4o-mini",
      "system_prompt": "You are a helpful assistant. Answer concisely and accurately.",
      "temperature": 0.0
    },
    "scoring": "ExactMatch"
  }'

Configuration fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Name for this run (e.g., "gpt-4o-mini baseline") |
| config.connection_id | string | Yes | ID of the provider connection to use |
| config.model | string | Yes | Model identifier (e.g., gpt-4o-mini, claude-3-haiku) |
| config.system_prompt | string | No | System prompt prepended to every datapoint |
| config.temperature | number | No | Model temperature (0.0 to 2.0; the default is provider-specific) |
| scoring | string | Yes | Scoring strategy: ExactMatch, Contains, LlmJudge, or None |

Using the SDK

// There isn't a direct SDK method for evals yet — use the REST API.
// The SDK methods for datasets and datapoints are documented in the SDK reference.
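
Until an SDK method exists, the REST endpoint can be called from any HTTP client. A sketch using Python's standard library, mirroring the curl example above (the API key and IDs are placeholders, copied as-is from the example):

```python
import json
import urllib.request

API_KEY = "tw_sk_..."   # placeholder
DATASET_ID = "01J..."   # placeholder

payload = {
    "name": "gpt-4o-mini v2 prompt",
    "config": {
        "connection_id": "01J...",
        "model": "gpt-4o-mini",
        "system_prompt": "You are a helpful assistant. Answer concisely and accurately.",
        "temperature": 0.0,
    },
    "scoring": "ExactMatch",
}

req = urllib.request.Request(
    f"https://api.traceway.ai/api/datasets/{DATASET_ID}/eval",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# run = json.load(urllib.request.urlopen(req))  # send with real credentials
```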

How the eval executes

When you start a run, Traceway:

  1. Creates an eval run record with pending status.
  2. Loads all datapoints from the dataset.
  3. Transitions the run to running status.
  4. For each datapoint:
     a. Constructs the prompt. For LlmConversation datapoints, uses the messages array. For Generic datapoints, sends the input as a user message.
     b. If a system_prompt is configured, prepends it.
     c. Calls the model via the provider connection's API.
     d. Records the model's output.
     e. Scores the output against the expected value using the configured scoring strategy.
     f. Saves the result and increments the progress counter.
     g. Emits an SSE event with the updated progress.
  5. After all datapoints are processed, transitions the run to completed.
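
The per-datapoint loop above can be sketched as follows. Everything here is illustrative, not Traceway internals: call_model and score_fn are stand-in callables, and the datapoint dictionaries are simplified.

```python
def run_eval(datapoints, config, call_model, score_fn):
    """Illustrative version of steps 2-5: build prompt, call, score, record.
    A failed datapoint is recorded and the run continues, as documented."""
    results, progress = [], 0
    for dp in datapoints:
        # Step 4a: LlmConversation datapoints carry messages; Generic ones carry input.
        messages = dp.get("messages") or [{"role": "user", "content": dp["input"]}]
        # Step 4b: prepend the configured system prompt, if any.
        if config.get("system_prompt"):
            messages = [{"role": "system", "content": config["system_prompt"]}] + messages
        try:
            output = call_model(config["model"], messages)       # step 4c
            results.append({
                "datapoint_id": dp["id"],
                "output": output,                                # step 4d
                "score": score_fn(output, dp["expected"]),       # step 4e
                "status": "completed",
                "error": None,
            })
        except Exception as exc:
            # A model API error fails this datapoint only, not the run.
            results.append({"datapoint_id": dp["id"], "output": None,
                            "score": None, "status": "failed", "error": str(exc)})
        progress += 1  # step 4f-g: in Traceway this also emits an SSE event
    return {"status": "completed", "progress": progress, "results": results}
```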

Each datapoint produces an EvalResult:

{
  "datapoint_id": "01J...",
  "output": "Paris",
  "expected": "Paris",
  "score": 1.0,
  "latency_ms": 340,
  "tokens": 28,
  "cost": 0.00004,
  "status": "completed",
  "error": null
}

Individual datapoint results have their own status. A datapoint can fail (e.g., model API error) without failing the entire run. The run continues with the next datapoint.

| Result status | Description |
| --- | --- |
| pending | Not yet processed |
| running | Currently being processed |
| completed | Model call and scoring succeeded |
| failed | Model call or scoring failed (error message in the error field) |
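
Because failed datapoints are recorded alongside completed ones, a run-level summary has to decide how to treat them. A sketch that averages scores over completed items only (this aggregation rule is an assumption; the dashboard may summarize differently):

```python
def summarize(result_items):
    """Mean score over completed items; failed items are counted separately."""
    completed = [r for r in result_items if r["status"] == "completed"]
    failed = [r for r in result_items if r["status"] == "failed"]
    mean = sum(r["score"] for r in completed) / len(completed) if completed else None
    return {"mean_score": mean, "completed": len(completed), "failed": len(failed)}

summarize([
    {"status": "completed", "score": 1.0},
    {"status": "completed", "score": 0.0},
    {"status": "failed", "score": None},
])
# {'mean_score': 0.5, 'completed': 2, 'failed': 1}
```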

Checking progress

Poll the run to check progress:

curl "https://api.traceway.ai/api/eval/${RUN_ID}" \
  -H "Authorization: Bearer tw_sk_..."

Response:

{
  "id": "01J...",
  "dataset_id": "01J...",
  "name": "gpt-4o-mini v2 prompt",
  "status": "running",
  "progress": 23,
  "total": 42,
  "config": { ... },
  "scoring": "ExactMatch",
  "result_items": [
    {
      "datapoint_id": "01J...",
      "output": "Paris",
      "score": 1.0,
      "latency_ms": 340,
      "status": "completed"
    }
  ],
  "created_at": "2024-06-15T12:00:00Z"
}

The result_items array grows as datapoints are processed. In the dashboard, this updates live via SSE.
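
Outside the dashboard, a simple polling loop works. A sketch with an injectable fetcher (fetch_run is any callable that returns the run JSON as a dict; treating "failed" and "cancelled" as terminal statuses is an assumption, since only pending, running, and completed appear above):

```python
import time

def wait_for_run(fetch_run, interval_s=2.0, sleep=time.sleep):
    """Poll the run (e.g., GET /api/eval/{run_id}) until a terminal status."""
    terminal = {"completed", "failed", "cancelled"}  # assumed terminal set
    while True:
        run = fetch_run()
        print(f"{run['progress']}/{run['total']} ({run['status']})")
        if run["status"] in terminal:
            return run
        sleep(interval_s)
```

Injecting fetch_run and sleep keeps the loop testable without hitting the API.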

Cancelling a run

Cancel a running eval:

curl -X POST "https://api.traceway.ai/api/eval/${RUN_ID}/cancel" \
  -H "Authorization: Bearer tw_sk_..."

Cancellation is graceful — the currently-processing datapoint finishes, then the run stops. Already-completed results are preserved.
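
The graceful-cancel behavior can be pictured as a flag the worker checks only between datapoints (purely illustrative, not Traceway's actual implementation):

```python
import threading

def process_run(datapoints, handle_one, cancelled: threading.Event):
    """Finish the in-flight datapoint, then stop if cancellation was requested."""
    done = 0
    for dp in datapoints:
        handle_one(dp)          # the current datapoint always finishes
        done += 1
        if cancelled.is_set():  # checked only between datapoints
            return {"status": "cancelled", "processed": done}
    return {"status": "completed", "processed": done}
```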

Deleting a run

curl -X DELETE "https://api.traceway.ai/api/eval/${RUN_ID}" \
  -H "Authorization: Bearer tw_sk_..."

This deletes the run and all its results.

Listing runs for a dataset

curl "https://api.traceway.ai/api/datasets/${DATASET_ID}/eval" \
  -H "Authorization: Bearer tw_sk_..."

Returns all eval runs for the dataset, ordered by creation time (newest first).
