
Configuration

Configure evaluation runs — models, scoring, parameters, and provider connections.

Every eval run is defined by two things: a model configuration (which model to call and how) and a scoring strategy (how to judge the outputs). This page covers all the configuration options.

Model configuration

The config object in an eval run request controls how Traceway calls the model for each datapoint.

{
  "config": {
    "connection_id": "01J...",
    "model": "gpt-4o-mini",
    "system_prompt": "You are a helpful assistant. Answer concisely.",
    "temperature": 0.0
  }
}

connection_id (required)

The ID of the provider connection to use. This determines the API endpoint and credentials.

model (required)

The model identifier passed to the provider's API. Examples:

Provider         Model examples
OpenAI           gpt-4o, gpt-4o-mini, gpt-4-turbo, o1-mini
Anthropic        claude-sonnet-4-20250514, claude-3-haiku-20240307
Custom/Ollama    llama3.1, mistral, codellama

The model must be available through the provider connection. If you specify a model the provider doesn't support, the eval will fail with an error on each datapoint.

system_prompt (optional)

A system prompt prepended to every datapoint's input. This is useful when you want to test different system prompts against the same dataset.

For LlmConversation datapoints that already include a system message, the eval config's system_prompt is prepended before the existing messages, so both are sent. If you want a datapoint's own system message to stand alone, leave system_prompt unset.

For Generic datapoints, the system prompt becomes the first message, and the datapoint's input is sent as the user message.
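
As a sketch, for an OpenAI-style chat completions provider the assembled request for a Generic datapoint would look roughly like this; the exact wire format is provider-specific, and the user message content here is illustrative:

```json
{
  "model": "gpt-4o-mini",
  "temperature": 0.0,
  "messages": [
    { "role": "system", "content": "You are a helpful assistant. Answer concisely." },
    { "role": "user", "content": "What is the capital of France?" }
  ]
}
```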

temperature (optional)

Controls the model's randomness. Range: 0.0 to 2.0. Default: provider-specific (usually 1.0).

For eval runs, set temperature to 0.0 unless you have a specific reason not to. Non-zero temperatures introduce randomness, which means running the same eval twice may produce different scores. This makes comparisons unreliable.

{
  "config": {
    "connection_id": "01J...",
    "model": "gpt-4o-mini",
    "temperature": 0.0
  }
}

Scoring strategy

The scoring field determines how Traceway scores each result. Set it at the top level of the eval run request.
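
For example, scoring sits alongside config as a sibling at the top level of the request body:

```json
{
  "name": "example run",
  "config": {
    "connection_id": "01J...",
    "model": "gpt-4o-mini"
  },
  "scoring": "ExactMatch"
}
```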

Strategy     Description                                                            Score range
ExactMatch   Output must exactly equal expected output (case-sensitive)             0.0 or 1.0
Contains     Output must contain expected output as a substring (case-insensitive)  0.0 or 1.0
LlmJudge     A second LLM call rates the output's quality                           0.0 to 1.0
None         No scoring; outputs are recorded but not judged                        null

See Scoring Strategies for detailed behavior, examples, and tradeoffs.

Choosing a strategy

  • Use ExactMatch for classification, yes/no, or structured outputs where exact precision matters.
  • Use Contains for factual Q&A where the model may include extra context around the answer.
  • Use LlmJudge for natural language tasks where multiple phrasings are valid.
  • Use None when you plan to review results manually or send them to human evaluators.

Provider connections

Provider connections store the credentials and endpoint configuration for a model provider. You create them once and reference them by ID in eval runs.

Creating a connection

Dashboard: Settings > Provider Connections > "New Connection".

API:

curl -X POST https://api.traceway.ai/api/provider-connections \
  -H "Authorization: Bearer tw_sk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "openai-prod",
    "provider": "openai",
    "api_key": "sk-proj-...",
    "base_url": "https://api.openai.com/v1",
    "default_model": "gpt-4o-mini"
  }'

Connection fields

Field           Required  Description
name            Yes       Human-readable label (e.g., "openai-prod", "anthropic-test")
provider        Yes       openai, anthropic, or custom
api_key         Yes       API key for the provider
base_url        No        Override the default API endpoint
default_model   No        Model to use when the eval config doesn't specify one

Supported providers

OpenAI — Works with the standard OpenAI API. Set provider: "openai" and provide your OpenAI API key.

Anthropic — Uses the Anthropic messages API. Set provider: "anthropic" and provide your Anthropic API key.
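
As a sketch, an Anthropic connection's request body would mirror the OpenAI example above; the name and default model here are illustrative:

```json
{
  "name": "anthropic-prod",
  "provider": "anthropic",
  "api_key": "sk-ant-...",
  "default_model": "claude-3-haiku-20240307"
}
```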

Custom (OpenAI-compatible) — For self-hosted models, Ollama, Azure OpenAI, or any endpoint that implements the OpenAI chat completions format. Set provider: "custom" and provide the base_url:

curl -X POST https://api.traceway.ai/api/provider-connections \
  -H "Authorization: Bearer tw_sk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "ollama-local",
    "provider": "custom",
    "api_key": "not-needed",
    "base_url": "http://localhost:11434/v1",
    "default_model": "llama3.1"
  }'

Security

Provider connections are stored encrypted. The API key is never returned in API responses after creation. You can list connections (which returns names and IDs but not keys) and delete them, but you cannot read back the key.

Timeout and retry behavior

Traceway applies the following defaults when calling the model for each datapoint:

Setting          Default     Description
Request timeout  60 seconds  Maximum time to wait for a model response
Retries          1           Number of retries on transient errors (429, 500, 502, 503)
Retry delay      2 seconds   Wait time between retries

If a datapoint fails after all retries, it's marked as failed with the error message. The eval run continues with the next datapoint; a single failure doesn't stop the entire run. With these defaults, the worst case for one datapoint is 60 s (first attempt) + 2 s (retry delay) + 60 s (retry) = 122 seconds.

These defaults are not currently configurable per-run. If you need different timeout behavior, use a provider connection with a custom base_url that proxies requests with your own timeout settings.

Full example

Putting it all together — a complete eval run request:

curl -X POST "https://api.traceway.ai/api/datasets/${DATASET_ID}/eval" \
  -H "Authorization: Bearer tw_sk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "gpt-4o-mini v3 prompt regression test",
    "config": {
      "connection_id": "01J...",
      "model": "gpt-4o-mini",
      "system_prompt": "You are a concise assistant. Answer factual questions in 1-2 sentences. If unsure, say so.",
      "temperature": 0.0
    },
    "scoring": "Contains"
  }'
