Configuration
Configure evaluation runs — models, scoring, parameters, and provider connections.
Every eval run is defined by two things: a model configuration (which model to call and how) and a scoring strategy (how to judge the outputs). This page covers all the configuration options.
Model configuration
The config object in an eval run request controls how Traceway calls the model for each datapoint.
{
"config": {
"connection_id": "01J...",
"model": "gpt-4o-mini",
"system_prompt": "You are a helpful assistant. Answer concisely.",
"temperature": 0.0
}
}

connection_id (required)
The ID of the provider connection to use. This determines the API endpoint and credentials.
model (required)
The model identifier passed to the provider's API. Examples:
| Provider | Model examples |
|---|---|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-4-turbo, o1-mini |
| Anthropic | claude-sonnet-4-20250514, claude-3-haiku-20240307 |
| Custom/Ollama | llama3.1, mistral, codellama |
The model must be available through the provider connection. If you specify a model the provider doesn't support, the eval will fail with an error on each datapoint.
system_prompt (optional)
A system prompt prepended to every datapoint's input. This is useful when you want to test different system prompts against the same dataset.
For LlmConversation datapoints that already include a system message, the eval config's system_prompt is inserted before the existing messages, so both system messages end up in the request. If you want the datapoint's own system message to stand alone, leave system_prompt unset.
For Generic datapoints, the system prompt becomes the first message, and the datapoint's input is sent as the user message.
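To make this concrete, here is a rough sketch of the provider request built from a Generic datapoint whose input is "What is the capital of France?", assuming the OpenAI chat completions format. The exact payload Traceway constructs may differ:

# Illustrative only; the exact payload Traceway constructs may differ
{
  "model": "gpt-4o-mini",
  "temperature": 0.0,
  "messages": [
    {"role": "system", "content": "You are a helpful assistant. Answer concisely."},
    {"role": "user", "content": "What is the capital of France?"}
  ]
}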
temperature (optional)
Controls the model's randomness. Range: 0.0 to 2.0. Default: provider-specific (usually 1.0).
For eval runs, set temperature to 0.0 unless you have a specific reason not to. Non-zero temperatures introduce randomness, which means running the same eval twice may produce different scores. This makes comparisons unreliable.
{
"config": {
"connection_id": "01J...",
"model": "gpt-4o-mini",
"temperature": 0.0
}
}

Scoring strategy
The scoring field determines how Traceway scores each result. Set it at the top level of the eval run request.
| Strategy | Description | Score range |
|---|---|---|
| ExactMatch | Output must exactly equal expected output (case-sensitive) | 0.0 or 1.0 |
| Contains | Output must contain expected output as substring (case-insensitive) | 0.0 or 1.0 |
| LlmJudge | A second LLM call rates the output's quality | 0.0 to 1.0 |
| None | No scoring — outputs are recorded but not judged | null |
See Scoring Strategies for detailed behavior, examples, and tradeoffs.
Choosing a strategy
- Use ExactMatch for classification, yes/no, or structured outputs where exact precision matters.
- Use Contains for factual Q&A where the model may include extra context around the answer.
- Use LlmJudge for natural language tasks where multiple phrasings are valid.
- Use None when you plan to review results manually or send them to human evaluators.
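For instance, a classification eval using ExactMatch needs only the model config and the strategy name. This mirrors the full example at the end of this page, trimmed to the essentials:

curl -X POST "https://api.traceway.ai/api/datasets/${DATASET_ID}/eval" \
  -H "Authorization: Bearer tw_sk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "intent classifier exact-match check",
    "config": {
      "connection_id": "01J...",
      "model": "gpt-4o-mini",
      "temperature": 0.0
    },
    "scoring": "ExactMatch"
  }'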
Provider connections
Provider connections store the credentials and endpoint configuration for a model provider. You create them once and reference them by ID in eval runs.
Creating a connection
Dashboard: Settings > Provider Connections > "New Connection".
API:
curl -X POST https://api.traceway.ai/api/provider-connections \
-H "Authorization: Bearer tw_sk_..." \
-H "Content-Type: application/json" \
-d '{
"name": "openai-prod",
"provider": "openai",
"api_key": "sk-proj-...",
"base_url": "https://api.openai.com/v1",
"default_model": "gpt-4o-mini"
}'

Connection fields
| Field | Required | Description |
|---|---|---|
| name | Yes | Human-readable label (e.g., "openai-prod", "anthropic-test") |
| provider | Yes | openai, anthropic, or custom |
| api_key | Yes | API key for the provider |
| base_url | No | Override the default API endpoint |
| default_model | No | Model to use when the eval config doesn't specify one |
Supported providers
OpenAI — Works with the standard OpenAI API. Set provider: "openai" and provide your OpenAI API key.
Anthropic — Uses the Anthropic messages API. Set provider: "anthropic" and provide your Anthropic API key.
Custom (OpenAI-compatible) — For self-hosted models, Ollama, Azure OpenAI, or any endpoint that implements the OpenAI chat completions format. Set provider: "custom" and provide the base_url:
curl -X POST https://api.traceway.ai/api/provider-connections \
-H "Authorization: Bearer tw_sk_..." \
-H "Content-Type: application/json" \
-d '{
"name": "ollama-local",
"provider": "custom",
"api_key": "not-needed",
"base_url": "http://localhost:11434/v1",
"default_model": "llama3.1"
}'

Security
Provider connections are stored encrypted. The API key is never returned in API responses after creation. You can list connections (which returns names and IDs but not keys) and delete them, but you cannot read back the key.
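Assuming the list and delete operations live on the same route used for creation above (the exact paths are not confirmed here), they would look like:

# List connections: returns names and IDs, never the api_key
curl https://api.traceway.ai/api/provider-connections \
  -H "Authorization: Bearer tw_sk_..."

# Delete a connection by its ID (assumed route)
curl -X DELETE https://api.traceway.ai/api/provider-connections/01J... \
  -H "Authorization: Bearer tw_sk_..."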
Timeout and retry behavior
Traceway applies the following defaults when calling the model for each datapoint:
| Setting | Default | Description |
|---|---|---|
| Request timeout | 60 seconds | Maximum time to wait for a model response |
| Retries | 1 | Number of retries on transient errors (429, 500, 502, 503) |
| Retry delay | 2 seconds | Wait time between retries |
If a datapoint fails after all retries, it's marked as failed with the error message. The eval run continues with the next datapoint — a single failure doesn't stop the entire run.
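As a sketch of what that looks like, a failed result entry might resemble the following. The field names here are hypothetical, not the documented response schema:

# Hypothetical shape; check the actual eval results response for real field names
{
  "datapoint_id": "01J...",
  "status": "failed",
  "error": "request timed out after 60 seconds",
  "score": null
}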
These defaults are not currently configurable per-run. If you need different timeout behavior, use a provider connection with a custom base_url that proxies requests with your own timeout settings.
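For example, you could register a custom connection whose base_url points at a proxy you operate that applies its own timeouts before forwarding to the real provider. The connection name and proxy URL below are hypothetical:

curl -X POST https://api.traceway.ai/api/provider-connections \
  -H "Authorization: Bearer tw_sk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "openai-via-proxy",
    "provider": "custom",
    "api_key": "sk-proj-...",
    "base_url": "https://llm-proxy.internal.example.com/v1",
    "default_model": "gpt-4o-mini"
  }'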
Full example
Putting it all together — a complete eval run request:
curl -X POST "https://api.traceway.ai/api/datasets/${DATASET_ID}/eval" \
-H "Authorization: Bearer tw_sk_..." \
-H "Content-Type: application/json" \
-d '{
"name": "gpt-4o-mini v3 prompt regression test",
"config": {
"connection_id": "01J...",
"model": "gpt-4o-mini",
"system_prompt": "You are a concise assistant. Answer factual questions in 1-2 sentences. If unsure, say so.",
"temperature": 0.0
},
"scoring": "Contains"
}'