Traceway
API Reference

Evaluations

REST API endpoints for running evaluations, checking results, and comparing runs.

Start an eval run

POST /api/datasets/:id/eval
{
  "name": "gpt-4o-mini baseline",
  "config": {
    "connection_id": "01J...",
    "model": "gpt-4o-mini",
    "system_prompt": "You are a helpful assistant.",
    "temperature": 0.0
  },
  "scoring": "ExactMatch"
}
Field                  Required  Type    Description
name                   Yes       string  Name for this run
config.connection_id   Yes       string  Provider connection ID
config.model           Yes       string  Model identifier
config.system_prompt   No        string  System prompt prepended to every datapoint
config.temperature     No        number  Temperature (0.0 to 2.0)
scoring                Yes       string  ExactMatch, Contains, LlmJudge, or None

The eval runs asynchronously: the endpoint returns immediately with the run metadata, and results accumulate in the background.

{
  "id": "01J...",
  "dataset_id": "01J...",
  "name": "gpt-4o-mini baseline",
  "status": "pending",
  "progress": 0,
  "total": 42,
  "config": { ... },
  "scoring": "ExactMatch",
  "result_items": [],
  "created_at": "2024-06-15T12:00:00Z"
}
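The required fields and value ranges above are easy to get wrong on the client side. As a sketch (not part of Traceway; all names here are illustrative), a small helper can assemble and validate the request body before it is sent:

```python
# Hypothetical client-side helper: builds the body for
# POST /api/datasets/:id/eval using the field rules from the table above.
SCORING_METHODS = {"ExactMatch", "Contains", "LlmJudge", "None"}

def build_eval_request(name, connection_id, model,
                       scoring, system_prompt=None, temperature=None):
    """Assemble and validate an eval-run request body."""
    if scoring not in SCORING_METHODS:
        raise ValueError(f"scoring must be one of {sorted(SCORING_METHODS)}")
    config = {"connection_id": connection_id, "model": model}
    if system_prompt is not None:
        config["system_prompt"] = system_prompt
    if temperature is not None:
        # The API accepts temperatures from 0.0 to 2.0.
        if not 0.0 <= temperature <= 2.0:
            raise ValueError("temperature must be between 0.0 and 2.0")
        config["temperature"] = temperature
    return {"name": name, "config": config, "scoring": scoring}
```

Optional fields are omitted from the body rather than sent as null, so the server's defaults apply.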

List eval runs

GET /api/datasets/:id/eval

Returns all eval runs for the dataset, ordered by creation time (newest first).

Get eval results

GET /api/eval/:run_id

Returns the run metadata plus all result items:

{
  "id": "01J...",
  "dataset_id": "01J...",
  "name": "gpt-4o-mini baseline",
  "status": "completed",
  "progress": 42,
  "total": 42,
  "config": { ... },
  "scoring": "ExactMatch",
  "result_items": [
    {
      "datapoint_id": "01J...",
      "output": "Paris",
      "expected": "Paris",
      "score": 1.0,
      "latency_ms": 340,
      "tokens": 28,
      "cost": 0.00004,
      "status": "completed",
      "error": null
    }
  ],
  "created_at": "2024-06-15T12:00:00Z"
}

Run status values

Status     Description
pending    Created but not started
running    Processing datapoints
completed  All datapoints processed
failed     Unrecoverable error
cancelled  Cancelled by user

Result item status values

Status     Description
pending    Not yet processed
running    Currently processing
completed  Model call and scoring succeeded
failed     Model call or scoring failed
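Since runs are asynchronous, a client typically polls GET /api/eval/:run_id until the run reaches a terminal status. A minimal sketch (the fetch_run callable is injected so the loop can be exercised without a live server; completed, failed, and cancelled are the terminal run statuses listed above):

```python
import time

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}

def wait_for_run(fetch_run, interval_s=2.0, sleep=time.sleep):
    """Poll until the run reaches a terminal status; return the final payload."""
    while True:
        run = fetch_run()  # e.g. a wrapper around GET /api/eval/:run_id
        if run["status"] in TERMINAL_STATUSES:
            return run
        sleep(interval_s)
```

The progress and total fields in each response can drive a progress bar between polls.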

Compare eval runs

GET /api/datasets/:id/compare?runs=run_id_1,run_id_2

Compares up to four runs side by side. All runs must belong to the same dataset.

Returns:

{
  "dataset_id": "01J...",
  "runs": [
    { "id": "01J...", "name": "baseline", "avg_score": 0.76, "total_cost": 0.042 },
    { "id": "01J...", "name": "v2 prompt", "avg_score": 0.83, "total_cost": 0.045 }
  ],
  "comparisons": [
    {
      "datapoint_id": "01J...",
      "input": { ... },
      "expected": "Paris",
      "results": [
        { "run_id": "01J...", "output": "Paris", "score": 1.0, "latency_ms": 320 },
        { "run_id": "01J...", "output": "Paris", "score": 1.0, "latency_ms": 290 }
      ]
    }
  ]
}
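One common use of this response is finding datapoints where a candidate run regressed against a baseline. A hypothetical client-side helper (field names match the response shown above; the function itself is not part of the API):

```python
# Find datapoints where candidate_run_id scored lower than baseline_run_id
# in the response from GET /api/datasets/:id/compare.
def regressions(compare_response, baseline_run_id, candidate_run_id):
    """Return datapoint IDs where the candidate scored below the baseline."""
    worse = []
    for row in compare_response["comparisons"]:
        scores = {r["run_id"]: r["score"] for r in row["results"]}
        if scores.get(candidate_run_id, 0) < scores.get(baseline_run_id, 0):
            worse.append(row["datapoint_id"])
    return worse
```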

Cancel an eval run

POST /api/eval/:run_id/cancel

Gracefully cancels a running eval. The current datapoint finishes, then the run stops. Already-completed results are preserved.

Returns 200 on success, or 400 if the run is not in the running status.

Delete an eval run

DELETE /api/eval/:run_id

Deletes the run and all its results. Returns 200 on success, 404 if not found.
