# API Reference

## Evaluations

REST API endpoints for running evaluations, checking results, and comparing runs.
### Start an eval run

`POST /api/datasets/:id/eval`

```json
{
  "name": "gpt-4o-mini baseline",
  "config": {
    "connection_id": "01J...",
    "model": "gpt-4o-mini",
    "system_prompt": "You are a helpful assistant.",
    "temperature": 0.0
  },
  "scoring": "ExactMatch"
}
```

| Field | Required | Type | Description |
|---|---|---|---|
| `name` | Yes | string | Name for this run |
| `config.connection_id` | Yes | string | Provider connection ID |
| `config.model` | Yes | string | Model identifier |
| `config.system_prompt` | No | string | System prompt prepended to every datapoint |
| `config.temperature` | No | number | Temperature (0.0 to 2.0) |
| `scoring` | Yes | string | `ExactMatch`, `Contains`, `LlmJudge`, or `None` |
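A client might assemble the request body like this. This is a hypothetical sketch, not part of the API: `build_eval_request` is an illustrative helper, and any base URL or auth header you pair it with is your own assumption. It enforces the field rules from the table above.

```python
# Hypothetical helper: build and validate the body for POST /api/datasets/:id/eval.
from typing import Optional

VALID_SCORING = {"ExactMatch", "Contains", "LlmJudge", "None"}

def build_eval_request(name: str, connection_id: str, model: str,
                       scoring: str,
                       system_prompt: Optional[str] = None,
                       temperature: Optional[float] = None) -> dict:
    """Assemble the JSON body, applying the documented field rules."""
    if scoring not in VALID_SCORING:
        raise ValueError(f"scoring must be one of {sorted(VALID_SCORING)}")
    config = {"connection_id": connection_id, "model": model}
    if system_prompt is not None:
        config["system_prompt"] = system_prompt
    if temperature is not None:
        if not 0.0 <= temperature <= 2.0:
            raise ValueError("temperature must be between 0.0 and 2.0")
        config["temperature"] = temperature
    return {"name": name, "config": config, "scoring": scoring}
```

You would then POST the returned dict as JSON, e.g. with `requests.post(url, json=body)` against your deployment's base URL.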
The eval runs asynchronously. The response returns immediately with the run metadata:
```json
{
  "id": "01J...",
  "dataset_id": "01J...",
  "name": "gpt-4o-mini baseline",
  "status": "pending",
  "progress": 0,
  "total": 42,
  "config": { ... },
  "scoring": "ExactMatch",
  "result_items": [],
  "created_at": "2024-06-15T12:00:00Z"
}
```

### List eval runs

`GET /api/datasets/:id/eval`

Returns all eval runs for the dataset, ordered by creation time (newest first).
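Since the endpoint returns newest-first, the first element of the response array is the latest run. A small sketch (the helper name is illustrative, not part of the API) that re-sorts by `created_at` defensively rather than relying on response order:

```python
# Hypothetical helper: pick the most recent run from the list-runs response.
# ISO 8601 timestamps in the same format compare correctly as strings.
from typing import Optional

def newest_run(runs: list) -> Optional[dict]:
    """Return the run with the latest created_at, or None for an empty list."""
    if not runs:
        return None
    return max(runs, key=lambda r: r["created_at"])
```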
### Get eval results

`GET /api/eval/:run_id`

Returns the run metadata plus all result items:
```json
{
  "id": "01J...",
  "dataset_id": "01J...",
  "name": "gpt-4o-mini baseline",
  "status": "completed",
  "progress": 42,
  "total": 42,
  "config": { ... },
  "scoring": "ExactMatch",
  "result_items": [
    {
      "datapoint_id": "01J...",
      "output": "Paris",
      "expected": "Paris",
      "score": 1.0,
      "latency_ms": 340,
      "tokens": 28,
      "cost": 0.00004,
      "status": "completed",
      "error": null
    }
  ],
  "created_at": "2024-06-15T12:00:00Z"
}
```

### Run status values
| Status | Description |
|---|---|
| `pending` | Created but not started |
| `running` | Processing datapoints |
| `completed` | All datapoints processed |
| `failed` | Unrecoverable error |
| `cancelled` | Cancelled by user |
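Because runs execute asynchronously, a client typically polls `GET /api/eval/:run_id` until the run reaches one of the three terminal statuses above. A sketch, with the HTTP call injected as a callable so it works with any client; the interval and timeout defaults are arbitrary choices, not API requirements:

```python
# Hypothetical polling loop built on the documented run status values.
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}

def wait_for_run(fetch_run: Callable[[], dict],
                 interval_s: float = 2.0,
                 timeout_s: float = 600.0) -> dict:
    """Poll until the run reaches a terminal status or the timeout expires.

    fetch_run stands in for an HTTP GET of /api/eval/:run_id returning the
    parsed JSON body.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        run = fetch_run()
        if run["status"] in TERMINAL_STATUSES:
            return run
        if time.monotonic() > deadline:
            raise TimeoutError(f"run still {run['status']} after {timeout_s}s")
        time.sleep(interval_s)
```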
### Result item status values

| Status | Description |
|---|---|
| `pending` | Not yet processed |
| `running` | Currently processing |
| `completed` | Model call and scoring succeeded |
| `failed` | Model call or scoring failed |
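With these per-item statuses, a client can compute its own run summary from `result_items` (only completed items carry a usable score). A sketch using the field names from the response above; the helper itself is illustrative, not an API feature:

```python
# Hypothetical aggregation over result_items from GET /api/eval/:run_id.
from typing import Optional

def summarize(result_items: list) -> dict:
    """Average score and total cost over completed items, plus failure count."""
    done = [i for i in result_items if i["status"] == "completed"]
    failed = [i for i in result_items if i["status"] == "failed"]
    avg: Optional[float] = (
        sum(i["score"] for i in done) / len(done) if done else None
    )
    return {
        "avg_score": avg,
        "total_cost": sum(i["cost"] for i in done),
        "failed": len(failed),
    }
```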
### Compare eval runs

`GET /api/datasets/:id/compare?runs=run_id_1,run_id_2`

Compare up to 4 runs side by side. All runs must belong to the same dataset.
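Building the query string by hand looks like this. A sketch only: the base URL is an assumption, and the minimum of two run ids is our reading of "compare" rather than a documented rule; the 4-run ceiling is stated above.

```python
# Hypothetical URL builder for the compare endpoint.
def compare_url(dataset_id: str, run_ids: list,
                base: str = "https://example.invalid") -> str:
    """Join run ids into the comma-separated `runs` query parameter."""
    if not 2 <= len(run_ids) <= 4:
        raise ValueError("compare takes between 2 and 4 run ids")
    runs = ",".join(run_ids)
    return f"{base}/api/datasets/{dataset_id}/compare?runs={runs}"
```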
Returns:
```json
{
  "dataset_id": "01J...",
  "runs": [
    { "id": "01J...", "name": "baseline", "avg_score": 0.76, "total_cost": 0.042 },
    { "id": "01J...", "name": "v2 prompt", "avg_score": 0.83, "total_cost": 0.045 }
  ],
  "comparisons": [
    {
      "datapoint_id": "01J...",
      "input": { ... },
      "expected": "Paris",
      "results": [
        { "run_id": "01J...", "output": "Paris", "score": 1.0, "latency_ms": 320 },
        { "run_id": "01J...", "output": "Paris", "score": 1.0, "latency_ms": 290 }
      ]
    }
  ]
}
```

### Cancel an eval run

`POST /api/eval/:run_id/cancel`

Gracefully cancels a running eval. The current datapoint finishes, then the run stops. Already-completed results are preserved.

Returns 200 on success. Returns 400 if the run is not in `running` status.
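Since a 400 here just means the run was no longer cancellable (already finished or never started), a client can treat it as a soft failure rather than an error. A sketch with the HTTP POST injected as a callable; the helper name is illustrative:

```python
# Hypothetical wrapper interpreting the documented cancel status codes.
from typing import Callable

def cancel_run(post: Callable[[str], int], run_id: str) -> bool:
    """Return True if cancellation was accepted (200), False if the run was
    not in `running` status (400).

    `post` stands in for an authenticated HTTP client: it takes a path and
    returns the response status code.
    """
    code = post(f"/api/eval/{run_id}/cancel")
    if code == 200:
        return True
    if code == 400:
        return False
    raise RuntimeError(f"unexpected status {code}")
```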
### Delete an eval run

`DELETE /api/eval/:run_id`

Deletes the run and all its results. Returns 200 on success, 404 if not found.