Traceway

Open-source observability for LLM applications. Record every call, debug failures, build test datasets from production data.

Traceway records traces and spans from your LLM application, so you can see exactly what happened on every request: which model was called, what the input was, what came back, how long it took, and how much it cost.

It runs as a single Rust binary. You can self-host it locally with SQLite, or deploy it to the cloud with Turbopuffer and Postgres.

Why Traceway

LLM applications are hard to debug. A single user request might chain multiple model calls, retrieve documents, invoke tools, and format the final response. When something goes wrong — a hallucination, a slow response, an unexpected refusal — you need to see the full picture.

Traceway gives you that picture. Every step in your pipeline becomes a span, organized into a trace. You see the exact prompts sent, the exact completions returned, the token counts, the latency, and the cost. No sampling, no aggregation — every request, recorded in full.

What you get

Traces and spans

Every LLM call, tool invocation, retrieval step, and custom operation recorded as a structured span inside a trace. Spans form a tree, so you can see parent-child relationships between steps.
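The parent-child relationship can be modeled as a simple tree keyed by span ID. A minimal sketch in Python (the field names here are illustrative, not Traceway's actual span schema):

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    # Illustrative span record; Traceway's real schema may differ.
    span_id: str
    name: str
    parent_id: Optional[str] = None
    children: list[Span] = field(default_factory=list)

def build_tree(spans: list[Span]) -> Span:
    """Link spans into a tree by parent_id and return the root span."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    return root

# One trace: a request that retrieves docs, calls a model, and uses a tool.
trace = build_tree([
    Span("1", "handle_request"),
    Span("2", "retrieve_docs", parent_id="1"),
    Span("3", "llm_call", parent_id="1"),
    Span("4", "tool:search", parent_id="3"),
])
```

Rendering this tree depth-first gives the step-by-step view the dashboard shows for a single request.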

Cost and latency tracking

Token counts and estimated costs per span, rolled up per trace. Traceway ships with a pricing table covering 50+ models across OpenAI, Anthropic, and other providers. Latency is recorded per-span with millisecond precision.
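The roll-up itself is simple arithmetic: per-span cost from token counts and per-model rates, summed over the trace. A sketch with made-up per-million-token prices (not Traceway's actual pricing table):

```python
# Illustrative per-1M-token prices in USD; not Traceway's shipped table.
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet": {"input": 3.00, "output": 15.00},
}

def span_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one span from its token counts."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def trace_cost(spans) -> float:
    """Roll span costs up to the trace level."""
    return sum(span_cost(*s) for s in spans)

# A trace with two model calls: (model, input_tokens, output_tokens)
total = trace_cost([
    ("gpt-4o", 1200, 300),
    ("claude-sonnet", 800, 200),
])
```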

Datasets

Collect input/output pairs from production spans into datasets. Use them as regression test suites. Supports two kinds of datapoints: generic key-value pairs and LLM conversation threads.
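The two datapoint kinds can be pictured as plain records (illustrative shapes only, not Traceway's actual schema):

```python
# Generic key-value datapoint: arbitrary input and expected output.
kv_datapoint = {
    "input": {"question": "What is the refund window?"},
    "output": {"answer": "30 days"},
}

# LLM conversation thread datapoint: an ordered message list.
thread_datapoint = {
    "messages": [
        {"role": "user", "content": "Summarize this support ticket."},
        {"role": "assistant", "content": "Customer requests a refund."},
    ]
}
```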

Evaluations

Run your dataset through a model configuration and score the results. Four scoring strategies: exact match, substring contains, LLM-as-judge, or no scoring (manual review). Compare multiple eval runs side-by-side to measure the impact of prompt or model changes.
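The two deterministic strategies are easy to sketch; LLM-as-judge needs a model call and "no scoring" leaves review to a human, so both are omitted here. A minimal scorer (function name and signature are illustrative, not Traceway's API):

```python
def score(strategy: str, expected: str, actual: str) -> float:
    """Score one eval result: 1.0 for a pass, 0.0 for a fail."""
    if strategy == "exact":
        # Exact match: the completion must equal the expected output.
        return 1.0 if actual == expected else 0.0
    if strategy == "contains":
        # Substring contains: the expected string must appear anywhere.
        return 1.0 if expected in actual else 0.0
    raise ValueError(f"unknown strategy: {strategy}")
```

Averaging these per-datapoint scores across a run gives a single number to compare between eval runs.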

Capture rules

Automatically save spans to a dataset when they match a filter. Set rules like "save every gpt-4o span that costs more than $0.01" with configurable sample rates. This lets you build datasets from production traffic without manual effort.
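The example rule above boils down to a predicate plus a sample rate. A sketch of that logic (rule and span shapes are illustrative, not Traceway's config format):

```python
import random

def matches(rule: dict, span: dict) -> bool:
    """Filter: does this span satisfy the capture rule?"""
    return span["model"] == rule["model"] and span["cost"] > rule["min_cost"]

def should_capture(rule: dict, span: dict, rng=random.random) -> bool:
    """Apply the filter, then down-sample by the rule's sample rate."""
    return matches(rule, span) and rng() < rule["sample_rate"]

# "Save every gpt-4o span costing more than $0.01", keeping 25% of matches.
rule = {"model": "gpt-4o", "min_cost": 0.01, "sample_rate": 0.25}
```

The injectable `rng` is just for testability; in production the sample draw would be random per span.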

Review queue

Human-in-the-loop workflow for labeling, correcting, and reviewing datapoints. Enqueue items, claim them, edit the data, and submit. Useful for building golden datasets and reviewing edge cases.

Real-time events

The dashboard updates live as spans come in via Server-Sent Events. No polling. You can also subscribe to events programmatically from your own code.
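An SSE stream is plain text: each event is a block of `data:` lines, and events are separated by a blank line. A minimal client-side parser sketch (the payload shape shown is illustrative, not Traceway's documented event format):

```python
import json

def parse_sse(stream: str) -> list[dict]:
    """Extract JSON payloads from a raw Server-Sent Events stream."""
    events = []
    for block in stream.split("\n\n"):
        for line in block.splitlines():
            if line.startswith("data: "):
                events.append(json.loads(line[len("data: "):]))
    return events

# Two events as they would arrive on the wire.
raw = (
    'data: {"type": "span.created", "span_id": "abc"}\n\n'
    'data: {"type": "trace.completed", "trace_id": "xyz"}\n\n'
)
```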

Proxy

An optional transparent HTTP proxy that sits in front of your LLM provider. Point your OpenAI base URL at the proxy and it records spans automatically — no SDK integration needed.
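Assuming the proxy mirrors OpenAI's path layout on its port (an assumption; confirm against the proxy docs), switching an app over is a one-line base-URL change. The official OpenAI SDKs read the standard `OPENAI_BASE_URL` environment variable:

```shell
# Route OpenAI traffic through the Traceway proxy instead of api.openai.com.
# Assumes the proxy is running locally on its default port, 3001.
export OPENAI_BASE_URL="http://localhost:3001/v1"
```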

Architecture

Your app  ──SDK──>  Traceway API  ──>  Storage (SQLite or Turbopuffer)
                         ▲
                         │
                   Dashboard UI
               (platform.traceway.ai)
  • API server — Rust binary, runs on port 3000. Handles trace/span ingestion, dataset CRUD, eval execution, and real-time events.
  • Proxy — Optional. Sits in front of your LLM provider (port 3001) and automatically records spans. You point your OpenAI base URL at it instead of api.openai.com.
  • Dashboard — SvelteKit SPA at platform.traceway.ai. Shows traces, spans, datasets, eval results, analytics, and settings.
  • Storage — SQLite for local dev, Turbopuffer for cloud. Auth data lives in Postgres (cloud only).
