Evaluate with what you have.
Generate what you're missing.

Two ways to evaluate in DataFramer: score your real production traces with calibrated judges, or generate synthetic test cases from those same traces and evaluate them.

Start free (no card) Talk to us

Generic benchmarks miss your actual failure modes.

Most eval datasets were built for general capabilities. Your failures are specific to your prompts, your retrieval, your users. A benchmark that doesn't reflect that tells you almost nothing.

Off-the-shelf judges haven't seen your domain.

We found that uncalibrated judges were consistently wrong on a whole class of domain-specific failures. The teams relying on them didn't catch it until a human reviewer flagged it weeks later.

Real traces can't cover failures you haven't seen yet.

Production data is useful for known failures. But if a failure mode hasn't shown up yet, you have no test cases for it. Waiting for it to appear in production is not a testing strategy.

Evals rarely connect to the fixes they're supposed to validate.

Customers told us evals and fixes happened in separate workflows. By the time a fix shipped, nobody could say whether it had actually addressed the failure that triggered the eval.

Evaluate with real traces. Generate what production can't give you.

Path 1

Evaluate with your traces

Pull traces from production and score them against rubrics your team defined, using judges calibrated to your human reviewers. Measure agreement. Build regression suites from cases that mattered. Before a fix ships, test it against the real failures that caused the problem.

Connect Langfuse or LangSmith, or send traces via the SDK
Score traces against rubrics using calibrated judges
Measure judge-to-human agreement before trusting at scale
Build regression datasets from expert-reviewed traces
Test every fix against the failures that triggered it
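
To make the last point concrete, here is a minimal sketch of a regression check in plain Python. It does not use the DataFramer SDK; the case fields (`id`, `input`, `expected`) and the `candidate` and `check` callables are hypothetical names chosen for illustration.

```python
from typing import Any, Callable, Dict, List

Case = Dict[str, Any]


def run_regression(
    cases: List[Case],
    candidate: Callable[[Any], Any],
    check: Callable[[Case, Any], bool],
) -> List[str]:
    """Run a candidate fix over past failure cases.

    Returns the ids of cases that still fail, so an empty list
    means the fix addressed every failure in the suite.
    """
    still_failing = []
    for case in cases:
        output = candidate(case["input"])  # hypothetical: the fixed system
        if not check(case, output):        # hypothetical: rubric or exact match
            still_failing.append(case["id"])
    return still_failing


# Toy usage: the "fix" uppercases text, and the check is an exact match.
cases = [
    {"id": "t1", "input": "a", "expected": "A"},
    {"id": "t2", "input": "b", "expected": "X"},
]
failing = run_regression(cases, str.upper, lambda c, out: out == c["expected"])
```

In practice the check would be a rubric score from a calibrated judge rather than an exact match.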

Path 2

Generate what production can't give you

Pick real traces as the starting point. DataFramer generates synthetic test cases that reflect your actual domain. Add them to eval runs, test against known failure patterns, and cover edge cases before they show up in production.

Pick production traces as seeds from the Traces table
Generate synthetic samples in JSONL, CSV, XLSX, PDF, and more
Cover failure patterns your real data doesn't include yet
Run evaluations against generated datasets on demand
Combine with judge runs to get agreement scores on synthetic cases

From traces to tested fixes, end to end.

01

Bring in your traces

Connect Langfuse or LangSmith, or send traces directly via the DataFramer SDK. User feedback, corrections, and ratings can come in alongside traces.

Ingest

02

Pick traces as the starting point

From the Traces table, choose the rows that best represent your domain. Add them to a seed dataset in one step.

Seed datasets

03

Describe what you want to generate

A spec captures the structure, properties, and distributions of the dataset. You can define it yourself or let DataFramer infer it from your example data.

Specs
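
As a rough illustration of what "properties and distributions" can mean, the sketch below samples attribute combinations from a weighted spec. The dictionary layout and the property names are assumptions made for this example, not DataFramer's actual spec format.

```python
import random
from typing import Dict, List

# Hypothetical spec: each property maps its allowed values to a weight.
SPEC = {
    "properties": {
        "user_intent": {"refund": 0.5, "billing_question": 0.3, "account_locked": 0.2},
        "tone": {"neutral": 0.6, "frustrated": 0.4},
    }
}


def sample_attributes(spec: dict, n: int, seed: int = 0) -> List[Dict[str, str]]:
    """Draw n attribute combinations according to the spec's weights."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    rows = []
    for _ in range(n):
        row = {}
        for name, dist in spec["properties"].items():
            values = list(dist)
            weights = list(dist.values())
            row[name] = rng.choices(values, weights=weights, k=1)[0]
        rows.append(row)
    return rows


rows = sample_attributes(SPEC, 5)
```

Each sampled row would then act as a set of constraints for one generated test case.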

04

Generate synthetic test cases

Run generation from the spec. Output formats include JSONL, CSV, XLSX, PDF, and multi-folder samples. Generated data can cover failure patterns your real traces don't include yet.

Runs

05

Assemble an eval dataset for your judge

Combine real traces with generated ones, or use expert-reviewed traces only. Each dataset ties specific rubrics to the traces being scored.

Judge datasets
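
A minimal sketch of this step, assuming traces are plain dictionaries: it merges real and generated traces, tags each row with its source, and attaches the rubrics to score against. The field names (`source`, `rubric_ids`) are hypothetical and do not describe DataFramer's schema.

```python
import json
from typing import Any, Dict, Iterable, List

Trace = Dict[str, Any]


def assemble_eval_dataset(
    real_traces: Iterable[Trace],
    generated_traces: Iterable[Trace],
    rubric_ids: Iterable[str],
) -> List[Trace]:
    """Combine real and synthetic traces into one list of eval rows."""
    rubrics = list(rubric_ids)
    rows = []
    for source, traces in (("real", real_traces), ("synthetic", generated_traces)):
        for trace in traces:
            # Copy the trace so the caller's data is not mutated.
            rows.append({**trace, "source": source, "rubric_ids": list(rubrics)})
    return rows


def to_jsonl(rows: List[Trace]) -> str:
    """Serialize rows as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


dataset = assemble_eval_dataset(
    [{"id": "r1", "output": "Refund issued."}],
    [{"id": "s1", "output": "Your account is locked."}],
    ["accuracy", "tone"],
)
jsonl = to_jsonl(dataset)
```

Keeping the source tag on each row makes it possible to compare judge agreement on real and synthetic cases separately.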

06

Score outputs and measure agreement

Pick your model and judge prompt. DataFramer scores each trace and shows how closely the judge agrees with your human reviewers. Check this before relying on the judge at scale.

Judge runs
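
One common way to quantify judge-to-human agreement is raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. The sketch below is a plain-Python illustration under that assumption; the source does not say which agreement metric DataFramer reports.

```python
from collections import Counter
from typing import List, Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: Sequence[str], human: Sequence[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    labels = set(judge) | set(human)
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


judge_labels: List[str] = ["pass", "pass", "fail", "fail"]
human_labels: List[str] = ["pass", "fail", "fail", "fail"]
rate = agreement_rate(judge_labels, human_labels)
kappa = cohens_kappa(judge_labels, human_labels)
```

In this toy example the judge matches the human on three of four items, but kappa is lower than the raw rate because some of that agreement would occur by chance.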

Start with your real traces.
Build from there.
