Synthetic Text-to-SQL Data Generation with 100% SQL Validity Using Claude Haiku
From production SQL failure traces to 500 labeled, execution-validated eval samples — using only Claude Haiku.
Alex Lyzhov
Highlights
- We amplify a cheap LLM to produce validated and diverse Text2SQL
- Generations are more complex and valid than seed examples
- Quality through agentic planning, revisions, and diversity control
- Execution-validated across Postgres/MySQL/SQLite
- Dataset published on Huggingface
Background
Text-to-SQL fails quietly. A query comes back, the system looks healthy, but the SQL references a table that doesn’t exist or returns results that are subtly wrong. By the time someone catches it, the problem has been there a while.
Building eval coverage for this is harder than it looks. You usually have some signal — production traces where the SQL was wrong or borderline — but not nearly enough labeled pairs to test systematically. Writing hundreds of schema-query-explanation triples by hand is slow. Most approaches to generating them produce SQL that’s too simple or fails validation across dialects.
We used DataFramer to go from a small set of production signal traces to 500 labeled, execution-validated eval pairs using only Claude Haiku. The dataset is on HuggingFace. This post explains what we did.

The starting point: production signal traces
In practice, you’d start with traces flagged through review: SQL outputs that failed, looked wrong, or represented patterns you want test coverage for. For this experiment we used 100 samples from Gretel’s synthetic_text_to_sql dataset as a stand-in for that kind of production signal. When we validated the seed data, we found the same issues you’d encounter in real traces: queries referencing undefined tables, invalid schema formats.
In the table below, a schema or query is counted as “valid” if it executes successfully in at least one of the three databases. Some samples in the seed data work in MySQL but fail in SQLite, others work in PostgreSQL but not MySQL, and so on. This inconsistency across dialects makes the data less useful for training dialect-agnostic models. Our generated data is valid across all three.
| Dataset | Schema Valid | Query Valid |
|---|---|---|
| Gretel seeds | 97% | 84% |
| Generated | 100% | 100% |
From objectives to specification
Here’s what we wrote in DataFramer’s Generation Objectives field:
Generally include advanced SQL features with somewhat greater prevalence than in seed data.
Prompts should have much greater diversity across styles: direct questions, imperatives,
conversational, terse/keyword-style, complex multi-part ("Find X and also show Y grouped by Z"),
informal/ambiguous with vague terms like "recent"/"top"/"best".
Increase frequency of multiple joins, CTEs, set operations a bit. Conditional on query using
joins, at least 10% of such samples should have left/right/cross joins.
More samples with 3-5 tables compared with seed data, also have a bit more columns on average.
We don't need data definition type queries, but include all other types.
A bit more foreign/primary keys in tables.
Six bullet points of natural language. That was the entire input.
From this, DataFramer’s spec generation automatically created a detailed specification with 27 data properties, complete with probability distributions and conditional dependencies. It analyzed the seed data and turned our objectives into a spec with explicit property distributions.
For example, the objective “Conditional on query using joins, at least 10% should have left/right/cross joins” became a conditional distribution where Join Type depends on Number of Tables:
| Tables | inner join | left join | right join | cross join |
|---|---|---|---|---|
| 2 | 70% | 18% | 6% | 3% |
| 3 | 55% | 25% | 8% | 4% |
| 4 | 45% | 30% | 10% | 5% |
| 5 | 40% | 30% | 12% | 6% |
You can see the full specifications here:
We refined the generated spec after reviewing it, removing dialect-specific features like RIGHT JOIN and TOP clause for cross-database compatibility and tuning some distributions. But the heavy lifting of identifying 27 relevant properties and their relationships was done automatically.
Quality through agentic pipelines, not bigger models
We generated this entire dataset using only Claude Haiku, currently $1 per million input tokens, the cheapest model in Anthropic’s lineup. The quality comes from how the pipeline is structured, not from using a more capable model.
DataFramer uses a multi-stage pipeline rather than a single model call:
-
Outlining and modular generation: A planning stage creates a blueprint for each sample, then content is generated in sections, with each part informed by what came before. This ensures the schema, query, and explanation are internally consistent.
-
Revision cycles: Specialized agents review generated content for coherence, consistency, and conformance to the sampled properties. If a sample claims to use “aggregation” complexity but the query has no aggregation functions, it gets revised. Samples that remain inconsistent, invalid, or nonconformant after revision are filtered out and retried.
-
Programmatic validation: For SQL specifically, every schema and query is executed against SQLite, PostgreSQL, and MySQL. Invalid SQL doesn’t make it into the final dataset.
Each stage has a focused role, and the pipeline catches errors that any single stage might introduce. The result across three dimensions:
- Diversity through spec generation: controlled variation across prompt styles, SQL operations, and domain coverage
- Quality through revision cycles: samples that match the properties specified and stay internally consistent
- Validity through programmatic validation: every schema and query executes successfully
Diversity by design
The seed data we started with was heavily skewed toward simple queries:
| Metric | Seeds | Generated | Improvement |
|---|---|---|---|
| Samples with 3-5 tables | 2.5% | 36% | 14x |
| Non-inner joins (LEFT/CROSS) | 0% | 28% | - |
| Uses primary keys | 7% | 36% | 5x |
Each sample is generated by first sampling from the property distributions in the spec, then generating content that matches those attributes.
The spec controls not just SQL complexity but also prompt style diversity. Our generated dataset includes:
- Direct questions: “What is the average delivery time for orders in Q3?”
- Imperative commands: “List all products with inventory below threshold”
- Conversational requests: “Can you show me which customers haven’t ordered recently?”
- Terse keyword-style: “total sales by region 2024”
- Complex multi-part: “Find the top 5 suppliers by volume and also show their average lead time grouped by category”
- Informal/ambiguous: “what are the best performing stores lately”
This matters for eval datasets meant to surface failures in real production queries, not just textbook SQL. You describe what you want in your objectives and the spec handles the diversity mechanically, without manual engineering.
How this compares to other approaches
NVIDIA’s NeMo Data Designer is another synthetic data generation framework. There are real differences worth understanding.
In Data Designer, “seed data” is an additional column that provides context for generation, essentially few-shot examples. In DataFramer, seeds define the distribution you want to match. The system analyzes your seeds to infer properties, relationships, and statistical patterns, then generates new data that follows those patterns while introducing controlled variation.
Data Designer requires you to manually configure each column with its own prompt template and generation logic. This is an iterative process: you write a prompt, test it, see what comes out, adjust the prompt, repeat. Often you don’t know what to write to get quality data until you’ve experimented extensively. DataFramer’s spec generation bypasses this entirely. You provide objectives in natural language, and the system generates a complete specification automatically.
Data Designer is a lower-level toolkit where you build and configure the generation pipeline manually. DataFramer generates the spec and runs the pipeline for you from natural language objectives. For more on the design choices behind that, see why DataFramer.
Guaranteed validity through execution
Every schema and query in our generated dataset has been executed against SQLite, PostgreSQL, and MySQL. That means full execution, not syntax checking alone.
DataFramer’s built-in SQL verifier supports three validation levels: syntax-only, syntax plus schema execution, and full execution of both schema and query. We used the full validation level. Queries that reference non-existent tables, use dialect-specific functions, or have any other execution errors are caught and either revised or filtered.
DataFramer’s generation agents can invoke tools during generation: SQL validators, code executors, format checkers, and custom validation functions.
Try it yourself
The dataset is available on HuggingFace
To generate your own text-to-SQL dataset or any other structured data with DataFramer:
-
Upload seed data: As few as 2 samples works. You can also go fully seedless: write objectives, generate a spec from scratch, and produce data without any seeds at all.
-
Write objectives: Describe what you want in natural language. Be specific about diversity requirements, complexity targets, or constraints.
-
Generate and refine spec: Review the auto-generated specification. Adjust distributions, add or remove properties, encode domain-specific constraints.
-
Run generation: Choose your model, enable revisions for quality, and set your sample count. For SQL datasets, schema and query columns are auto-detected and validation runs automatically.
-
Validate and iterate: Review the evaluation metrics, check sample quality, refine the spec if needed.
The complete workflow guide covers each step in detail.
How a 3B Model Outperformed GPT-4o on Hallucination Detection: The Training, Evals, Validation, and Benchmark Synthetic Data Pipeline Behind HDM-2
A 3B open-source model beat GPT-4o at hallucination detection, built on purpose-built training and eval data.
Alex Lyzhov Long-Form Synthetic Data Generation: Same LLM, Dramatically Different Results
Same LLM, dramatically different results. DataFramer vs raw Claude on 50K-token document generation.
Alex Lyzhov Get started
Ready to build better AI with better data?
The real bottleneck in AI isn't intelligence. It's the data you can't generate, can't share, or can't trust.