Generation of Synthetic Text2SQL LLM data with 100% validity using Dataframer

Two Minute Overview

Highlights

We amplify a cheap LLM to produce validated, diverse Text2SQL data
Generations are more complex and valid than seed examples
Quality through agentic planning, revisions, and diversity control
Execution-validated across Postgres/MySQL/SQLite
Dataset published on HuggingFace

Background

Text-to-SQL systems translate natural language questions into executable database queries, enabling non-technical users to interact with databases directly. Building evaluation and training data for these systems requires SQL that actually runs, schemas that make sense, and enough diversity to cover real-world complexity.

We used Dataframer to generate 500 text-to-SQL samples that achieve 100% valid SQL across PostgreSQL, MySQL, and SQLite, using only Claude Haiku, one of the most cost-effective models available. The samples are high-quality because they undergo extensive automatic revisions and filtering. The resulting dataset is significantly more diverse and complex than the seed data we started with, and we’re releasing it on HuggingFace.

This post walks through how we did it and what makes our approach different.

SQL Tools in Dataframer

The starting point: Gretel’s synthetic_text_to_sql

We started with 100 samples from Gretel’s synthetic_text_to_sql dataset, a widely used text-to-SQL training set with over 100,000 examples across 100 domains. It’s a solid dataset, but when we validated its schemas and queries, we found issues such as queries referencing undefined tables and invalid schema formats.

In the table below, a schema or query is counted as “valid” if it executes successfully in at least one of the three databases. Some samples in the seed data work in MySQL but fail in SQLite, others work in PostgreSQL but not MySQL, and so on. This inconsistency across dialects makes the data less useful for training dialect-agnostic models. Our generated data is valid across all three.

Dataset	Schema Valid	Query Valid
Gretel seeds	97%	84%
Generated	100%	100%
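
To make the two validity rules concrete, here is a minimal sketch of how per-dialect results could be rolled up into "valid in at least one database" (the rule used in the table) and "valid in all three" (the stricter property our generated data meets). The data layout, one dict of booleans per sample, and all function names are assumptions for illustration, not Dataframer's internal format.

```python
from typing import Callable, Dict, List

# Dialects named in the post; the dict-of-booleans layout is hypothetical.
DIALECTS = ("postgresql", "mysql", "sqlite")


def valid_in_any(results: Dict[str, bool]) -> bool:
    """True if the sample executed in at least one dialect."""
    return any(results.get(d, False) for d in DIALECTS)


def valid_in_all(results: Dict[str, bool]) -> bool:
    """True only if the sample executed in every dialect."""
    return all(results.get(d, False) for d in DIALECTS)


def validity_rate(
    samples: List[Dict[str, bool]],
    rule: Callable[[Dict[str, bool]], bool],
) -> float:
    """Fraction of samples that satisfy the given validity rule."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if rule(s)) / len(samples)


# Toy per-dialect results, made up for this example.
toy_results = [
    {"postgresql": True, "mysql": True, "sqlite": True},
    {"postgresql": True, "mysql": False, "sqlite": False},
    {"postgresql": False, "mysql": False, "sqlite": False},
]

any_rate = validity_rate(toy_results, valid_in_any)
all_rate = validity_rate(toy_results, valid_in_all)
```

The gap between the two rates is the share of samples that run in some dialects but not others, which is exactly the inconsistency described above.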

From objectives to specification

Here’s what we wrote in Dataframer’s Generation Objectives field:

Generally include advanced SQL features with somewhat greater prevalence than in seed data.

Prompts should have much greater diversity across styles: direct questions, imperatives,
conversational, terse/keyword-style, complex multi-part ("Find X and also show Y grouped by Z"),
informal/ambiguous with vague terms like "recent"/"top"/"best".

Increase frequency of multiple joins, CTEs, set operations a bit. Conditional on query using
joins, at least 10% of such samples should have left/right/cross joins.

More samples with 3-5 tables compared with seed data, also have a bit more columns on average.

We don't need data definition type queries, but include all other types.

A bit more foreign/primary keys in tables.

That’s it. Six bullet points of natural language describing what we wanted.

From this, Dataframer’s spec generation automatically created a detailed specification with 27 data properties, complete with probability distributions and conditional dependencies. The system analyzed our seed data, inferred the axes of variation, and translated our objectives into a structured blueprint.

For example, the objective “Conditional on query using joins, at least 10% should have left/right/cross joins” became a conditional distribution where Join Type depends on Number of Tables:

Tables	inner join	left join	right join	cross join
2	70%	18%	6%	3%
3	55%	25%	8%	4%
4	45%	30%	10%	5%
5	40%	30%	12%	6%
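
As a rough illustration of what a conditional distribution like this means in practice, the sketch below draws a join type given a table count. The weights are copied from the table above. Each row sums to less than 100%, and the post does not say where the remaining mass goes, so the sketch assigns it to a hypothetical "other" bucket; the function name and that bucket are assumptions, not part of the actual spec.

```python
import random

# Weights (in percent) copied from the table above.
JOIN_TYPE_GIVEN_TABLES = {
    2: {"inner join": 70, "left join": 18, "right join": 6, "cross join": 3},
    3: {"inner join": 55, "left join": 25, "right join": 8, "cross join": 4},
    4: {"inner join": 45, "left join": 30, "right join": 10, "cross join": 5},
    5: {"inner join": 40, "left join": 30, "right join": 12, "cross join": 6},
}


def sample_join_type(num_tables: int, rng: random.Random) -> str:
    """Draw a join type conditioned on the number of tables."""
    weights = dict(JOIN_TYPE_GIVEN_TABLES[num_tables])
    # Assumption: any probability mass not listed in the table goes to "other".
    remainder = 100 - sum(weights.values())
    if remainder > 0:
        weights["other"] = remainder
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(0)  # seeded so repeated runs give the same draws
draws = [sample_join_type(4, rng) for _ in range(10)]
```

Sampling the property first and generating content to match it afterwards is what keeps the final dataset close to the intended distribution.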

You can see the full specifications here:

We refined the generated spec after reviewing it, removing dialect-specific features like RIGHT JOIN and TOP clause for cross-database compatibility and tuning some distributions. But the heavy lifting of identifying 27 relevant properties and their relationships was done automatically.

Quality through agentic pipelines, not bigger models

Here’s what surprised us most: we generated this entire dataset using only Claude Haiku, currently priced at $1 per million input tokens and $5 per million output tokens. That’s the smallest, fastest, cheapest model in Anthropic’s lineup. Yet the data quality is high: internally consistent samples, valid SQL, and proper conformance to the specified properties.

How do you get high-quality, internally-consistent SQL from a model optimized for speed over reasoning depth?

Dataframer uses a multi-stage agentic pipeline rather than relying on a single model call:

Outlining and modular generation - A planning stage creates a blueprint for each sample, then content is generated in sections, with each part informed by what came before. This ensures the schema, query, and explanation are internally consistent.
Revision cycles - Specialized agents review generated content for coherence, consistency, and conformance to the sampled properties. If a sample claims to use “aggregation” complexity but the query has no aggregation functions, it gets revised. Samples that remain inconsistent, invalid, or nonconformant after revision are filtered out and retried.
Programmatic validation - For SQL specifically, every schema and query is executed against SQLite, PostgreSQL, and MySQL. Invalid SQL doesn’t make it into the final dataset.
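
The control flow of these stages can be sketched as a small loop: draft a sample, revise it while it fails its checks, and discard and retry if it still fails. This is a simplified illustration, not Dataframer's implementation. The function names, the retry limits, and the toy stand-ins for the model calls are all assumptions.

```python
from typing import Callable, Optional

AGGREGATES = ("SUM(", "COUNT(", "AVG(", "MIN(", "MAX(")


def generate_sample(
    draft: Callable[[], dict],
    revise: Callable[[dict], dict],
    is_acceptable: Callable[[dict], bool],
    max_revisions: int = 2,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Draft, revise while failing, and retry from scratch if still failing."""
    for _ in range(max_attempts):
        sample = draft()
        ok = is_acceptable(sample)
        revisions = 0
        while not ok and revisions < max_revisions:
            sample = revise(sample)
            revisions += 1
            ok = is_acceptable(sample)
        if ok:
            return sample
        # Still failing after all revisions: filter it out and start over.
    return None


# Toy stand-ins for the model-backed stages (hypothetical).
def draft_sample() -> dict:
    # Claims "aggregation" complexity but the query has no aggregate function.
    return {"complexity": "aggregation", "sql": "SELECT region FROM sales"}


def add_aggregation(sample: dict) -> dict:
    fixed = dict(sample)
    fixed["sql"] = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    return fixed


def conforms(sample: dict) -> bool:
    if sample["complexity"] != "aggregation":
        return True
    return any(fn in sample["sql"].upper() for fn in AGGREGATES)


result = generate_sample(draft_sample, add_aggregation, conforms)
```

In a real pipeline the acceptance check would also include the programmatic SQL execution step, so a sample is returned only when it is both conformant and executable.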

This architecture means we’re not asking Haiku to do everything in one shot. Each agent has a focused role, and the pipeline catches errors that any individual stage might introduce. A model priced at $1 per million input tokens produces output quality you’d normally associate with much more expensive options.

This isn’t specific to text-to-SQL. Dataframer’s agentic pipeline is general-purpose, designed to work across a wide variety of dataset types, data formats, and domains.

The result is generated data that surpasses the original seeds in three key ways:

Diversity through spec generation - controlled variation across prompt styles, SQL operations, and domain coverage
Quality through revision cycles - internally consistent samples that conform to specified properties
Validity through programmatic validation - every schema and query executes successfully

Diversity by design

The seed data we started with was heavily skewed toward simple queries:

Metric	Seeds	Generated	Improvement
Samples with 3-5 tables	2.5%	36%	14x
Non-inner joins (LEFT/CROSS)	0%	28%	-
Uses primary keys	7%	36%	5x

This isn’t random variation. It’s controlled diversity. Each sample is generated by first sampling from the property distributions in the spec, then generating content that matches those sampled attributes.
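
Shares like the ones in the table above could be approximated with simple text heuristics. The sketch below counts CREATE TABLE statements and looks for LEFT or CROSS joins. The field names "schema" and "sql", the regexes, and the toy samples are assumptions for illustration; this is not necessarily how the reported numbers were measured.

```python
import re
from typing import Callable, Dict, List

CREATE_TABLE = re.compile(r"\bCREATE\s+TABLE\b", re.IGNORECASE)
NON_INNER_JOIN = re.compile(r"\b(?:LEFT|CROSS)\s+(?:OUTER\s+)?JOIN\b", re.IGNORECASE)


def has_3_to_5_tables(sample: Dict[str, str]) -> bool:
    """Heuristic: count CREATE TABLE statements in the schema text."""
    return 3 <= len(CREATE_TABLE.findall(sample["schema"])) <= 5


def has_non_inner_join(sample: Dict[str, str]) -> bool:
    """Heuristic: the query text contains a LEFT or CROSS join."""
    return bool(NON_INNER_JOIN.search(sample["sql"]))


def share(samples: List[Dict[str, str]], predicate: Callable[[Dict[str, str]], bool]) -> float:
    """Fraction of samples for which the predicate holds."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if predicate(s)) / len(samples)


# Toy samples, made up for this example.
toy_samples = [
    {
        "schema": "CREATE TABLE a (id INT); CREATE TABLE b (id INT); CREATE TABLE c (id INT);",
        "sql": "SELECT * FROM a LEFT JOIN b ON a.id = b.id LEFT JOIN c ON b.id = c.id",
    },
    {
        "schema": "CREATE TABLE a (id INT);",
        "sql": "SELECT id FROM a",
    },
]

multi_table_share = share(toy_samples, has_3_to_5_tables)
non_inner_share = share(toy_samples, has_non_inner_join)
```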

The spec controls not just SQL complexity but also prompt style diversity. Our generated dataset includes:

Direct questions: “What is the average delivery time for orders in Q3?”
Imperative commands: “List all products with inventory below threshold”
Conversational requests: “Can you show me which customers haven’t ordered recently?”
Terse keyword-style: “total sales by region 2024”
Complex multi-part: “Find the top 5 suppliers by volume and also show their average lead time grouped by category”
Informal/ambiguous: “what are the best performing stores lately”

This diversity matters for training robust text-to-SQL models that handle real user queries, not just well-formed textbook questions. And it’s automatically injected via the generated spec. You describe what you want in your objectives, the spec captures it as properties and distributions, and the generation pipeline produces diverse samples without manual effort.

Example Generations

Here are some of the generated samples from our dataset hosted on HuggingFace.

Domain	Prompt	SQL Complexity	SQL
Clinical Operations	Which clinical procedures are scheduled for staff members assigned to emergency and pediatric departments?	Single join	`SELECT scheduled_procedures.procedure_id, scheduled_procedures.procedure_name, clinical_staff.staff_id, clinical_staff.name, clinical_staff.department, scheduled_procedures.scheduled_date FROM scheduled_procedures INNER JOIN clinical_staff ON scheduled_procedures.staff_id = clinical_staff.staff_id WHERE clinical_staff.department IN ('Emergency', 'Pediatric');`
Smart Grid	Which substations serve which distribution zones with renewable energy and demand response?	Multiple left joins	`SELECT s.name, dz.zone_name, rs.source_type, dr.status FROM substations s LEFT JOIN distribution_zones dz ON s.zone_id = dz.id LEFT JOIN renewable_sources rs ON dz.renewable_source_id = rs.id LEFT JOIN smart_meters sm ON s.id = sm.substation_id LEFT JOIN demand_response dr ON sm.id = dr.meter_id;`

How this compares to other approaches

NVIDIA’s NeMo Data Designer is another framework for synthetic data generation that appears similar on the surface. But there are fundamental differences.

In Data Designer, “seed data” is an additional column that provides context for generation, essentially few-shot examples. In Dataframer, seeds define the distribution you want to match. The system analyzes your seeds to infer properties, relationships, and statistical patterns, then generates new data that follows those patterns while introducing controlled variation.

Data Designer requires you to manually configure each column with its own prompt template and generation logic. This is an iterative process: you write a prompt, test it, see what comes out, adjust the prompt, repeat. Often you don’t know what to write to get quality data until you’ve experimented extensively. Dataframer’s spec generation bypasses this entirely. You provide objectives in natural language, and the system generates a complete specification automatically.

Data Designer is a low-level toolkit for building data generation pipelines. Dataframer is a higher-level platform that handles the pipeline complexity for you. You describe what you want, and it figures out how to generate it.

Guaranteed validity through execution

Every schema and query in our generated dataset has been executed against SQLite, PostgreSQL, and MySQL. This isn’t just syntax checking; it’s actual execution.

Dataframer’s SQL validation supports three levels: syntax-only, syntax plus schema execution, and full execution of both schema and query. We used the full validation level. Queries that reference non-existent tables, use dialect-specific functions, or have any other execution errors are caught and either revised or filtered.
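
For a sense of what execution-based validation involves, here is a minimal sketch for SQLite only, using Python's built-in sqlite3 module and an in-memory database. Passing only a schema corresponds roughly to the "syntax plus schema execution" level, and passing both corresponds to full execution. The function name and return shape are assumptions; the PostgreSQL and MySQL checks would need running servers and their own drivers, and are not shown.

```python
import sqlite3
from typing import Optional, Tuple


def validate_on_sqlite(schema_sql: str, query_sql: Optional[str] = None) -> Tuple[bool, str]:
    """Execute the schema (and optionally the query) in a throwaway in-memory database.

    Returns (True, "") on success, or (False, error message) on failure.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)  # may contain several CREATE TABLE statements
        if query_sql is not None:
            conn.execute(query_sql).fetchall()  # actually run the query
        return True, ""
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return False, str(exc)
    finally:
        conn.close()


# Toy schema and queries, made up for this example.
schema = "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL);"
good_query = "SELECT region, SUM(amount) FROM orders GROUP BY region;"
bad_query = "SELECT * FROM customers;"  # references a table the schema never defines

good_result = validate_on_sqlite(schema, good_query)
bad_result = validate_on_sqlite(schema, bad_query)
```

The second query is syntactically fine, so a syntax-only check would pass it; it fails only once it is executed against the schema.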

Dataframer’s generation agents can invoke tools during generation: not only SQL validators, but also code executors, format checkers, and custom validation functions.

Try it yourself

The dataset is available on HuggingFace.

To generate your own text-to-SQL dataset or any other structured data with Dataframer:

Upload seed data - As few as two samples work. You can also go fully seedless: write objectives, generate a spec from scratch, and produce data without any seeds at all.
Write objectives - Describe what you want in natural language. Be specific about diversity requirements, complexity targets, or constraints.
Generate and refine spec - Review the auto-generated specification. Adjust distributions, add or remove properties, encode domain-specific constraints.
Run generation - Choose your model, enable revisions for quality, and set your sample count. For SQL datasets, schema and query columns are auto-detected and validation runs automatically.
Validate and iterate - Review the evaluation metrics, check sample quality, refine the spec if needed.

The complete workflow guide covers each step in detail.