Long-Form Synthetic Data Generation: Same LLM, Dramatically Different Results

Same LLM, dramatically different results. DataFramer vs raw Claude on 50K-token document generation.

Long-Form Text Generation Comparison

Alex Lyzhov

Mon Jan 12

Summary

We generated 50K-token documents using the same frontier LLM and got dramatically different results. In the baseline, outputs collapsed into short, repetitive “summary essays”. With DataFramer’s scaffold, we consistently got full-length, style-faithful documents across 4 long-text datasets.

What surprised us most (real examples):

  • Wiki Real Estate: baseline repeated “Zoning” in 8 of 15 samples; DataFramer produced 15 distinct topics from 5 inputs.
  • Gutenberg: baseline reused the same plot loop; DataFramer generated genuinely varied stories with strong prose.
  • Wiki Medical: baseline got shorter and added unwanted Markdown; DataFramer stayed long and encyclopedic.

Read on for:

  • The exact evaluation setup (blind, judged by Gemini 3 Pro Preview)
  • The three failure modes (mode collapse, style drift, length shrinkage) and how scaffolding prevents them
  • Full outputs + scripts for reproducing our results

Introduction

Teams running long-form AI generation in production see the same failures in their traces: outputs that repeat the same topics, drift from the expected style, or come back much shorter than required. Getting eval coverage for those failures is where things get complicated.

Long-form synthetic text generation (10K-100K tokens per sample) is hard because the tool you’d use to generate eval data has the same failure modes you’re trying to test. Raw LLMs commit to tokens without lookahead, can’t revise earlier sections, and drift or repeat as context grows. You need outlining, state tracking, iterative revision, and verification loops to produce eval data that actually covers mode collapse, style drift, and length shrinkage.

What is DataFramer?

DataFramer starts from production traces: failures and interesting AI outputs you’ve found through review. It analyzes them to build a spec that captures the properties and variation in your data, then generates labeled eval samples from that spec using outlining, evaluation loops, and revision cycles. For more details, see our documentation.

DataFramer Pipeline

Experiment Setup

Datasets

We manually collected 4 datasets with long texts as seeds for style and formatting conditioning:

Dataset | Source | Number of Samples | Sample Length | Description
--- | --- | --- | --- | ---
Wikisource | Download | 2 texts | 35k-50k tokens | “Results and Prospects” (Trotsky) + “The Time Machine” (H.G. Wells)
Gutenberg | Download | 2 texts | 45k-50k tokens | “The Call of the Wild” (Jack London) + “The Time Machine” (H.G. Wells)
Wiki Medical | Download | 2 articles | 25k-30k tokens | “Healthcare in Canada” + “History of Medicine”
Wiki Real Estate | Download | 5 articles | 5k-15k tokens | NIMBY, Real Estate Economics, Intellectual Property, Property Management, REITs

Generation Protocol

We generated up to 15 samples for each dataset. Generation followed the standard flow and was almost completely hands-off:

  1. Load seed data
  2. Use Claude Sonnet 4.5 to generate a spec: a blueprint that captures the structure, patterns, and requirements of your data
  3. Make minimal spec edits (see below)
  4. Generate samples

The spec edits were trivial: we only made changes in 2 places across all 4 datasets. See the before/after specs:

Dataset | Generated Spec | Edited Spec | Change
--- | --- | --- | ---
Wiki Medical | spec | (same) | No changes
Wikisource | before | after | Removed last sentence about length (platform determines length automatically)
Wiki Real Estate | spec | (same) | No changes
Gutenberg | before | after | Changed “from” to “resembling those from” (we want new fiction, not reproductions)

There was no cherry-picking: we did not select datasets where DataFramer performs well, and we made no algorithm changes for these datasets. All seed datasets, generation specs, and scripts are included for reproducibility.

Baseline

As of January 2026, we are not aware of any commercially available system that offers comparable general-purpose generation of diverse long-form texts. We therefore compare against a raw frontier LLM baseline: Claude Sonnet 4.5 in low reasoning mode (a 1,024-token reasoning budget). DataFramer uses the same model for all its internal roles (outlining, generation, filtering, revision), so the only difference between the two methods is our agentic framework.

Evaluation Methodology

We designed the LLM evaluation framework and eval harness to be as fair as we could make them:

  • Systems anonymized as “System 1” (DataFramer) and “System 2” (baseline Claude Sonnet 4.5), same number of samples for each
  • We used an LLM-as-a-judge approach with an independent evaluator from a different model family: Gemini 3 Pro Preview in high reasoning mode
  • Evaluator received all samples from both systems in one context window and compared them across 7 dimensions: Diversity, Style Distribution Matching, Length, Quality, Artifacts, Validity, and Overall Assessment
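The blind setup above can be sketched roughly as follows. This is a minimal illustration and not the harness used for this post: the function name `build_blind_prompt`, the prompt wording, and the sample delimiters are assumptions. Only the neutral system labels, the equal-sample-count rule, and the seven dimensions come from the setup described above.

```python
# Hypothetical sketch of assembling one blind judge prompt.
DIMENSIONS = [
    "Diversity", "Style Distribution Matching", "Length", "Quality",
    "Artifacts", "Validity", "Overall Assessment",
]

def build_blind_prompt(outputs_by_system: dict[str, list[str]]) -> str:
    """Pack all samples from every system into a single judge prompt.

    Keys must already be neutral labels such as "System 1", so the
    judge never sees which pipeline produced which samples.
    """
    counts = {len(samples) for samples in outputs_by_system.values()}
    if len(counts) != 1:
        raise ValueError("each system must contribute the same number of samples")

    parts = [
        "Compare the systems below on each of these dimensions: "
        + ", ".join(DIMENSIONS) + "."
    ]
    for label, samples in outputs_by_system.items():
        for i, text in enumerate(samples, start=1):
            # Assumed delimiter format; any unambiguous separator would do.
            parts.append(f"=== {label}, sample {i} ===\n{text}")
    return "\n\n".join(parts)

prompt = build_blind_prompt({
    "System 1": ["first output ...", "second output ..."],
    "System 2": ["first output ...", "second output ..."],
})
```

The resulting string would then be sent to the judge model in a single request, so that it compares the two sets of samples side by side rather than scoring them in isolation.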

Results

At a Glance

Dataset | DataFramer | Baseline (Sonnet 4.5)
--- | --- | ---
Wikisource | Full novellas with compelling plots, authentic period voices | Only produced dry essays, ignored fiction entirely
Gutenberg | Superb prose quality, massive creativity | Plot loop - same expedition story repeated
Wiki Real Estate | 15 unique topics from 5 inputs, perfect style match | 8x “Zoning”, 4x “Land Value Tax”
Wiki Medical | Long-context coherence, encyclopedic depth | Too short, added unwanted Markdown formatting

The same model, the same seeds. The only difference is DataFramer’s agentic scaffold.

Deep Dive: Wikisource

Criterion | DataFramer | Sonnet 4.5 Baseline
--- | --- | ---
Diversity | Exceptional - political treatises, epistolary novels, sci-fi, utopias. Creatively merges both inputs. | Very low - nearly all dry expository essays. No fiction, no dialogue. Repetitive titles.
Style Distribution | Matches both input styles. Reproduces Wikisource formatting (nav arrows, metadata). Authentic period voices. | Fails - homogenizes everything into generic “academic” voice.
Length | Massive long-form content - full novellas with Preface to Epilogue structure. | Short-medium essays, summary-based, lacking depth.
Quality | Extraordinary - compelling plots, character arcs, authentic world-building. | Mediocre - reads like undergraduate summaries.
Artifacts | Intentionally reproduces Wikisource artifacts (nav links, page numbers). | Strips all formatting.
Validity | High - historically grounded, internally consistent. | Moderate - logically sound but platitudinous.

Winner: DataFramer (vastly superior)

The other three datasets showed consistent patterns: DataFramer maintained diversity and style fidelity, while the baseline collapsed into repetitive outputs (Gutenberg: the same plot structure repeated; Wiki Real Estate: 12 of 15 samples, or 80%, on just two topics) and introduced unwanted formatting changes.

Full Evaluation Details

Evaluation summaries and full reports:

All generated outputs (both DataFramer and baseline) are available for download:

All data (seeds, DataFramer outputs, and baseline outputs) is also available on HuggingFace.


DataFramer Avoids Typical Synthetic Data Failure Modes

The blind evaluation revealed three distinct failure modes in the baseline that DataFramer successfully avoids:

Failure Mode 1: Mode Collapse

The baseline repeatedly generates the same topics or formulaic plot structures. In Wiki Real Estate, “Zoning” appeared 8 times and “Land Value Tax” 4 times out of 15 samples. In Gutenberg, every story followed the same arc: ship, island, ruins, beings, escape. In Wiki Medical, duplicate “Medical Education” articles appeared.

DataFramer avoids this through diversity injections during the outlining phase, ensuring each sample covers different ground within the topic space defined by the seeds.
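Mode collapse of this kind is easy to quantify once each sample has a topic label. The sketch below is one possible check, not part of DataFramer: the function name `topic_repetition` is hypothetical, and it assumes topic labels (for example, article titles) have already been extracted. The three placeholder labels stand in for the baseline samples whose topics are not listed above.

```python
from collections import Counter

def topic_repetition(topics: list[str]) -> tuple[int, float]:
    """Return (unique topic count, share of samples whose topic occurs more than once)."""
    if not topics:
        raise ValueError("need at least one topic label")
    # Normalize case and whitespace so "Zoning" and "zoning " count as one topic.
    counts = Counter(t.strip().lower() for t in topics)
    repeated = sum(n for n in counts.values() if n > 1)
    return len(counts), repeated / len(topics)

# 8 + 4 repeated labels as reported for the baseline Wiki Real Estate run;
# the remaining three are placeholders, since their topics are not reported.
baseline_topics = (
    ["Zoning"] * 8
    + ["Land Value Tax"] * 4
    + ["placeholder A", "placeholder B", "placeholder C"]
)
unique, repeated_share = topic_repetition(baseline_topics)
print(unique, repeated_share)  # 5 0.8
```

A run with 15 distinct topics would instead give 15 unique topics and a repeated share of 0.0.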

Failure Mode 2: Style Drift

The baseline introduces formatting and structural elements not present in the seed data: adding Markdown headers when inputs used plain text, converting dense encyclopedic prose into bullet-point lists, and stripping source-specific formatting artifacts like Wikisource navigation and metadata.

DataFramer avoids this through iterative evaluation and editing loops that keep the generated style tightly matched to the original distribution, continuously comparing output characteristics against seed characteristics.
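A rough way to spot one form of style drift (Markdown appearing where the seeds were plain text) is to compare how many lines look like Markdown in the seed and in the output. This is only a heuristic sketch under stated assumptions: the two regular expressions, the 5% tolerance, and the function names are illustrative choices, not DataFramer's evaluation logic.

```python
import re

# Line-level patterns that typically signal Markdown structure (assumed heuristics).
MARKDOWN_PATTERNS = {
    "header": re.compile(r"^#{1,6}\s+\S"),
    "bullet": re.compile(r"^\s*[-*+]\s+\S"),
}

def markdown_line_share(text: str) -> dict[str, float]:
    """Fraction of non-empty lines matching each Markdown pattern."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return {name: 0.0 for name in MARKDOWN_PATTERNS}
    return {
        name: sum(bool(pattern.match(ln)) for ln in lines) / len(lines)
        for name, pattern in MARKDOWN_PATTERNS.items()
    }

def introduced_markup(seed: str, output: str, tolerance: float = 0.05) -> list[str]:
    """Names of patterns clearly more frequent in the output than in the seed."""
    seed_share = markdown_line_share(seed)
    out_share = markdown_line_share(output)
    return [
        name for name in MARKDOWN_PATTERNS
        if out_share[name] - seed_share[name] > tolerance
    ]

seed = "Plain prose line one.\nPlain prose line two."
drifted = "## Overview\n- point one\n- point two\nSome prose."
print(introduced_markup(seed, drifted))  # ['header', 'bullet']
```

A check like this only covers added markup; stripped source formatting, such as missing Wikisource navigation, would need the comparison run in the other direction with patterns specific to that source.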

Failure Mode 3: Length Shrinkage

The baseline generates summaries instead of full documents. Wikisource seeds were 35k-50k tokens; baseline outputs were 2k-5k tokens. Dense, chapter-length inputs became brief essays.

DataFramer addresses this by explicitly accounting for target length during outlining and generation, maintaining long-context coherence through structured revision passes.
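Length shrinkage can be tracked with a simple ratio of output length to seed length. The sketch below is a hypothetical check, assuming token counts have already been computed with whatever tokenizer you use; the function names and the 0.5 threshold are assumptions. The inputs are the range endpoints quoted above for Wikisource, used only as illustrative values rather than actual per-document counts.

```python
from statistics import median

def length_ratio(seed_lengths: list[int], output_lengths: list[int]) -> float:
    """Median output length divided by median seed length (both in tokens)."""
    return median(output_lengths) / median(seed_lengths)

def is_shrunk(seed_lengths: list[int], output_lengths: list[int],
              min_ratio: float = 0.5) -> bool:
    """Flag a run whose typical output is shorter than min_ratio of a typical seed."""
    return length_ratio(seed_lengths, output_lengths) < min_ratio

# Range endpoints from the Wikisource comparison, not real per-sample counts.
seeds = [35_000, 50_000]
baseline = [2_000, 5_000]
print(round(length_ratio(seeds, baseline), 3))  # 0.082
print(is_shrunk(seeds, baseline))               # True
```

With these illustrative numbers, a typical baseline output is under a tenth of the length of a typical seed.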


Discussion

When building eval datasets like this, a few things need monitoring: topic diversity, style consistency with your inputs, and length profiles. Missing any one of them can produce a dataset that tests the wrong thing.

Our evaluation showed that prompting Claude Sonnet 4.5 directly produced repetitive, formulaic outputs across all four datasets. DataFramer’s pipeline, using the same underlying model, produced dramatically better results.

If you’re building similar pipelines, watch for these three failure modes:

  • Mode collapse: Count unique topics and structures in your outputs
  • Style drift: Compare formatting to your input examples
  • Length shrinkage: Check whether outputs are significantly shorter than inputs

Beyond these, raw LLM generation can introduce fabricated values that accumulate silently across large runs. DataFramer’s verification and revision loops are designed to catch these before they reach your dataset.

For a full breakdown, see Why DataFramer.

Get started

Ready to build better AI with better data?

DataFramer helps teams take production signal traces, define the distribution of failures and interesting cases precisely, and generate labeled eval datasets that give systematic coverage of what matters. Interested in building eval coverage for your AI workflows? Just reach out.