DataFramer

Generate synthetic data directly in Databricks

From just a few seed examples, create multiple dataset types with full control over schema, constraints, and distributions. Get pre-evaluated, privacy-safe synthetic data—ideal for AI Evaluations, Testing, Benchmarking, and LLM post-training. Powered by your Databricks Model Serving endpoints.

What makes DataFramer unique

Few-shot to full datasets

Generates multiple dataset types from just a few seed examples—no massive training sets required.

Full control

Full control over schema, constraints, and data property distributions so outputs match your requirements.

Pre-evaluated quality

Pre-evaluated outputs for quality, consistency, and realism before you use them in pipelines.

Expert review when needed

Optional human expert review for regulated or domain-specific needs.

Built for real workloads

Ideal for LLM training, text extraction, tabular modeling, and scenario simulation.

Privacy-safe

Privacy-safe synthetic alternatives to PHI, PII, and other sensitive data.

How it works

Install the pydataframer-databricks connector, point it at any Unity Catalog table, and generate synthetic datasets that land back as Delta tables.

Unity Catalog (your source table) → DataFramer (generate & evaluate) → Delta table (ready for downstream use)
databricks_notebook.py
# Imports from the pydataframer-databricks connector
# (module path shown is assumed; check the package docs)
from pydataframer_databricks import DatabricksConnector, DatasetType, FileType

# Connect to your Databricks workspace
connector = DatabricksConnector(dbutils, scope="dataframer")

# Fetch seed data from any Unity Catalog table
seed_df = connector.fetch_sample_data(
    table_name="catalog.schema.my_table",
    num_items_to_select=25
)

# ... generate synthetic data via DataFramer ...

# Load results back into a Delta table
connector.load_generated_data(
    table_name="catalog.schema.synthetic_output",
    downloaded_zip=generated_zip,
    dataset_type=DatasetType.SINGLE_FILE,
    file_type=FileType.CSV
)

Service principal auth

OAuth M2M tokens via Databricks Secrets.
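As a sketch, reading the service principal's credentials out of a secret scope looks like the following. In a real notebook `dbutils` is provided by the runtime; the stub class and the key names `client_id`/`client_secret` are placeholders for illustration only, not part of the connector.

```python
# Stub standing in for the runtime-provided dbutils, so this
# example is self-contained. The secret values are fake.
class _StubSecrets:
    _store = {
        ("dataframer", "client_id"): "my-sp-client-id",
        ("dataframer", "client_secret"): "***",
    }

    def get(self, scope, key):
        return self._store[(scope, key)]


class _StubDbutils:
    secrets = _StubSecrets()


dbutils = _StubDbutils()

# dbutils.secrets.get(scope=..., key=...) is the standard Databricks
# Secrets API; the connector resolves the service principal's OAuth
# M2M credentials from the scope you pass in ("dataframer" here,
# matching the notebook snippet above).
client_id = dbutils.secrets.get(scope="dataframer", key="client_id")
client_secret = dbutils.secrets.get(scope="dataframer", key="client_secret")
```

In production you would create the scope once with the Databricks CLI or API and never hard-code the values.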

Your models, your data

Spec and sample generation run through Databricks Model Serving. Data never leaves your environment.

Standard catalog permissions

Uses existing USE CATALOG, SELECT, and MODIFY grants. No special setup.
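To make the three grants concrete, here is a small helper that builds the corresponding Unity Catalog GRANT statements. The helper itself is hypothetical (not part of the connector), and exact grant requirements can vary with your workspace setup; it only spells out the permissions named above.

```python
# Hypothetical helper: builds the standard Unity Catalog GRANT
# statements a service principal needs for the DataFramer round-trip.
def required_grants(catalog, schema, table, principal):
    """Return GRANT statements for USE CATALOG, SELECT, and MODIFY."""
    fq_table = f"{catalog}.{schema}.{table}"
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT SELECT ON TABLE {fq_table} TO `{principal}`",
        f"GRANT MODIFY ON TABLE {fq_table} TO `{principal}`",
    ]


# In a notebook you would run each statement via spark.sql(...);
# here we just print them.
for stmt in required_grants("catalog", "schema", "my_table", "my-service-principal"):
    print(stmt)
```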

Full round-trip

Read from any catalog table, generate synthetic data, and write back as Delta, all in one workflow.

Arbitrarily large samples

Generate as much high-quality synthetic data as you need for ML training, analytics, and testing.

CSV, JSON, and JSONL

Supports single-file and multi-file dataset structures across common file formats.
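For reference, the three supported formats shape the same records differently. The stdlib-only snippet below is just an illustration of those shapes, not connector code:

```python
import csv
import io
import json

records = [{"id": 1, "text": "alpha"}, {"id": 2, "text": "beta"}]

# CSV: header row plus one flat row per record
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "text"])
writer.writeheader()
writer.writerows(records)
csv_out = buf.getvalue()

# JSON: a single array document
json_out = json.dumps(records)

# JSONL: one JSON object per line (easy to stream and append)
jsonl_out = "\n".join(json.dumps(r) for r in records)
```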

Ready to generate synthetic data in Databricks?

Follow the step-by-step guide or dive into the full documentation.