Generate synthetic data directly in Databricks
From just a few seed examples, create multiple dataset types with full control over schema, constraints, and distributions. Get pre-evaluated, privacy-safe synthetic data, ideal for AI evaluations, testing, benchmarking, and LLM post-training. Powered by your Databricks Model Serving endpoints.
What makes DataFramer unique
Few-shot to full datasets
Generates multiple dataset types from just a few seed examples—no massive training sets required.
Full control
Full control over schema, constraints, and data property distributions so outputs match your requirements.
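As an illustration only (the field names below are hypothetical, not DataFramer's actual API), a generation spec with schema, constraints, and target distributions might look like this:

```python
# Hypothetical generation spec -- key names are illustrative only,
# not DataFramer's actual API.
spec = {
    "schema": {
        "age": "int",
        "diagnosis_code": "string",
        "visit_date": "date",
    },
    "constraints": {
        "age": {"min": 18, "max": 90},          # hard bounds on generated values
        "visit_date": {"after": "2020-01-01"},  # no visits before this date
    },
    "distributions": {
        # target a ~60/40 split across two diagnosis codes
        "diagnosis_code": {"E11.9": 0.6, "I10": 0.4},
    },
}

# Sanity check: distribution weights should sum to 1
assert abs(sum(spec["distributions"]["diagnosis_code"].values()) - 1.0) < 1e-9
```

The point is that every property of the output is declared up front, so generated rows match your requirements rather than whatever the model happens to produce.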
Pre-evaluated quality
Pre-evaluated outputs for quality, consistency, and realism before you use them in pipelines.
Expert review when needed
Optional human expert review for regulated or domain-specific needs.
Built for real workloads
Ideal for LLM training, text extraction, tabular modeling, and scenario simulation.
Privacy-safe
Privacy-safe synthetic alternatives to PHI, PII, and other sensitive data.
How it works
Install the pydataframer-databricks connector, point it at any Unity Catalog table, and generate synthetic datasets that land back as Delta tables.
# Requires the pydataframer-databricks connector; the exact import
# path below is a sketch -- check the package docs
from pydataframer.databricks import DatabricksConnector, DatasetType, FileType

# Connect to your Databricks workspace
connector = DatabricksConnector(dbutils, scope="dataframer")

# Fetch seed data from any Unity Catalog table
seed_df = connector.fetch_sample_data(
    table_name="catalog.schema.my_table",
    num_items_to_select=25
)

# ... generate synthetic data via DataFramer ...

# Load results back into a Delta table
# (generated_zip comes from the DataFramer generation step above)
connector.load_generated_data(
    table_name="catalog.schema.synthetic_output",
    downloaded_zip=generated_zip,
    dataset_type=DatasetType.SINGLE_FILE,
    file_type=FileType.CSV
)
Service principal auth
OAuth M2M tokens via Databricks Secrets.
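A rough sketch of the OAuth M2M flow under the hood: the service principal's credentials live in a Databricks secret scope, and a client-credentials grant against the workspace token endpoint exchanges them for a short-lived access token. Hostnames and credential values below are placeholders.

```python
import base64

# In a notebook, credentials would come from Databricks Secrets, e.g.:
#   client_id = dbutils.secrets.get(scope="dataframer", key="client_id")
#   client_secret = dbutils.secrets.get(scope="dataframer", key="client_secret")
client_id = "my-service-principal-id"       # placeholder
client_secret = "my-service-principal-key"  # placeholder

# Databricks OAuth M2M uses the standard client-credentials grant against
# the workspace's /oidc/v1/token endpoint; host is a placeholder.
token_url = "https://my-workspace.cloud.databricks.com/oidc/v1/token"
basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
headers = {"Authorization": f"Basic {basic}"}
payload = {"grant_type": "client_credentials", "scope": "all-apis"}
# POSTing `payload` to `token_url` with `headers` returns a short-lived
# access token the connector uses for subsequent API calls.
```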
Your models, your data
Spec and sample generation run through Databricks Model Serving. Data never leaves your environment.
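To make "data never leaves your environment" concrete, here is a sketch of what a generation call looks like: a request to a Model Serving endpoint's `/invocations` route inside your own workspace. The endpoint name and host are placeholders; the body uses the standard chat payload accepted by Databricks chat-model endpoints.

```python
import json

# Generation requests go to a Model Serving endpoint inside your
# workspace -- host and endpoint name below are placeholders.
host = "https://my-workspace.cloud.databricks.com"
endpoint_name = "my-llm-endpoint"
url = f"{host}/serving-endpoints/{endpoint_name}/invocations"

body = json.dumps({
    "messages": [
        {"role": "user", "content": "Generate one synthetic patient record."}
    ],
    "max_tokens": 256,
})
# POSTing `body` to `url` (with a bearer token) runs generation on your
# own endpoint, so prompts and outputs stay in the workspace.
```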
Standard catalog permissions
Uses existing USE CATALOG, SELECT, and MODIFY grants. No special setup.
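For reference, the grants involved are ordinary Unity Catalog SQL statements; a sketch with placeholder catalog, schema, and principal names (in a notebook each statement would run via `spark.sql`):

```python
# Typical Unity Catalog grants for the connector's principal --
# catalog, schema, and principal names are placeholders.
grants = [
    "GRANT USE CATALOG ON CATALOG my_catalog TO `my-service-principal`",
    "GRANT USE SCHEMA ON SCHEMA my_catalog.my_schema TO `my-service-principal`",
    "GRANT SELECT ON TABLE my_catalog.my_schema.my_table TO `my-service-principal`",
    "GRANT MODIFY ON SCHEMA my_catalog.my_schema TO `my-service-principal`",
]
# In a Databricks notebook:
#   for stmt in grants:
#       spark.sql(stmt)
```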
Full round-trip
Read from any catalog table, generate synthetic data, and write back as Delta, all in one workflow.
Arbitrarily large samples
Generate as much high-quality synthetic data as you need for ML training, analytics, and testing.
CSV, JSON, and JSONL
Supports single-file and multi-file dataset structures across common file formats.
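To show the difference between the single-file layouts, here is the same pair of records rendered as JSONL (one JSON object per line) and as CSV (header row plus one line per record), using only the standard library:

```python
import csv
import io
import json

# Two toy records standing in for generated rows
records = [
    {"id": 1, "text": "synthetic row one"},
    {"id": 2, "text": "synthetic row two"},
]

# JSONL: one JSON object per line
jsonl = "\n".join(json.dumps(r) for r in records)

# CSV: header row plus one line per record
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "text"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()
```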
Ready to generate synthetic data in Databricks?
Follow the step-by-step guide or dive into the full documentation.