From Two Patient Samples to Hundreds in 3 Easy Steps
Discover how DataFramer helps healthcare AI teams overcome patient data scarcity by generating realistic, privacy-safe synthetic datasets that preserve the structure and patterns of real clinical data.
Puneet Anand
Wed Oct 15
The challenge of limited healthcare data
Healthcare innovation depends on data. Every discovery, every improvement in treatment, begins with information. Yet, real patient data is extremely difficult to access. Privacy laws such as HIPAA and GDPR protect sensitive information, and while they are essential, they also restrict how much data researchers and AI developers can use.
Even when data becomes available, it’s often incomplete, de-identified, or stored in silos across departments. A researcher might have a lab report but not the corresponding imaging study or discharge summary. This fragmentation makes it hard to train machine learning models that reflect the diversity and complexity of real-world patients.
Teams spend months requesting access, cleaning data, and managing compliance, only to end up with small, narrow datasets that can’t support robust AI systems. Innovation slows not because of a lack of ideas, but because usable data remains out of reach.
Synthetic data offers a practical path forward
Synthetic data is artificially generated information that preserves the structure, patterns, and statistical relationships of real datasets without exposing any real identities. In healthcare AI, this approach helps teams safely create training and evaluation data that reflects genuine, real-world clinical complexity while staying compliant.
Introducing DataFramer
DataFramer is a synthetic-data generation platform designed to help organizations build, test, and deploy AI systems without exposing sensitive real-world data. Here’s a breakdown of how it can be used, especially in domains like healthcare.
How the three steps work in practice
You will follow a clear three-step workflow that mirrors the demo below. Here are the highlights:
Step 1: Uploading a few representative patient folders as seed data
Start with a handful of representative seed records that you upload to DataFramer. Make sure each one includes the kinds of documents your models will eventually see in production, such as stress tests, ECGs, lab results, imaging, and discharge summaries. Validate that each patient’s files share a consistent identifier so relationships across documents remain intact.
Step 2: Use those seeds to build a precise specification that sets target distributions and requirements
Decide and encode into DataFramer the target distributions you want to study or balance, like increasing coronary artery disease or diabetes prevalence or emphasizing elderly populations. Encode clinical logic that mirrors real workflows, such as when a stress test or an operative report is likely to appear.
Step 3: Run generations to create a larger synthetic dataset that is pre-evaluated and iterated until it matches your targets
Finally, use DataFramer to expand the dataset from the seed files, then evaluate whether the generated samples match your targets and iterate until they do. This is a repeatable, compliant method to move from scarce data to learning-ready data without risking patient privacy.
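At its core, Step 3 is a generate-evaluate-iterate loop. The sketch below is illustrative only: `generate_samples`, `measure_prevalence`, and the spec format are hypothetical stand-ins for this workflow, not the DataFramer API.

```python
import random

# Hypothetical stand-ins for the generate-evaluate-iterate workflow;
# these are NOT DataFramer API calls.
def generate_samples(spec, n, seed=0):
    """Toy generator: draw one condition per sample from the spec's targets."""
    rng = random.Random(seed)
    conditions = list(spec["target_prevalence"])
    weights = list(spec["target_prevalence"].values())
    return [{"condition": rng.choices(conditions, weights=weights)[0]}
            for _ in range(n)]

def measure_prevalence(samples):
    """Empirical prevalence of each condition in the generated set."""
    counts = {}
    for s in samples:
        counts[s["condition"]] = counts.get(s["condition"], 0) + 1
    return {c: k / len(samples) for c, k in counts.items()}

spec = {"target_prevalence": {
    "coronary artery disease": 0.45,
    "type 2 diabetes": 0.30,
    "other": 0.25,
}}

# Generate, evaluate against the targets, and iterate until within tolerance.
for attempt in range(5):
    samples = generate_samples(spec, n=2000, seed=attempt)
    observed = measure_prevalence(samples)
    gaps = {c: abs(observed.get(c, 0.0) - t)
            for c, t in spec["target_prevalence"].items()}
    if max(gaps.values()) < 0.05:  # every prevalence within 5 points of target
        break
```

The loop structure is the point: generation is cheap to repeat, so you keep regenerating until the measured distributions fall inside your tolerance.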
Let’s visit these steps in more detail in this demo video, where we generate 50 realistic samples in around 9 minutes.
Here is a step-by-step walkthrough of DataFramer from this demo.
Prerequisites and recommended inputs
- A small, representative set of seed files for each subject or entity. Examples in healthcare include stress tests, ECG reports, lab results, imaging, discharge summaries, and patient profiles.
- Consistent identifiers across files for the same subject to preserve cross-document linkage.
- Clear target goals for distributions and attributes to control during synthesis. Examples include disease prevalence, gender balance, and age groups.
Step 1 Upload seed data
Purpose
Ingest and organize real seed samples that define the structure and context for synthetic generation.
What you do
- Select dataset mode. For multi-file subjects, choose multi-folder so each subject can include multiple documents.
- Upload the root folder or select folders for each subject.
- Provide a dataset name and description.
What DataFramer does
- Detects and indexes relevant files.
- Preserves intra-subject relationships by maintaining a consistent subject ID across files.
- Prepares the dataset for analysis and specification creation.
Outputs
- A dataset object containing your uploaded seed files, ready to use for spec generation.
Tips
- Include the document types your models must see later.
- Ensure file names, content formats, and IDs are consistent within each subject.
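The consistency tips above can be checked automatically before upload. A minimal sketch, assuming a flat naming convention like `patient_001_labs.txt`; the layout, required document types, and function are illustrative, not part of DataFramer:

```python
from pathlib import Path

# Assumed naming convention: <patient_id>_<doc_type>.<ext>,
# e.g. patient_001_labs.txt. Purely illustrative.
REQUIRED_DOCS = {"labs", "ecg", "discharge"}

def validate_seed_folder(folder: Path) -> list[str]:
    """Return a list of problems found in one subject's seed folder."""
    problems = []
    files = [f for f in folder.iterdir() if f.is_file()]
    # All files for a subject should share one patient-ID prefix.
    prefixes = {f.stem.rsplit("_", 1)[0] for f in files}
    if len(prefixes) != 1:
        problems.append(f"inconsistent patient IDs: {sorted(prefixes)}")
    # Every document type the models must see later should be present.
    doc_types = {f.stem.rsplit("_", 1)[-1] for f in files}
    missing = REQUIRED_DOCS - doc_types
    if missing:
        problems.append(f"missing document types: {sorted(missing)}")
    return problems
```

Running a check like this on each subject folder catches broken cross-document linkage before it propagates into every synthetic subject.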
Step 2 Create specs
Purpose
Define the blueprint that controls how synthetic data is generated, including structure, properties, distributions, and clinical or business logic.
What you do
- Click Create spec on the dataset.
- Review and modify the auto-populated spec that summarizes structure, file counts per subject, content types, and detected properties such as demographics, medical history, and clinical findings.
- Configure target distributions for key properties. Some examples:
  - Increase coronary artery disease prevalence to the 40 to 50 percent range.
  - Raise diabetes prevalence.
  - Emphasize elderly female representation.
- Add or refine target dataset requirements.
  - Include medical condition as an explicit data property.
  - Require all first and last names to be unique.
  - Add additional records such as physician histories or evidence fields.
- Encode conditional relationships so the dataset behaves like real journeys. Example rule:
  - When medical condition is coronary artery disease, primary report type is stress test about 40 percent of the time and operative report about 15 percent of the time.
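A rule like this amounts to a conditional probability table. A sketch of how such a rule behaves; the table and function are illustrative, and the 45 percent "discharge summary" remainder is an assumption, since the rule only pins down the 40 and 15 percent shares:

```python
import random

# Conditional rule expressed as a probability table. The 45% remainder
# assigned to "discharge summary" is an assumption for illustration.
CONDITIONAL_RULES = {
    "coronary artery disease": {
        "stress test": 0.40,
        "operative report": 0.15,
        "discharge summary": 0.45,
    },
}
DEFAULT_TABLE = {"discharge summary": 1.0}

def sample_primary_report(condition: str, rng: random.Random) -> str:
    """Draw a primary report type conditioned on the medical condition."""
    table = CONDITIONAL_RULES.get(condition, DEFAULT_TABLE)
    return rng.choices(list(table), weights=list(table.values()))[0]
```

Conditional rules like this are what keep dependent properties coherent: the report mix shifts with the diagnosis instead of being sampled independently.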
What DataFramer does
- Analyzes the seed dataset to infer structure and candidate properties.
- Pre-populates distributions and relationships discovered in seed data.
- Validates edits to ensure constraints are consistent.
Outputs
- A saved specification that fully describes target structure, properties, distributions, and conditional logic for generating your realistic synthetic datasets.
Good practices
- Prefer conditional rules for any property that depends on another property.
- Keep distributions realistic enough to preserve utility while achieving your research goals.
Step 3 Create runs and generate synthetic patient data
Purpose
Execute the specification to produce a synthetic dataset at the desired scale.
What you do
- Click Create run from the saved specification.
- Select the spec version, choose the model, and set the number of samples to generate.
- Models can be proprietary or open source based on your environment.
- Choose whether to enable revisions.
- Revisions perform additional passes to check whether outputs meet your requirements and distributions before finalizing.
- Start the run and monitor progress.
What DataFramer does
- Applies your spec to generate new subjects with multiple files per subject.
- Preserves cross-file consistency and adheres to target distributions and conditional rules.
- Optionally performs revision cycles to improve fit to targets.
Outputs
- A generated dataset with one folder per synthetic subject. Typical contents in this healthcare demo included operative notes, ICU sheets, lab results, discharge summaries, imaging or test narratives, and patient demographics, but other file types can be added as required.
Performance notes
- Time to completion scales with sample count, model selection, and revision settings.
- Larger sample sizes converge more closely to your target distributions.
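The second note can be seen with a quick back-of-the-envelope simulation, independent of DataFramer: with plain random sampling, the gap between a target prevalence and the rate a run actually realizes shrinks roughly with the square root of the sample count.

```python
import random

def prevalence_gap(target: float, n: int, seed: int = 42) -> float:
    """Absolute gap between a target prevalence and the rate that
    n random samples actually realize (plain sampling, not DataFramer)."""
    rng = random.Random(seed)
    hits = sum(rng.random() < target for _ in range(n))
    return abs(hits / n - target)

# The gap is typically far smaller at 50,000 samples than at 50.
small_run = prevalence_gap(0.45, n=50)
large_run = prevalence_gap(0.45, n=50_000)
```

This is why a 50-sample demo run may land a few points off its targets while larger production runs track them closely.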
Reviewing and interacting with the results
DataFramer automatically evaluates the output against the targets and expectations you set.
In this case, all samples were female, 75 percent were elderly, and conditions such as type 2 diabetes, hypertension, and coronary artery disease appeared with the highest frequencies, as intended.
You can also use the chat feature to query the dataset directly, for example asking how many samples were generated or requesting a table of diseases by frequency, and receive instant structured replies.
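A query like "table of diseases by frequency" boils down to a frequency count over the generated records. The same table computed locally, over made-up samples with an illustrative record shape:

```python
from collections import Counter

# Made-up generated samples; the record shape is illustrative only.
samples = [
    {"id": 1, "conditions": ["type 2 diabetes", "hypertension"]},
    {"id": 2, "conditions": ["coronary artery disease", "hypertension"]},
    {"id": 3, "conditions": ["type 2 diabetes"]},
]

# Diseases by frequency, most common first.
freq = Counter(c for s in samples for c in s["conditions"])
for disease, count in freq.most_common():
    print(f"{disease}: {count}")
```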
The impact of synthetic data in healthcare
Synthetic data gives researchers and developers freedom to experiment, share, and iterate without risking privacy. Privacy is protected because no real identifiers are used. Datasets can be balanced to include diverse groups. Development timelines shrink from months to hours. Collaboration across institutions becomes straightforward and safe.
DataFramer turns small, limited datasets into abundant, flexible resources that fuel responsible AI development.
A future where privacy and innovation coexist
In this demo, starting from just two real patients, DataFramer created fifty complete, realistic samples in about 9 minutes. The process scales to thousands of samples, and the resulting datasets support rapid, safe AI training and evaluation.
Healthcare AI will only reach its potential when data becomes both accessible and ethical.
"We strive to start each relationship by establishing trust and building a long-term partnership. That is why we offer a complimentary dataset to all our customers to help them get started."
Ready to Get Started?
Contact our team to learn how we can help your organization develop AI systems that meet the highest standards.