From Two Patient Samples to Hundreds in 3 Easy Steps
Discover how DataFramer helps healthcare AI teams overcome patient data scarcity by generating realistic, privacy-safe synthetic datasets that preserve the structure and patterns of real clinical data.
Puneet Anand
Wed Oct 15
The challenge of limited healthcare data
Healthcare innovation depends on data. Every discovery, every improvement in treatment, begins with information. Yet, real patient data is extremely difficult to access. Privacy laws such as HIPAA and GDPR protect sensitive information, and while they are essential, they also restrict how much data researchers and AI developers can use.
Even when data becomes available, it’s often incomplete, de-identified, or stored in silos across departments. A researcher might have a lab report but not the corresponding imaging study or discharge summary. This fragmentation makes it hard to train machine learning models that reflect the diversity and complexity of real-world patients.
Teams spend months requesting access, cleaning data, and managing compliance, only to end up with small, narrow datasets that can’t support robust AI systems. Innovation slows not because of a lack of ideas, but because usable data remains out of reach.
Synthetic data offers a practical path forward
Synthetic data is artificially generated information that preserves the structure, patterns, and statistical relationships of real datasets without exposing any real identities. In healthcare AI, this approach helps teams safely create training and evaluation data that reflects genuine, real-world clinical complexity while staying compliant.
Introducing DataFramer
DataFramer is a synthetic-data generation platform designed to help organizations build, test, and deploy AI systems without exposing sensitive real-world data. Here’s a breakdown of how it can be used, especially in domains like healthcare.
How the three steps work in practice
You will follow a clear three-step workflow that mirrors the demo below. Here are the highlights:
Step 1: Uploading a few representative patient folders as seed data
Start with a handful of representative seed records that you upload to DataFramer. Make sure each one includes the kinds of documents your models will eventually see in production, such as stress tests, ECGs, lab results, imaging, and discharge summaries. Validate that each patient’s files share a consistent identifier so relationships across documents remain intact.
Step 2: Use those seeds to build a precise specification that sets target distributions and requirements
Decide and encode into DataFramer the target distributions you want to study or balance, like increasing coronary artery disease or diabetes prevalence or emphasizing elderly populations. Encode clinical logic that mirrors real workflows, such as when a stress test or an operative report is likely to appear.
Step 3: Run generations to create a larger synthetic dataset that is pre-evaluated and iterated until it matches your targets
Finally, use DataFramer to expand the dataset from the seed files, then evaluate whether the generated samples match your targets and iterate until they do. This is a repeatable, compliant method to move from scarce data to learning-ready data without risking patient privacy.
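At its core, Step 3 is a generate-evaluate-iterate loop. The sketch below is illustrative only: `generate_samples`, `measure_prevalence`, and the spec format are hypothetical stand-ins for this workflow, not the DataFramer API.

```python
import random

# Hypothetical stand-ins for the generate-evaluate-iterate workflow;
# these are NOT DataFramer API calls.
def generate_samples(spec, n, seed=0):
    """Toy generator: draw one condition per sample from the spec's targets."""
    rng = random.Random(seed)
    conditions = list(spec["target_prevalence"])
    weights = list(spec["target_prevalence"].values())
    return [{"condition": rng.choices(conditions, weights=weights)[0]}
            for _ in range(n)]

def measure_prevalence(samples):
    """Empirical prevalence of each condition in the generated set."""
    counts = {}
    for s in samples:
        counts[s["condition"]] = counts.get(s["condition"], 0) + 1
    return {c: k / len(samples) for c, k in counts.items()}

spec = {"target_prevalence": {
    "coronary artery disease": 0.45,
    "type 2 diabetes": 0.30,
    "other": 0.25,
}}

# Generate, evaluate against the targets, and iterate until within tolerance.
for attempt in range(5):
    samples = generate_samples(spec, n=2000, seed=attempt)
    observed = measure_prevalence(samples)
    gaps = {c: abs(observed.get(c, 0.0) - t)
            for c, t in spec["target_prevalence"].items()}
    if max(gaps.values()) < 0.05:  # every prevalence within 5 points of target
        break
```

The loop structure is the point: generation is cheap to repeat, so you keep regenerating until the measured distributions fall inside your tolerance.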
Let’s visit these steps in more detail in this demo video, where we generate 50 realistic samples in around 9 minutes.
Here is a step-by-step walkthrough of DataFramer from this demo.
Prerequisites and recommended inputs
- A small, representative set of seed files for each subject or entity. Examples in healthcare include stress tests, ECG reports, lab results, imaging, discharge summaries, and patient profiles.
- Consistent identifiers across files for the same subject to preserve cross-document linkage.
- Clear target goals for distributions and attributes to control during synthesis. Examples include disease prevalence, gender balance, and age groups.
Step 1 Upload seed data
Purpose
Ingest and organize real seed samples that define the structure and context for synthetic generation.
What you do
- Select dataset mode. For multi-file subjects, choose multi-folder so each subject can include multiple documents.
- Upload the root folder or select folders for each subject.
- Provide a dataset name and description.
What DataFramer does
- Detects and indexes relevant files.
- Preserves intra-subject relationships by maintaining a consistent subject ID across files.
- Prepares the dataset for analysis and specification creation.
Outputs
- A dataset object containing your uploaded seed files, ready to use for spec generation.
Tips
- Include the document types your models must see later.
- Ensure file names, content formats, and IDs are consistent within each subject.
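The consistency tips above can be checked automatically before upload. A minimal sketch, assuming a flat naming convention like `patient_001_labs.txt`; the layout, required document types, and function are illustrative, not part of DataFramer:

```python
from pathlib import Path

# Assumed naming convention: <patient_id>_<doc_type>.<ext>,
# e.g. patient_001_labs.txt. Purely illustrative.
REQUIRED_DOCS = {"labs", "ecg", "discharge"}

def validate_seed_folder(folder: Path) -> list[str]:
    """Return a list of problems found in one subject's seed folder."""
    problems = []
    files = [f for f in folder.iterdir() if f.is_file()]
    # All files for a subject should share one patient-ID prefix.
    prefixes = {f.stem.rsplit("_", 1)[0] for f in files}
    if len(prefixes) != 1:
        problems.append(f"inconsistent patient IDs: {sorted(prefixes)}")
    # Every document type the models must see later should be present.
    doc_types = {f.stem.rsplit("_", 1)[-1] for f in files}
    missing = REQUIRED_DOCS - doc_types
    if missing:
        problems.append(f"missing document types: {sorted(missing)}")
    return problems
```

Running a check like this on each subject folder catches broken cross-document linkage before it propagates into every synthetic subject.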
Step 2 Create specs
Purpose
Define the blueprint that controls how synthetic data is generated, including structure, properties, distributions, and clinical or business logic.
What you do
- Click Create spec on the dataset.
- Review and modify the auto-populated spec that summarizes structure, file counts per subject, content types, and detected properties such as demographics, medical history, and clinical findings.
- Configure target distributions for key properties. Some examples:
  - Increase coronary artery disease prevalence to the 40 to 50 percent range.
  - Raise diabetes prevalence.
  - Emphasize elderly female representation.
- Add or refine target dataset requirements.
  - Include medical condition as an explicit data property.
  - Require all first and last names to be unique.
  - Add additional records such as physician histories or evidence fields.
- Encode conditional relationships so the dataset behaves like real journeys. Example rule:
  - When medical condition is coronary artery disease, primary report type is stress test about 40 percent of the time and operative report about 15 percent of the time.
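A rule like this amounts to a conditional probability table. A sketch of how such a rule behaves; the table and function are illustrative, and the 45 percent "discharge summary" remainder is an assumption, since the rule only pins down the 40 and 15 percent shares:

```python
import random

# Conditional rule expressed as a probability table. The 45% remainder
# assigned to "discharge summary" is an assumption for illustration.
CONDITIONAL_RULES = {
    "coronary artery disease": {
        "stress test": 0.40,
        "operative report": 0.15,
        "discharge summary": 0.45,
    },
}
DEFAULT_TABLE = {"discharge summary": 1.0}

def sample_primary_report(condition: str, rng: random.Random) -> str:
    """Draw a primary report type conditioned on the medical condition."""
    table = CONDITIONAL_RULES.get(condition, DEFAULT_TABLE)
    return rng.choices(list(table), weights=list(table.values()))[0]
```

Conditional rules like this are what keep dependent properties coherent: the report mix shifts with the diagnosis instead of being sampled independently.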
What DataFramer does
- Analyzes the seed dataset to infer structure and candidate properties.
- Pre-populates distributions and relationships discovered in seed data.
- Validates edits to ensure constraints are consistent.
Outputs
- A saved specification that fully describes target structure, properties, distributions, and conditional logic for generating your realistic synthetic datasets.
Good practices
- Prefer conditional rules for any property that depends on another property.
- Keep distributions realistic enough to preserve utility while achieving your research goals.
Step 3 Create runs and generate synthetic patient data
Purpose
Execute the specification to produce a synthetic dataset at the desired scale.
What you do
- Click Create run from the saved specification.
- Select the spec version, choose the model, and set the number of samples to generate.
- Models can be proprietary or open source based on your environment.
- Choose whether to enable revisions.
- Revisions perform additional passes to check whether outputs meet your requirements and distributions before finalizing.
- Start the run and monitor progress.
What DataFramer does
- Applies your spec to generate new subjects with multiple files per subject.
- Preserves cross-file consistency and adheres to target distributions and conditional rules.
- Optionally performs revision cycles to improve fit to targets.
Outputs
- A generated dataset with one folder per synthetic subject. Typical contents in this healthcare demo included operative notes, ICU sheets, lab results, discharge summaries, imaging or test narratives, and patient demographics, but other file types can be added as required.
Performance notes
- Time to completion scales with sample count, model selection, and revision settings.
- Larger sample sizes converge more closely to your target distributions.
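The second note can be seen with a quick back-of-the-envelope simulation, independent of DataFramer: with plain random sampling, the gap between a target prevalence and the rate a run actually realizes shrinks roughly with the square root of the sample count.

```python
import random

def prevalence_gap(target: float, n: int, seed: int = 42) -> float:
    """Absolute gap between a target prevalence and the rate that
    n random samples actually realize (plain sampling, not DataFramer)."""
    rng = random.Random(seed)
    hits = sum(rng.random() < target for _ in range(n))
    return abs(hits / n - target)

# The gap is typically far smaller at 50,000 samples than at 50.
small_run = prevalence_gap(0.45, n=50)
large_run = prevalence_gap(0.45, n=50_000)
```

This is why a 50-sample demo run may land a few points off its targets while larger production runs track them closely.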
Reviewing and interacting with the results
DataFramer automatically evaluates the output against the targets and expectations you set.
In this case, all samples were female, 75 percent were elderly, and conditions such as type 2 diabetes, hypertension, and coronary artery disease appeared with the highest frequencies, as intended.
You can also use the chat feature to query the dataset directly, for example asking how many samples were generated or requesting a table of diseases by frequency, and receive instant structured replies.
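A query like "table of diseases by frequency" boils down to a frequency count over the generated records. The same table computed locally, over made-up samples with an illustrative record shape:

```python
from collections import Counter

# Made-up generated samples; the record shape is illustrative only.
samples = [
    {"id": 1, "conditions": ["type 2 diabetes", "hypertension"]},
    {"id": 2, "conditions": ["coronary artery disease", "hypertension"]},
    {"id": 3, "conditions": ["type 2 diabetes"]},
]

# Diseases by frequency, most common first.
freq = Counter(c for s in samples for c in s["conditions"])
for disease, count in freq.most_common():
    print(f"{disease}: {count}")
```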
The impact of synthetic data in healthcare
Synthetic data gives researchers and developers freedom to experiment, share, and iterate without risking privacy. Privacy is protected because no real identifiers are used. Datasets can be balanced to include diverse groups. Development timelines shrink from months to hours. Collaboration across institutions becomes straightforward and safe.
DataFramer turns small, limited datasets into abundant, flexible resources that fuel responsible AI development.
A future where privacy and innovation coexist
In this demo, starting from just two real patients, DataFramer created fifty complete, realistic samples in about 9 minutes. The process scales to thousands of samples, and the resulting datasets support rapid, safe AI training and evaluation.
Healthcare AI will only reach its potential when data becomes both accessible and ethical.
"We strive to start each relationship by establishing trust and building a long-term partnership. That is why we offer a complimentary dataset to all our customers to help them get started."
Ready to Get Started?
Contact our team to learn how we can help your organization develop AI systems that meet the highest standards.