How to Generate Multi-file EHR Datasets for 1,000 Patients with Exact Distributions

Generate privacy-safe synthetic EHR/EMR datasets from a few patient samples. See how DataFramer turns limited EHR data into rich medical and insurance datasets in 5 steps with the exact required distributions.

From Two Patient Samples to a Thousand in 5 Easy Steps

Puneet Anand

Wed Oct 15

Download 1000 patient records generated in this video

Download Now

Why EHR/EMR data is hard to access for healthcare AI

Healthcare innovation depends on data. Every discovery, every improvement in treatment, begins with information. Yet, real EHR/EMR data is extremely difficult to access. Privacy laws such as HIPAA and GDPR protect sensitive information, and while they are essential, they also restrict how much data researchers and AI developers can use.

Even when EHR datasets or other medical datasets become available, they’re often incomplete, de-identified, or stored in silos across departments. A researcher might have a lab report but not the corresponding imaging study or discharge summary. This fragmentation makes it hard to train machine learning models that reflect the diversity and complexity of real-world patients.

Teams spend months requesting access, cleaning EHR data, and managing compliance, only to end up with small, narrow datasets that can’t support robust AI systems. Innovation slows not because of a lack of ideas, but because usable data remains out of reach.

What counts as EHR/EMR data in a usable dataset?

When teams talk about EHR data, EMR data, or EHR datasets, they usually mean more than just a flat table. Usable healthcare and insurance medical datasets typically include:

  • Structured data: demographics, vitals, diagnosis codes, procedure codes, medications, allergies, lab panels, and problem lists.
  • Unstructured data: discharge summaries, operative notes, radiology reports, ICU notes, progress notes, and referral letters.
  • Longitudinal patient journeys: multiple encounters over time, with linked events such as admissions, follow-ups, readmissions, and procedures.
  • Multi-document patient folders: each patient has a set of files (ECG traces, stress tests, lab reports, imaging narratives, and summaries) that together form a single clinical story.
  • Standard data exports: FHIR/HL7-based exports, PDFs, text documents, CSVs, and other formats that analytics and AI systems consume in production.

A practical synthetic workflow needs to reproduce this richness so that your EHR datasets are realistic enough for model development, evaluation, and downstream analytics.

Synthea datasets vs real-seeded synthetic EHR datasets

Open-source Synthea data and Synthea datasets are widely used to experiment with healthcare AI and analytics. They’re excellent for:

  • Getting started quickly with standardized, simulated patient records.
  • Demonstrating pipelines, dashboards, and basic models.
  • Teaching or prototyping when you have no access to real-world EHR data.

However, many teams eventually find that Synthea alone isn’t enough:

  • It may not match your institution’s documentation style, templates, or clinical workflows.
  • It may not reflect your specialty mix, comorbidities, or real-world coding patterns.
  • It can be difficult to tune to your exact distributions or to mirror how your clinicians actually write notes.

DataFramer takes a complementary approach: you start from a small set of real EHR/EMR samples and use them as seeds. If you don’t have them available, you can use the “Seedless” generation feature to first craft your required structures and formats.

Side note: You drive the entire workflow through a UI or an API.

The platform then generates synthetic EHR datasets that:

  • Are tuned to your target distributions (e.g., disease prevalence, age bands, comorbidity profiles, physician notes, markers, and tests).
  • Support dependent or conditional distributions, for example a higher prevalence of diabetes among older male patients.
  • Inherit realistic structure and language from your own environment.
  • Remain privacy-safe by decoupling generated data from real identities.
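Conceptually, the conditional tuning described above boils down to making one property's distribution depend on others. A minimal sketch of the idea, with illustrative property names and rates that are assumptions for this example (not DataFramer's actual spec format):

```python
import random

# Illustrative conditional distribution: diabetes prevalence depends on age
# band and sex. All names and rates here are assumptions for the sketch.
BASE_DIABETES_RATE = 0.10
OLDER_MALE_DIABETES_RATE = 0.35

def sample_patient(rng: random.Random) -> dict:
    age_band = rng.choices(["18-39", "40-64", "65+"], weights=[0.3, 0.4, 0.3])[0]
    sex = rng.choice(["female", "male"])
    # The conditional rule: a higher diabetes rate for older male patients.
    rate = OLDER_MALE_DIABETES_RATE if (age_band == "65+" and sex == "male") else BASE_DIABETES_RATE
    return {"age_band": age_band, "sex": sex, "diabetes": rng.random() < rate}

rng = random.Random(7)
patients = [sample_patient(rng) for _ in range(10_000)]

def prevalence(group) -> float:
    group = list(group)
    return sum(p["diabetes"] for p in group) / len(group)

rate_older_males = prevalence(
    p for p in patients if p["age_band"] == "65+" and p["sex"] == "male")
rate_others = prevalence(
    p for p in patients if not (p["age_band"] == "65+" and p["sex"] == "male"))
```

Sampling a large cohort and comparing prevalence across the two groups confirms the conditional rule took effect.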

In practice, teams often use Synthea datasets for early experiments and then switch to real-seeded synthetic EHR datasets when they need higher realism and closer alignment with production data.

Synthetic data offers a practical path forward

Synthetic data is artificially generated information that preserves the structure, patterns, and statistical relationships of real datasets without exposing any real identities.

In healthcare AI, this approach helps teams safely create training and evaluation EHR datasets, EMR data extracts, and broader medical datasets that reflect genuine, real-world clinical complexity while staying compliant.

Instead of negotiating large data transfers or exposing live systems, you can work with synthetic EHR/EMR data that behaves like the real thing for modeling and analytics, but is designed to protect patient privacy and reduce regulatory friction.

Introducing DataFramer

DataFramer is a synthetic-data generation platform designed to help organizations build, test, and deploy AI systems without exposing sensitive real-world data. It lets you:

  • Turn a handful of real patient folders into large synthetic EHR datasets.
  • Generate structured and unstructured medical datasets for healthcare AI.
  • Support adjacent use cases in health insurance datasets and life insurance datasets without accessing raw production systems.

Here’s a breakdown of how it can be used, especially in domains like healthcare and insurance.

How the five steps work for EHR datasets in practice

You will follow a clear five-step workflow that mirrors the demo transcript. Here are the highlights:

Step 1: Uploading a few representative patient folders as seed data

Start with a handful of representative seed records that you upload to DataFramer. Make sure each one includes the kinds of documents your models will eventually see in production, such as stress tests, ECGs, lab results, imaging, and discharge summaries. Validate that each patient’s files share a consistent identifier so relationships across documents remain intact. These seeds define the structure and content patterns of your target EHR dataset.
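Before uploading, a quick script can catch folders whose documents do not share a consistent patient identifier. This sketch assumes a `P001_`-style filename prefix as the identifier convention, which is purely illustrative; adapt it to however your own files are named:

```python
from pathlib import Path

def check_seed_folder(folder: Path) -> list[str]:
    """Return a list of problems found in one patient's seed folder.

    Assumes each filename is prefixed with the patient ID, e.g.
    'P001_ecg.pdf', 'P001_labs.pdf' (an illustrative convention).
    """
    problems: list[str] = []
    files = sorted(p for p in folder.iterdir() if p.is_file())
    if not files:
        return [f"{folder.name}: empty folder"]
    # Every file in the folder should carry the same patient identifier.
    ids = {f.name.split("_", 1)[0] for f in files}
    if len(ids) > 1:
        problems.append(f"{folder.name}: mixed patient IDs {sorted(ids)}")
    return problems
```

Running this over each seed folder before upload keeps cross-document relationships intact in the generated dataset.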

Note: As mentioned before, if you don’t have the seed samples available, you can use the “Seedless” generation feature to first craft your required structures and formats.

Step 2: Create a spec (the blueprint)

DataFramer analyzes the seed data and builds a blueprint of the structure and properties. This spec captures the document types, expected structure, and baseline distributions that will guide synthetic generation.

Step 3: Edit the spec to match your target distributions and requirements

Now you can refine what gets generated. Require unique names, add or refine data properties, and control distributions. For example, you can increase coronary artery disease prevalence, boost diabetes prevalence, and shift the demographic mix toward female and elderly patients.

You can also define conditional rules, like generating more stress test reports when the condition is coronary artery disease.

Step 4: Run generations to create a larger synthetic dataset

Next, generate a thousand synthetic samples, one folder per patient, with new names, new histories, and realistic structure that closely follows your desired distributions and requirements, ready for safe testing, training, validation, and demos.

Step 5: Evaluate and iterate (with humans if needed)

Finally, evaluate the generated dataset, chat with your dataset, and involve human experts as needed. This makes it easy to validate whether your synthetic EHR dataset matches your targets, and to iterate until it does.

Let’s visit these steps in more detail in this demo video, where we generate 1000 realistic samples.

Detailed Walkthrough of the 5-Step Workflow

Here is a step-by-step walkthrough of DataFramer from this demo.

Prerequisites and recommended inputs

  • A small, representative set of seed files for each subject or entity. Examples in healthcare include stress tests, ECG reports, lab results, imaging, discharge summaries, and patient profiles.

    • Note: As mentioned before, if needed DataFramer can generate new samples from scratch to use as seeds. This feature is called “Seedless” generation.
  • Clear target goals for distributions and attributes to control during generation. Examples include disease prevalence, gender balance, age groups, and other medical dataset features.

Step 1: Upload EHR/EMR seed data

Purpose
Ingest and organize EHR/EMR seed samples that define the structure and context for synthetic generation.

What you do

  1. Select dataset mode. For multi-file subjects, choose multi-folder so each subject can include multiple documents.
  2. Upload the root folder or select folders for each subject.
  3. Provide a dataset name and description that reflects your EHR dataset use case (e.g., Patient history seed samples).

What DataFramer does

  • Stores relevant files.
  • Prepares the dataset for analysis and specification creation.

Tips

  • Include the document types your models must see later. DataFramer supports PDF, for example.

Step 2: Create specs for your synthetic EHR dataset

Purpose
Generate the initial blueprint that controls how synthetic data will be generated, including structure, properties, baseline, and even conditional distributions inferred from your seed data.

What you do

  1. Click “Create spec” on the dataset.
  2. Review the auto-populated spec that summarizes structure, file counts per subject, content types, and detected properties such as demographics, medical history, and clinical findings.

What DataFramer does

  • Analyzes the seed dataset to infer structure and candidate properties.
  • Pre-populates distributions and relationships discovered in seed data.
  • If indicated by the user, DataFramer also expands the existing set of properties and their possible values.

Outputs

  • An initial specification that describes structure and baseline properties, ready for refinement.

Step 3: Edit the spec to control properties, requirements, and distributions

Purpose
Refine the blueprint so the generated synthetic dataset matches your target populations, document patterns, and clinical logic.

What you do

  1. Configure target distributions for key properties. Some examples:

    • Increase coronary artery disease prevalence.
    • Create a new medical disease property value for diabetes and raise its prevalence.
    • Emphasize elderly female representation.
  2. Add or refine target dataset requirements.

    • Include medical condition as an explicit data property.
    • Require all first and last names to be unique.
    • Add additional records such as physician histories or evidence fields.
  3. Encode conditional relationships so the dataset behaves like real journeys. Example rule:

    • When medical condition is coronary artery disease, primary report type is stress test about 40 percent of the time and operative report about 15 percent of the time.
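The example rule above is a conditional categorical distribution over report types. A minimal sketch of how such a rule behaves, where the 40 percent and 15 percent weights come from the rule itself and the remaining categories and weights are illustrative assumptions:

```python
import random

# Conditional distribution of primary report type given the medical condition.
# The 0.40 / 0.15 weights come from the example rule; the split of the
# remaining 45% across other report types is an illustrative assumption.
REPORT_TYPES_GIVEN_CONDITION = {
    "coronary artery disease": (
        ["stress test", "operative report", "discharge summary", "lab report"],
        [0.40, 0.15, 0.25, 0.20],
    ),
}
DEFAULT_REPORT_TYPES = (["discharge summary", "lab report", "stress test"],
                        [0.5, 0.3, 0.2])

def sample_report_type(condition: str, rng: random.Random) -> str:
    types, weights = REPORT_TYPES_GIVEN_CONDITION.get(condition, DEFAULT_REPORT_TYPES)
    return rng.choices(types, weights=weights)[0]

rng = random.Random(0)
draws = [sample_report_type("coronary artery disease", rng) for _ in range(10_000)]
stress_share = draws.count("stress test") / len(draws)
operative_share = draws.count("operative report") / len(draws)
```

Over many samples, the observed shares of stress tests and operative reports settle near the configured 40 and 15 percent targets.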

Read more: Base Distributions, Conditional Distributions

What DataFramer does

  • Validates edits to ensure constraints are consistent.
  • Updates the spec so the generated dataset adheres to your requirements and logical relationships.

Outputs

  • A finalized specification that fully describes the target structure, properties, distributions, and conditional logic for generating your realistic synthetic EHR datasets and other medical datasets.

Good practices

  • Prefer conditional rules for any property that depends on another property.
  • Keep distributions realistic enough to preserve utility while achieving your research goals.

Step 4: Create runs and generate synthetic EHR datasets

Purpose
Execute a run with the specification to produce a synthetic EHR dataset at the desired scale.

What you do

  1. Click Create run from the saved specification.
  2. Select the spec version, choose the model, and set the number of samples to generate.
    • Models can be proprietary or open source based on your environment.
  3. Choose whether to enable revisions.
    • Revisions perform additional passes to check whether outputs meet your requirements and distributions before finalizing.
  4. Start the run and monitor progress.

What DataFramer does

  • Applies your spec to generate new patients with multiple files per patient.
  • Preserves cross-file consistency and adheres to target distributions and conditional rules.
  • Performs revision cycles if desired to improve fit to targets.

Outputs

  • A generated EHR dataset with one folder per synthetic subject. Typical contents in this healthcare demo included operative notes, ICU sheets, lab results, discharge summaries, imaging or test narratives, and patient demographics; other files, such as insurance applications or submissions, can be added as required.

Performance notes

  • Time to completion scales with sample count, model selection, and revision settings.
  • Larger sample sizes converge more closely to your target distributions.
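The second note is a standard statistical effect: the gap between a target prevalence and the prevalence observed in a generated sample shrinks as the sample count grows. A quick simulation with purely illustrative numbers:

```python
import random

TARGET_PREVALENCE = 0.30  # illustrative target prevalence for some condition

rng = random.Random(42)

def empirical_prevalence(n: int) -> float:
    """Prevalence observed in one synthetic draw of n patients."""
    return sum(rng.random() < TARGET_PREVALENCE for _ in range(n)) / n

def mean_abs_error(n: int, trials: int = 200) -> float:
    """Average gap between observed and target prevalence over many draws."""
    return sum(abs(empirical_prevalence(n) - TARGET_PREVALENCE)
               for _ in range(trials)) / trials

err_small = mean_abs_error(100)     # small cohorts wander further from target
err_large = mean_abs_error(10_000)  # large cohorts track the target closely
```

The error at 10,000 samples is roughly an order of magnitude smaller than at 100, which is why larger runs converge more closely to the spec's target distributions.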

Step 5: Evaluate the generated dataset and iterate

DataFramer automatically evaluates the output against the targets and expectations set by you.

For example, if your spec requires it, the evaluation can confirm that all samples are female, that 75 percent are elderly, and that conditions such as type 2 diabetes, hypertension, and coronary artery disease appear with the highest frequencies, as intended.

You can also use the chat feature to query the dataset directly, for example asking how many samples were generated or requesting a table of diseases by frequency, and receive instant structured replies. This makes it easier to validate whether your synthetic EHR dataset matches your clinical or business hypotheses.
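Outside the chat interface, the same disease-by-frequency table can be computed directly from the generated records. This sketch assumes each record has been parsed into a dict with a `condition` field, which is an assumption about the parsed shape, not a DataFramer export format:

```python
from collections import Counter

# Hypothetical parsed records; in practice these would come from the
# per-patient folders the generation run produced.
records = [
    {"patient": "P001", "condition": "type 2 diabetes"},
    {"patient": "P002", "condition": "hypertension"},
    {"patient": "P003", "condition": "type 2 diabetes"},
    {"patient": "P004", "condition": "coronary artery disease"},
]

# Tally conditions and list them from most to least frequent.
freq = Counter(r["condition"] for r in records)
for condition, count in freq.most_common():
    print(f"{condition}: {count}")
```

A check like this is a useful cross-validation of what the chat interface reports.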

You can also involve human experts as needed to review and annotate outputs for realism, consistency, and safety, and then iterate on the spec and rerun generations until the dataset meets your standards.

Using synthetic health datasets for insurance AI and Analytics

Beyond hospital and research settings, synthetic medical datasets are increasingly valuable for insurers:

  • Health insurance datasets

    • Simulate claims-like records based on synthetic EHR/EMR journeys.
    • Model utilization patterns, chronic disease burden, and cost drivers.
    • Test care management, risk adjustment, and network design strategies without exposing member PHI.
  • Life insurance datasets

    • Generate synthetic underwriting-style summaries that incorporate comorbidities, risk factors, and lifestyle indicators derived from clinical context.
    • Explore how changes in age, condition prevalence, or treatment adherence affect mortality and morbidity assumptions.
    • Share synthetic life insurance datasets across actuarial, underwriting, and data science teams to prototype new products and risk models.

Because DataFramer starts from a small, well-governed seed of EHR/EMR data, it becomes possible to create realistic, privacy-safe health insurance datasets and life insurance datasets that still behave like real populations.

The impact of synthetic data in healthcare

Synthetic data gives researchers and developers freedom to experiment, share, and iterate without risking privacy. Privacy is protected because no real identifiers are used. Datasets can be balanced to include diverse groups. Development timelines shrink from months to hours. Collaboration across institutions becomes straightforward and safe.

DataFramer turns small, limited EHR/EMR datasets into abundant, flexible resources that fuel responsible AI development across both healthcare and insurance.

FAQ: EHR datasets, EMR data, Synthea, and insurance use cases

What is the difference between EHR data and EMR data?

EHR data usually refers to a longitudinal view of a patient’s health across multiple encounters and care settings, while EMR data often refers to the digital chart within a single organization or encounter. In practice, most AI teams work with both, and DataFramer can generate synthetic versions of either as multi-file patient folders.

What are EHR datasets used for in machine learning?

EHR datasets and other medical datasets are used to:

  • Train and evaluate prediction models (readmission, mortality, length of stay, risk scores).
  • Power clinical decision support tools.
  • Build phenotyping, cohort selection, and trial-matching systems.
  • Support downstream analytics for providers, payers, and life sciences.

Synthetic datasets let you do this work without exposing production systems.

How does synthetic EHR data compare to Synthea datasets?

Synthea datasets are open and standardized, making them ideal for early experimentation and teaching. Real-seeded synthetic EHR datasets generated with DataFramer:

  • Are tailored to your specialty mix and workflows.
  • Use language and formatting closer to your real documentation.
  • Let you tune distributions and rules to match your target population.

Many teams combine both: Synthea for quick demos, and real-seeded synthetic EHR data for serious model development.

How can synthetic EHR data be used inside healthcare organizations?

Healthcare organizations can use synthetic EHR/EMR data to:

  • Prototype and validate new AI tools in a safe sandbox before touching live records.
  • Share realistic datasets with vendors, startups, and research partners without moving PHI, in the desired formats like FHIR.
  • Run quality-improvement and operations experiments (e.g., capacity planning, triage flows) on realistic but de-identified journeys.
  • Train clinicians, analysts, and data science teams on lifelike cases without compliance hurdles.

Because the data is synthetic, these use cases become much easier to approve and govern.

Can I generate datasets for underwriting or insurance risk modeling?

Yes. By starting from carefully governed clinical seeds, you can create synthetic health insurance datasets and life insurance datasets that:

  • Capture realistic condition combinations, treatments, and outcomes.
  • Support underwriting, pricing, and product analytics.
  • Stay privacy-safe because no real policyholder or patient identities are exposed.

How large should an EHR dataset be for model evaluation?

It depends on the task, but in general:

  • Hundreds of samples can be enough for exploratory models.
  • Thousands to tens of thousands of samples are often used for robust evaluation.
  • With synthetic data, you can scale your EHR datasets to these sizes and beyond while still anchoring them in a small, carefully curated real-seed cohort.

A future where privacy and innovation coexist

In this demo, starting from just two real patients, DataFramer created 1000 complete, realistic samples. This process can be scaled to thousands, and the resulting EHR datasets and medical datasets can serve rapid, safe AI training and evaluation.

Healthcare AI will only reach its potential when data becomes both accessible and ethical. Synthetic EHR/EMR data, whether used by hospitals, health insurers, or life insurers, offers a practical way to get there.

"We strive to start each relationship with establishing trust and building a long-term partnership. That is why we offer a complimentary dataset to all our customers to help them get started."

Puneet Anand, CEO

DataFramer

Ready to Get Started?

Contact our team to learn how we can help your organization develop AI systems that meet the highest standards.

Book a Meeting