
The Appeal and Real-Life Consequences of Applying Synthetic Data to Sensitive Clinical Data

August 5, 2020

Even before the COVID-19 crisis, health systems, medical researchers, and medical institutions grappled with efficient ways of gathering patient data while maintaining patient privacy.

When researching for health innovation or crisis management, healthcare institutions must extract data from a multitude of systems. Answering questions about trends in chronic conditions, the viability of a treatment in a community, the utilization rates of certain procedures, or the rising costs of health care—all of these scenarios require collecting, analyzing, and sharing patient and population data.

Unfortunately, that process carries the risk of data breaches, requires navigating industry privacy regulations, depends on healthcare IT specialists, and consumes precious time. On top of that, compiling and researching patient data means working through massive troves of data that may sit in siloed systems or be frustratingly dispersed across differing archives.


Usage of patient data in clinical research

Most of the time, medical researchers must submit data requests just to access individual and population patient data. It takes time to request and receive data pulls, and even more time and skill to read and manipulate the data once it arrives. The medical professional, researcher, or institution must also write very specific queries, which often need supplemental queries for clarification. The cherry on top? All identifying patient information must be redacted due to its sensitive nature. Failing to remove every identifying attribute compromises patient security and confidentiality, and it goes directly against regulations such as the Health Insurance Portability and Accountability Act (HIPAA), the Health Information Technology for Economic and Clinical Health Act (HITECH), and the General Data Protection Regulation (GDPR).

Electronic health records (EHRs) are now digitized, but the progress that improved the storage of and access to a patient’s health records didn’t necessarily translate to a convergence of those records. The transition from legacy healthcare systems to more nimble, cloud-based systems didn’t immediately erase clunky workflows in clinical communication and collaboration. More than likely, health systems must now contend with duplicate data that must be cleaned and access controls that must be determined on a case-by-case, title-by-title basis.

All of this explains why advancements in healthcare solutions, digital health, and patient satisfaction haven’t necessarily resulted in a complete and efficient transformation of the healthcare industry. This is a global problem. The U.S. healthcare system is notorious for being inefficient, but the worldwide COVID-19 pandemic has made it clear that data sharing, resource pooling, and research collaboration are challenges everywhere.

How do we fix this? How can we truly understand and learn from gaps in care and medical research so that we can protect everyone on the planet and possibly prevent another pandemic like COVID-19?

Synthetic data offers a compelling solution.


Synthetic data in health care

AI Multiple’s guide to synthetic data describes its usefulness in cases where strict privacy requirements limit data availability, the costs of real-life product testing restrict what can be attempted, or machine learning models need training data quickly. Synthetic data offers statistically comparable datasets in a quicker, safer setting, allowing companies, institutions, and organizations to become more nimble, innovative, and effective.

Its application in the healthcare industry holds intriguing potential. However much information medical professionals enter and access, all patient information is sensitive and requires protection and de-identification before it can be used for any research purposes. The healthcare application of synthetic data allows medical researchers to create and consult those statistically comparable datasets on fictional patients.
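To make “statistically comparable” concrete, here is a minimal sketch in Python. It assumes the simplest possible generator: fit a normal distribution to one numeric column and sample new values from it. The ages are invented for illustration, the function names are hypothetical, and real synthetic data tools model many columns and the relationships between them, so treat this as an illustration of the idea rather than how any particular product works.

```python
import random
import statistics

# Hypothetical "real" measurements, invented for this sketch (ages in years).
real_ages = [34, 45, 52, 61, 29, 73, 48, 55, 67, 41]


def fit_gaussian(values):
    """Return the mean and sample standard deviation of the source values."""
    return statistics.mean(values), statistics.stdev(values)


def sample_synthetic(values, n, seed=0):
    """Draw n synthetic values from a normal distribution fitted to the source.

    Assumption: the column is roughly bell-shaped. No synthetic value is
    copied from the source; only the fitted mean and spread are reused.
    """
    mu, sigma = fit_gaussian(values)
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [rng.gauss(mu, sigma) for _ in range(n)]


synthetic_ages = sample_synthetic(real_ages, n=1000)

# Compare summary statistics of the real and synthetic columns.
print("mean :", statistics.mean(real_ages), round(statistics.mean(synthetic_ages), 1))
print("stdev:", round(statistics.stdev(real_ages), 1), round(statistics.stdev(synthetic_ages), 1))
```

The synthetic column can be far larger than the source while keeping a similar mean and spread, which is the property researchers rely on when they analyze fictional patients instead of real ones.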

To be clear, these datasets are not wild shots in the dark. “Fictional patients” means unattributable patient data: records stripped of all patient and demographic identifiers. The University of Copenhagen nicely sums up the attributes of these fictional patients:

[Figure: attributes of fictional patients, via University of Copenhagen]

In a nutshell, synthetic health data adds to the scope of existing or “real” data, circumventing the issue of too little data availability.

Protecting patient identity is paramount. However, that stringent protection causes breakdowns in clinical data and clinical research workflows. For example, when a clinical care coordinator contacts hospital administrators for patient documentation, they must fax in forms, follow up with administrators over the phone, and manually input data. This is the procedure for every single patient. Clinical care coordinators must also take care to not request information too early because shared documents have a short lifespan. That is just one scenario that is already rife with bottlenecks.

Now apply that bumpy workflow to clinical researchers or pharmaceutical drug developers, who are trying to make predictions, identify trends, and determine population health initiatives on a larger scale. Sure, larger health systems may have larger databases (or data lakes) to hold all of their patients’ information, but these databases are not structured in a one-to-one way. A patient’s medical record can exist separately from their records of procedures, referrals, and ancillary care history. A patient’s medical data can even exist separately between different entities of the same company. Effectively, this results in data scarcity.


As the youths would say, de-identification walked so that synthetic health data could run. De-identification of patient data is the censoring or removal of identifiable patient attributes for the purposes of population health research. The difference between de-identification and synthetic health data is that the latter is completely removed from patient information. Synthetic data contains zero personal data. In addition, intelligent patient data generators (iPDGs) and EHR generators can be utilized to generate synthetic patient records regardless of the amount of bulk patient data stored in a hospital’s admin system.
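The de-identification half of that comparison can be sketched in a few lines of Python. This is a simplified illustration, assuming a flat record and a hypothetical list of identifier field names; real de-identification rules (for example under HIPAA) cover many more identifier types and also handle dates, locations, and free text.

```python
# Field names are hypothetical; real systems define identifiers per regulation.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "medical_record_number"}


def deidentify(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of the record with direct identifier fields removed."""
    return {key: value for key, value in record.items() if key not in identifiers}


# An invented patient record, not real data.
patient = {
    "name": "Jane Example",
    "medical_record_number": "MRN-0001",
    "birth_year": 1975,
    "diagnosis": "type 2 diabetes",
}

deidentified = deidentify(patient)
print(deidentified)  # {'birth_year': 1975, 'diagnosis': 'type 2 diabetes'}
```

Note that the remaining values still come from a real person, which is exactly the gap synthetic data closes: a synthetic record keeps the same kinds of fields but none of its values are taken from an individual patient.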

There’s also the amazingly acronymized FHIR. Fast Healthcare Interoperability Resources, more commonly referred to as FHIR, helped pave the way in terms of data collection and sharing. FHIR is an HL7 standard for exchanging healthcare information electronically. It improves health information exchange (HIE) and data interoperability by defining common, structured formats, called resources, for clinical data. That shared structure supports clinical communication and collaboration by making clinical data easier to tag and organize within a healthcare organization’s data system.
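For a sense of what a FHIR resource looks like, the Python sketch below builds a minimal Patient resource for a fictional person and serializes it to JSON. The fields shown (resourceType, id, name, gender, birthDate) are part of the FHIR Patient resource; the values and the id are invented for this example, and a production record would carry many more elements.

```python
import json

# A minimal FHIR-style Patient resource for a fictional person.
# All values are invented for illustration.
patient_resource = {
    "resourceType": "Patient",
    "id": "example-synthetic-1",
    "name": [{"family": "Example", "given": ["Jane"]}],
    "gender": "female",
    "birthDate": "1975-04-12",
}

# Serialize to JSON, one of the formats FHIR uses for exchange.
payload = json.dumps(patient_resource, indent=2)
print(payload)

# A receiving system parses the same structure back out.
restored = json.loads(payload)
```

Because every system that follows the standard expects the same field names and shapes, a synthetic patient generator that emits resources like this one can feed test data into tools built for real records.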


Robert Lieberthal, health economics principal at The MITRE Corporation, believes that “Synthetic data is a solution to many of the problems that plague our health IT system…In a way, synthetic data represents current health IT standards while also incorporating the best of what health IT could be.”

Once a synthetic data solution is integrated with a healthcare organization’s databases, it ingests the data points, automates data de-duplication and cleaning, captures statistical insights and relationships between data points, and facilitates data sharing, delivery, and modeling.

Again, because synthetic data does not contain protected health information, the generated artificial data can be shared between medical and clinical researchers and scientists. They are no longer constrained to redacted patient information that may or may not adhere to healthcare compliance guidelines when developing new health strategies, payment initiatives, health policies, and digital health tools.

Concerns of utilizing synthetic data


While the benefits of generating and applying synthetic data to health care are clear, the technology is still in the early stages of adoption and implementation. Detractors of synthetic data do exist, and for good reason, as with any solution that relies on machine learning and automation and still needs honing.

There are limitations to synthetic data in a healthcare setting, and all stakeholders who want to leverage synthetic data must be aware of them.
  • Variance: Patients are human and are therefore made up of variances and complexities that cannot necessarily be predicted or replicated by synthetic data. Artificially generated health data may only be able to simulate general or “average” trends in general clinical applications.
  • “Real” (observational) data validity: Synthetic data, by definition, is not an exact replica of patient data. While synthetic data can be manipulated to fit whatever scenario a researcher uses it for, it is still rooted in an initial real-life dataset. In other words, results drawn from synthetic data remain provisional until they are validated against real, observational data, which can also improve the predictions made from the artificially generated data. Additionally, the quality of the source data can significantly affect the quality of the generated synthetic health data.
  • User acceptance and widespread application: Synthetic data software prides itself on randomizing real-life data into unrecognizable and unattributable data points. However, there is still a chance (possibly a one-in-a-million chance) that a randomized data point matches an actual data point. The benefits of synthetic data have not yet been experienced by everyone in the world, and these open questions may, in fact, turn away researchers or governments who doubt the accuracy or validity of predictions based on artificial data.
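
The chance that a generated record coincides with a real one can at least be checked before a dataset is shared. The Python sketch below assumes flat records stored as dictionaries with hashable values and only looks for exact matches; the function name and sample data are hypothetical, and a real privacy review would also look at near matches and rare combinations of attributes.

```python
def exact_matches(real_records, synthetic_records):
    """Return the synthetic records that are identical to some real record."""
    # Convert each record to a sorted tuple of (field, value) pairs so that
    # field order does not matter and records can be stored in a set.
    real_set = {tuple(sorted(record.items())) for record in real_records}
    return [
        record
        for record in synthetic_records
        if tuple(sorted(record.items())) in real_set
    ]


# Invented example data, not real patients.
real = [
    {"birth_year": 1975, "sex": "F", "diagnosis": "type 2 diabetes"},
    {"birth_year": 1988, "sex": "M", "diagnosis": "asthma"},
]
synthetic = [
    {"birth_year": 1979, "sex": "F", "diagnosis": "type 2 diabetes"},
    {"sex": "M", "diagnosis": "asthma", "birth_year": 1988},  # collides with a real row
]

leaks = exact_matches(real, synthetic)
print(len(leaks), "synthetic record(s) match a real record")  # 1 synthetic record(s) ...
```

Records flagged this way can be dropped or regenerated before release, which gives skeptical reviewers something concrete to audit.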

Players in synthesized health care data

Synthetic data, and particularly synthetic health data, is a relatively new field of research. Correspondingly, the following list of synthetic health data players is short, but it will grow as this healthcare technology becomes more widely accepted and improved upon.


MDClone is an Israel-based health IT vendor with the mission of easing access to health data and improving overall methods of health research and activity. MDClone’s platform intends to democratize data across the healthcare ecosystem by enabling the broad use of data that resides inside health systems.  


Synthea is an open-source synthetic patient data generator that can be used to model the medical histories of synthetic patients. Synthea’s free data lake enables health data research across the healthcare industry while adhering to privacy and security restrictions.


Statice has developed privacy-compliant data anonymization solutions that can be used by businesses and organizations across all industries. Statice enables healthcare institutions to work faster, safer, and in compliance, while furthering research, development, and delivery of patient care.


Consulting firm Lynkeus led the European Union-funded MyHealthMyData (MHMD) project, which set out to prove, and succeeded in proving, the validity and usefulness of making anonymized (read: synthetic) data available for open research.


IQVIA, which calls itself the Human Data Science Company, collaborated with biopharma research company AstraZeneca to develop the synthetic database Simulacrum. Simulacrum consists solely of artificial (read: synthetic) data and is used to conduct research and perform analyses on population cancer care.

Way forward

The potential impact of creating and utilizing synthetic data to improve clinical research and health strategies is huge. As with most things, it takes time for an industry to reap the benefits from a new kind of technology or process before everyone gets on board. However, during a worldwide health crisis, we’re short on time and resources. Both the regional and global medical communities must take cues from the current leaders in synthetic health data to transform how they share and protect patient data, encourage clinical collaboration, and instigate necessary change in their approach to creating and improving health plans, strategies, and initiatives.
