Synthetic data lets developers work with artificial patient data that is generated with statistical realism and is free from sensitive protected health information.
Synthetic data may be AI-generated, but its popularity in healthcare is the real deal. I’d like to build on a recent article explaining synthetic data from The Medical Futurist, a research institute specializing in digital health, and talk about how we use synthetic data at Particle. It’s easy to see that synthetic data has found its place in health tech development.
“The biggest obstacle to A.I. is the inadequacy of the available data. Without patient data, there is no A.I. in healthcare.” - The Medical Futurist
Synthetic data provides researchers and developers with artificial patient data that is generated with statistical realism but contains no protected health information, such as personal identifiers.
This gives medical researchers and developers quality data for training models, or for building and testing medical applications, without violating medical privacy laws such as HIPAA.
With the availability of this synthetic data, researchers can develop applications to support clinical decision making. For instance, synthetic data can help create a diabetes management application without the concern of privacy violations, or the time and monetary resources associated with working with sensitive data. Synthetic data is already living up to its promise of replacing sensitive information with a usable alternative.
Because the inadequacy of available real-world data impedes progress in health tech, the improvement and wider availability of synthetic data should be considered a major milestone for the whole healthcare industry: it will speed up progress across the sector.
“That is where synthetic data could be of help. It can fill in the missing data, making it possible to produce entirely fabricated patient datasets that are just as useful for training A.I. as the real thing, while keeping patient data protected.”
Particle Health is taking steps to adopt synthetic medical records and, specifically, to make them available in both C-CDA and FHIR formats. This allows players in the health tech space to readily access synthetic data and develop models and applications on top of it.
Synthetic data is being used in a number of areas in medicine. The Medical Futurist provides a few examples of applying synthetic medical images to train AI models. In one example, developers leverage synthetic images to improve models that support pathologists’ decisions when diagnosing brain tumors. In another, the unlimited volume and diversity of synthetic data could potentially help remove bias in detecting skin cancer.
Beyond synthetic medical imaging, synthetic medical records are also being used to assist in the development and testing of models and applications while protecting patient data. One example is Synthea, an open-source tool that generates synthetic medical records for whole populations of fictional patients. The records are generated using statistical distributions of medical occurrences (such as conditions, prescriptions, and labs), resulting in realistic records for a population, available as C-CDA or FHIR output!
Particle leverages Synthea to create synthetic records that are made available in our Sandbox API. Our customers actively use the synthetic C-CDA and FHIR medical records in our Sandbox to develop and test their applications without using protected patient information. These synthetic records mimic real-world patient data, allowing our customers to move their applications from our Sandbox API to real patient information seamlessly.
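To make this a little more concrete, here is a minimal sketch of how an application might read a synthetic FHIR record during development. The bundle below is hand-written for illustration and is not actual Sandbox output; the helper names (`resources_by_type`, `condition_names`) are hypothetical, and the only thing assumed about FHIR is the standard Bundle layout, where each item in `entry` wraps a `resource` with a `resourceType`.

```python
import json

# Hand-written, minimal FHIR-style Bundle for illustration only.
# It is NOT real Sandbox output and describes a fictional patient.
SAMPLE_BUNDLE = json.loads("""
{
  "resourceType": "Bundle",
  "type": "collection",
  "entry": [
    {"resource": {"resourceType": "Patient", "id": "example-1",
                  "name": [{"family": "Doe", "given": ["Jane"]}],
                  "birthDate": "1980-04-12"}},
    {"resource": {"resourceType": "Condition", "id": "cond-1",
                  "subject": {"reference": "Patient/example-1"},
                  "code": {"text": "Type 2 diabetes mellitus"}}},
    {"resource": {"resourceType": "MedicationRequest", "id": "med-1",
                  "status": "active", "intent": "order",
                  "subject": {"reference": "Patient/example-1"},
                  "medicationCodeableConcept": {"text": "Metformin 500 mg"}}}
  ]
}
""")


def resources_by_type(bundle):
    """Group the resources inside a FHIR Bundle by their resourceType."""
    grouped = {}
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        resource_type = resource.get("resourceType", "Unknown")
        grouped.setdefault(resource_type, []).append(resource)
    return grouped


def condition_names(bundle):
    """Return the human-readable text of every Condition in the bundle."""
    conditions = resources_by_type(bundle).get("Condition", [])
    return [c.get("code", {}).get("text", "") for c in conditions]


if __name__ == "__main__":
    print(sorted(resources_by_type(SAMPLE_BUNDLE)))
    print(condition_names(SAMPLE_BUNDLE))
```

Because the code depends only on the shape of the record and not on who the patient is, the same parsing logic can be exercised on synthetic data first and pointed at real records later.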
Our customers are developing a wide range of applications on top of synthetic data in our Sandbox. Synthetic patient data has accelerated their development of clinical decision support tools. We’ve also seen teams create tools for population health management, covering areas such as risk scoring and medication adherence.
"Based on existing data, the algorithm attempts to generate data that is somewhat different from the original, but not so much as to lead to a false result. So it IS fake – but it isn’t … by painting more and more of the fake ones, the painter is getting gradually better and better at creating fakes. At the same time, while going after him, the detective is also getting better at recognising those works of art that are replicas. They both keep trying to beat each other and, after many iterations, the painter creates images indistinguishable from a real Picasso. This was the goal of the whole experiment with machine learning.”
The Medical Futurist compares a painter creating indistinguishable copies of a Picasso to a GAN (generative adversarial network) that creates synthetic medical images of birthmarks, which are then used to better train an algorithm that detects melanoma or other skin conditions.
GANs work by learning the probability distribution of real-world data and emulating that distribution when generating synthetic output. In the case of synthetic medical records generated by Synthea, real-world statistical distributions (of medical conditions or prescriptions, for example) are used to create synthetic populations with the same distributional characteristics. This modeling of real-world distributions is what gives synthetic data its robustness.
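Here is a toy sketch of the underlying idea: estimate a distribution from observed data, then sample a synthetic population that follows it. To be clear, this is neither a GAN nor Synthea's actual implementation; it is simple frequency-based sampling, the function names are hypothetical, and the "real" counts are made-up numbers, not actual prevalence figures.

```python
import random
from collections import Counter


def fit_distribution(observed):
    """Estimate category probabilities from a list of observed labels."""
    counts = Counter(observed)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def sample_population(distribution, size, seed=0):
    """Draw a synthetic population whose labels follow `distribution`."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    labels = sorted(distribution)
    weights = [distribution[label] for label in labels]
    return rng.choices(labels, weights=weights, k=size)


# Toy stand-in for "real" data. These counts are invented for illustration.
real = ["none"] * 700 + ["hypertension"] * 200 + ["diabetes"] * 100

dist = fit_distribution(real)
synthetic = sample_population(dist, size=10_000, seed=42)
synth_dist = fit_distribution(synthetic)

if __name__ == "__main__":
    # Compare the learned proportions with those of the synthetic population.
    for label in sorted(dist):
        print(label, round(dist[label], 3), round(synth_dist[label], 3))
```

No synthetic individual corresponds to any original one, yet the population as a whole keeps roughly the same proportions, which is the property the paragraph above describes.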
There is a lot in the works when it comes to the healthcare industry improving upon synthetic data! New models are constantly being developed for specific applications. The Medical Futurist points out a few examples of GAN models being developed to create various forms of improved synthetic medical images. They also describe an example of using synthetic datasets to represent diverse populations, using technology to reduce bias in medical applications.
Although synthetic data is based on real-world distributions and is already very robust, statistics do not always model reality perfectly. Larger samples of real-world data can better inform the distributions behind synthetic data. This is why capturing as many samples as possible, along with their diversity and variance, is so important when modeling, and why those in the industry are striving to continually improve upon this body of work.
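A quick simulation illustrates why sample size matters. In this sketch we assume a hypothetical true prevalence of 10% for some condition (a made-up figure), repeatedly estimate it from samples of different sizes, and report the average estimation error. The function names are illustrative only.

```python
import random


def estimate_rate(true_rate, sample_size, rng):
    """Estimate a prevalence from one random sample of `sample_size` people."""
    hits = sum(1 for _ in range(sample_size) if rng.random() < true_rate)
    return hits / sample_size


def mean_abs_error(true_rate, sample_size, trials=200, seed=0):
    """Average absolute estimation error over repeated random samples."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    errors = [
        abs(estimate_rate(true_rate, sample_size, rng) - true_rate)
        for _ in range(trials)
    ]
    return sum(errors) / trials


TRUE_RATE = 0.10  # hypothetical prevalence, chosen only for illustration

if __name__ == "__main__":
    for n in (50, 500, 5000):
        print(n, round(mean_abs_error(TRUE_RATE, n), 4))
```

The error shrinks as the sample grows, so a distribution estimated from more real-world records gives a more faithful starting point for generating synthetic ones.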
Synthea is also continually revised by open-source contributors, who refine its disease modules and statistical distributions to further improve realism.
The synthetic data user base is large enough to support industry competitions and hackathons, which bring together community members to improve upon synthetic medical data. For example, the Office of the National Coordinator for Health Information Technology (ONC) hosted a national synthetic health data challenge this year, with a $100,000 prize pool and the goal of improving upon Synthea and its use cases.
I was delighted to submit Particle Health’s entry in ONC’s competition this year (we placed third!). Our entry contributed new document types to Synthea’s C-CDA output. We open-sourced this work so that anyone can generate national network quality C-CDA documents using Synthea.
There is ample opportunity to improve upon synthetic data. However, the applications that it can support at the moment are already extensive.
Synthetic data enables innovation by making the data needed for development of medical technology more accessible. Particle Health plays a role in this process by adopting the newest and most robust forms of synthetic medical records, and making them available for use via our Sandbox API, which you can try out today.