Open-Sourcing Synthetic Clinical Data for Innovation

Introducing Particle Health's open-source repository of synthetic patients in CCDA format. This dataset includes five diverse, clinically relevant patient profiles ideal for demos, development, and testing—making clinical data more accessible for all.

Today, Particle Health is open-sourcing a GitHub repository of synthetic clinical data in CCDA format.

At Particle, we specialize in working with CCDA XML files to extract critical clinical data, empowering our customers and ultimately benefiting patients. While CCDA is an older format, it remains the primary standard for sharing clinical data across health information networks.

A common challenge when working with CCDA files is finding synthetic data that makes the patient’s medical conditions clear and shows where that information sits within the document. To address this, we collaborated with clinical experts to create five synthetic patients in the CCDA format. These datasets are designed to showcase a variety of clinical use cases and cover multiple disease areas. Because no real patient information is involved, synthetic CCDA data removes privacy concerns, so organizations can innovate and develop healthcare solutions without violating HIPAA or other privacy regulations. Synthetic data also eliminates the need for anonymization, which can be costly and time-consuming when dealing with real data.
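To give a sense of what locating information within a CCDA document can look like, here is a minimal Python sketch that lists the sections of a document by their LOINC code and title. The XML fragment is a tiny hand-written illustration, not a file from the repository, and the names `SAMPLE` and `list_sections` are hypothetical. It assumes the usual CCDA layout, in which sections sit under `structuredBody` in the HL7 v3 namespace.

```python
import xml.etree.ElementTree as ET

# CCDA documents use the HL7 v3 XML namespace.
NS = {"hl7": "urn:hl7-org:v3"}

# Hand-written fragment for illustration only (assumption: not taken
# from the repository). Real CCDA files are far larger.
SAMPLE = """<ClinicalDocument xmlns="urn:hl7-org:v3">
  <component>
    <structuredBody>
      <component>
        <section>
          <code code="11450-4" codeSystem="2.16.840.1.113883.6.1"/>
          <title>Problems</title>
        </section>
      </component>
      <component>
        <section>
          <code code="10160-0" codeSystem="2.16.840.1.113883.6.1"/>
          <title>Medications</title>
        </section>
      </component>
    </structuredBody>
  </component>
</ClinicalDocument>"""


def list_sections(xml_text):
    """Return (LOINC code, title) pairs for each section in the body."""
    root = ET.fromstring(xml_text)
    path = "hl7:component/hl7:structuredBody/hl7:component/hl7:section"
    result = []
    for section in root.findall(path, NS):
        code = section.find("hl7:code", NS)
        title = section.find("hl7:title", NS)
        result.append((
            code.get("code") if code is not None else None,
            title.text if title is not None else None,
        ))
    return result


for code, title in list_sections(SAMPLE):
    print(code, title)
```

To try this on a real file, read its contents into a string and pass that to `list_sections`. Section codes and nesting vary between documents, so treat this as a starting point rather than a complete parser.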

We initially developed this dataset for internal use during a hackathon a few months ago, and it has proven valuable across several areas: 

  • Customer Use Cases: Identifying clinical use cases that our customers can address.
  • Demos: Showcasing our products with clinically relevant data, now available in our sandbox environment. 
  • Feedback: Gathering input on new features, like AI-powered summarization.
  • Development: Supporting software testing and refinement.  

Given how useful it has been for us, we believe it could be equally beneficial to others. 

The repository contains CCDA XML files for five synthetic patients, each representing different clinical conditions, along with AI-generated summaries of their medical histories. To make them easier to remember, the patients are named mnemonically.

Here is the link to the repository. The data is licensed under the Creative Commons Attribution-ShareAlike license.