Particle’s engineers explain how we take steps to munge, or standardize, data for the benefit of our users.
When Particle Health finds a clinical document, we get ready to munge it! Data munging, which entails transforming data to rid it of noise and formatting errors, is an issue as old as computing itself.
The sheer variety of systems and conventions in the US healthcare system makes data munging necessary, because even seemingly basic data shows small differences at scale. Making data work, no matter how it was originally entered, is critical when extracting insights from disparate data sets.
We’ll let our engineers explain some of the many areas in which munging comes up, and how it benefits Particle users.
Data munging involves taking disorganized data and fitting it to an organized, standardized library of terms so that it can be ingested and used to find insights.
For Particle Health, the data we find lives on disparate EHR systems in a way that makes sense for each system individually. Each hospital organizes its long XML C-CDA files slightly differently. Things aren’t always mapped to standard ICD or LOINC terms. The same information can appear under different headers, in hundreds of variations, and we might get multiple files per patient.
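To make the header problem concrete, here is a minimal Python sketch of mapping variant headers onto one canonical field name. The synonym table, the field names, and the `normalize_header` helper are all hypothetical illustrations, not Particle's actual mapping.

```python
import re

# Hypothetical synonym table: canonical field name -> header variants
# that different source files might use for the same information.
HEADER_SYNONYMS = {
    "respiratory_rate": {"respiratory rate", "resp rate", "rr", "respirations"},
    "body_height": {"height", "body height", "ht"},
}

# Invert the table once so each lookup is a single dictionary access.
VARIANT_TO_CANONICAL = {
    variant: canonical
    for canonical, variants in HEADER_SYNONYMS.items()
    for variant in variants
}


def normalize_header(raw):
    """Return the canonical field name for a raw header, or None if unmapped."""
    # Lowercase, turn punctuation and underscores into spaces, trim the ends.
    key = re.sub(r"[^a-z0-9]+", " ", raw.lower()).strip()
    return VARIANT_TO_CANONICAL.get(key)
```

Returning None for an unknown header, rather than guessing, keeps unmapped fields visible so that someone can decide how to handle them.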
Data munging comes into play when we aggregate those files. It supports client analytical models when they can see, in uniform fashion, how things like respiratory rate have changed over time, or when they can exclude extraneous information, like height, that shouldn’t be part of the record the client is looking for.
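As a rough illustration of that aggregation step, the sketch below merges already-parsed observations from two files into a single date-ordered series for one field. The record layout, the sample values, and the rule that an identical date and value count as the same reading are assumptions made for the example.

```python
from datetime import date

# Hypothetical, already-parsed observations from two files for one patient.
file_a = [
    {"field": "respiratory_rate", "date": date(2021, 3, 1), "value": 16},
    {"field": "body_height", "date": date(2021, 3, 1), "value": 170},
]
file_b = [
    {"field": "respiratory_rate", "date": date(2021, 1, 15), "value": 18},
    # Same reading as in file_a, reported a second time.
    {"field": "respiratory_rate", "date": date(2021, 3, 1), "value": 16},
]


def build_series(files, wanted_field):
    """Merge observations from several files into one date-ordered series."""
    seen = set()
    series = []
    for observations in files:
        for obs in observations:
            if obs["field"] != wanted_field:
                continue  # skip fields the client did not ask for, e.g. height
            key = (obs["date"], obs["value"])
            if key in seen:
                continue  # assumed rule: same date and value means same reading
            seen.add(key)
            series.append(key)
    return sorted(series)


respiratory_series = build_series([file_a, file_b], "respiratory_rate")
```

Here `respiratory_series` holds two readings in date order, with the height record left out and the repeated reading kept once.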
Data munging can be simple, but powerful. When it’s applied in the right place, it can help a business see things differently, or an engineering team leap ahead.
Here’s an easy example: our team found a website with hundreds of endpoints for health information. This would be a very valuable resource, except that the website showed those endpoints on separate cards. To use this information we would have to manually go through each card and copy/paste it into a spreadsheet. That’s a lot of work!
To munge the data, I put together a JavaScript snippet for the browser console that retrieves the information for each card from the website’s HTML and outputs it in a form that can be imported into a spreadsheet. The resulting document helped us better understand the scope of the business opportunity of incorporating these endpoints.
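The original script ran as JavaScript in the browser console and is not reproduced here. The sketch below shows the same idea in Python, using only the standard library: walk the HTML, pull one row out of each card, and emit CSV. The markup (a `div` with class `card` holding an `h3` title and a link) and the names `CardParser` and `cards_to_csv` are assumptions for illustration, since the real site's HTML is not shown.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup: each card is a <div class="card"> with a title in
# <h3> and the endpoint URL in an <a href="...">.
SAMPLE_HTML = """
<div class="card"><h3>Clinic A</h3><a href="https://a.example/fhir">link</a></div>
<div class="card"><h3>Clinic B</h3><a href="https://b.example/fhir">link</a></div>
"""


class CardParser(HTMLParser):
    """Collect one [name, url] row per card."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "card":
            self.rows.append(["", ""])  # start a new card
        elif tag == "h3" and self.rows:
            self._in_title = True
        elif tag == "a" and self.rows:
            self.rows[-1][1] = attrs.get("href") or ""

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.rows[-1][0] += data.strip()


def cards_to_csv(html):
    """Return CSV text with a header row and one line per card."""
    parser = CardParser()
    parser.feed(html)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name", "endpoint"])
    writer.writerows(parser.rows)
    return out.getvalue()
```

Calling `cards_to_csv(SAMPLE_HTML)` yields text that a spreadsheet can import directly, which replaces the manual copy/paste pass.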
Munging also powers more complex quality analysis. I munged an anonymized healthcare dataset for simple patient de-duplication (think ARTUR instead of Artur). The output allowed us to distinguish patients without accessing any sensitive data. This helped us incorporate new ways to monitor our platform’s health with patient-centric metrics.
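A simple version of that de-duplication can be sketched as grouping records by a comparison key that ignores case and stray whitespace. The record fields (`name_token`, `birth_year`) and the sample rows are hypothetical, and the actual dataset and matching rules may well have differed.

```python
from collections import defaultdict

# Hypothetical anonymized records; "name_token" stands in for whatever
# de-identified name field the dataset carries.
records = [
    {"id": 1, "name_token": "ARTUR", "birth_year": 1980},
    {"id": 2, "name_token": "Artur ", "birth_year": 1980},
    {"id": 3, "name_token": "Maria", "birth_year": 1975},
]


def dedup_key(record):
    """Build a comparison key that ignores case and surrounding whitespace."""
    return (record["name_token"].strip().casefold(), record["birth_year"])


def group_duplicates(rows):
    """Return lists of record ids that share the same comparison key."""
    groups = defaultdict(list)
    for row in rows:
        groups[dedup_key(row)].append(row["id"])
    return list(groups.values())


patient_groups = group_duplicates(records)
```

Counting the groups rather than the raw rows gives a patient-centric number: three records here, but two distinct patients.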
Munging is a combination of cleaning and shaping data: aggregating, cleaning, and refining it. It’s what turns a multi-document corpus into a consistent representation of the attributes our use case calls for. This requires a contextual understanding of which attributes our use case needs, and how to represent them.
Consider what we do if a document shows no data. Do we use a zero, a none, or nothing at all? Different applications of the corpus will require different answers, as will the nature of the features (numeric, categorical, etc.). In short, deciding how to represent what is missing is one of the design decisions that go into munging of any kind.
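A small Python example shows why the choice matters for a numeric feature. The readings are made up, and the two options shown are only a sample of the possible designs.

```python
from statistics import mean

# Hypothetical respiratory-rate readings from three documents; the second
# document had no value for this field.
readings = [16, None, 20]

# Choice 1: treat missing as zero. This silently drags the average down.
as_zero = mean([0 if r is None else r for r in readings])

# Choice 2: keep None as an explicit "unknown" and skip it when aggregating.
present = [r for r in readings if r is not None]
skip_missing = mean(present)

# Keeping None also lets us report how much data was absent.
missing_count = sum(r is None for r in readings)
```

With these numbers the zero-filled average is 12 while the average over present values is 18, so the same corpus tells two different stories depending on one design decision.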
We avoid design choices that will require us to remove data. We’ve helped some customers add information, but essentially, we’re always looking for ways to minimize the application of our judgment to the content of the data.
On the cleaning side, which we are exploring but not yet building, munging includes normalization. I come from a text background and actually had this problem in another job: medication spellings. One source could spell Adderall with two Ls, another could use one L, and another could write it in lower case! Munging brings those things together so that they better represent the human concept in the generated data. There’s no point in counting a prescription for adderall and one for Adderall separately, but in order to let the machine know they’re the same, we have to munge!
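One way to sketch this kind of normalization in Python is to lowercase each spelling and then snap it to the closest entry in a canonical list with the standard library's `difflib.get_close_matches`. The vocabulary, the `normalize_medication` helper, and the 0.85 cutoff are assumptions for illustration; a production system would more likely rely on a standard drug terminology.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical canonical vocabulary, hand-written for the example.
CANONICAL_MEDICATIONS = ["adderall", "albuterol", "metformin"]


def normalize_medication(raw, cutoff=0.85):
    """Map a raw spelling to a canonical name, or return the cleaned input."""
    cleaned = raw.strip().lower()
    matches = get_close_matches(cleaned, CANONICAL_MEDICATIONS, n=1, cutoff=cutoff)
    return matches[0] if matches else cleaned


prescriptions = ["Adderall", "adderall", "Adderal", "Metformin"]
counts = Counter(normalize_medication(p) for p in prescriptions)
```

After normalization the three Adderall spellings are counted together. Fuzzy matching can also merge names that are genuinely different drugs, so the cutoff and the vocabulary deserve careful review before anything like this touches clinical data.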
We want something that is more easily and rapidly digestible. In other words, it’s like winnowing the wheat from the chaff.
Data munging helps us manage data from different sources and make it usable. It allows our clients to plug standardized data into their databases and analytical models, and to easily find the data they think is important.