Good data professionals consider the ethical obligations of working with sensitive data.
Given the aphorism that knowledge is power, data specialists (scientists, analysts, engineers) are effectively superheroes. We translate, control, channel, and deploy that power every day (hey, I don't make the rules). That understood, let us turn to the inimitable wisdom of Uncle Ben: with great power comes great responsibility.
So... data scientists need to be responsible. But what does responsibility even look like with data science? It's numbers. It's counting things. It's the epitome of Lawful Neutral. How does responsibility even fit in, exactly? And really, how bad could it be if it didn't?
I'm going to try to stay off my linguistic and semiotic high horses here, but every dataset is a representation and an abstraction of a thing. It's a lot like photography: even the most accurate photo is still a two-dimensional, frozen image of a three-dimensional, dynamic world. The simile is useful but incomplete, because in many data cases the photograph is the only way to see the system at all. So knowing how the “data photograph” differs from its subject is vital to understanding the underlying truth.
Furthermore, things can happen to an abstraction that can't happen to the underlying subject. Continuing to inhabit the photography metaphor, you can burn a picture of water, but burning actual water is a lot harder.
In the vast, complex systems that generate enterprise-level data, the model components can interact in unexpected ways. These interactions can be illustrative, showing where models are failing, or they can be sneaky, and subtly distort our concept of what's going on until a real-life consequence blows it all up.
As the arts have taught us, humans are beautiful, extraordinary, complex galaxies of truths and ideas, some tangible qualities, and many, many intangible ones. But digital representations of people (the uncanny valley aside - in data science we rarely deal with Second Life-style avatars) are far more limited. A digital person carries so little definition that records often can't be told apart unambiguously, so it's easy to merge two different people into one record, or to let the same person pile up duplicate records, clogging up data resources and skewing the picture of a population.
Also, knowing that any representation will fall short of the truth, we need to choose carefully which facets we do measure. A service providing financial aid will care about every financial aspect of a person, while agencies providing social services may focus much more on someone’s documented skills and relationships with others. The same person, digitally fleshed out for two different uses, will look very different. We have to know, each time, what a "person" is in a given system, so we can interpret it correctly - both for what it means and, in some ways more importantly, for what it fails to convey. For example, if we ask the financial system about someone's closest relationships, it will probably tell us they are with amazon.com (because that's the relationship that matters financially) rather than about the person's closeness with their sister.
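To make that concrete, here's a toy sketch of what "the same person" might look like to two different services. The field names are invented for illustration, not pulled from any real system:

```python
from dataclasses import dataclass

# Hypothetical schemas: the same person, abstracted for two different purposes.

@dataclass
class FinancialAidApplicant:
    person_id: str
    annual_income: float
    outstanding_debt: float
    household_size: int
    # No room here for who this person loves or leans on.

@dataclass
class SocialServicesClient:
    person_id: str
    documented_skills: list[str]
    emergency_contacts: list[str]
    housing_status: str
    # No room here for a credit score.

# Both records can describe the same human being; neither one is that human being.
```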
Data scientists have to know how their models differ from, and equate to, reality. They need to be aware of the ways a model might give unexpected answers or need additional caveats when it's queried for guidance. They have to be ready to think through the likely results of adding a new interaction method or data collection point to a model, so they can offer useful warnings and alternative solutions. Instead of someone whispering, "Remember, thou art mortal," to us every day, we have a tiny internal chant of, "the map is not the terrain, the map is not the terrain!"
The exchange pictured here is an example of the metric not matching the meaningful content: it's easy to spot that “ten jam” isn't a meaningful quantity. In a sufficiently complex model full of people, we may struggle much more to ensure that the data concept we've decided to count has real-world significance.
A convenient example of this is modeling the engagement and accuracy of help articles by how many times they're clicked. Calling a click a successful engagement is measuring the inches of carrot juice without dimensioning the cylinder it's filling. To get the right quantity, we need to know that the articles actually helped, and we can't know that just from whether or not a reader was suckered into clicking by the headline (the author would like to note that this characterization may come from a nonzero amount of personal experience and attendant bitterness). We need more dimension, often chronological, to determine whether the interaction was successful according to our desired outcome: helping users help themselves.
Just counting things is pretty much never enough. A number in isolation is not informative. At the very least, a number needs the contextualization of a denominator to help us understand how it moves us toward or away from our goals. Those goals are contextually defined by our business, whatever that business is.
Many of the meaningful metrics have to be composites of several individual figures, related in specific ways. In systems coming to life every day for different purposes, we have to figure out what the most meaningful metric (raw or composed) is, and how that metric should be interpreted.
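As a toy illustration of the difference, compare a raw click count to a composite metric that uses a denominator and a chronological follow-up to ask whether the article actually helped. The event log and column names below are invented assumptions, not a real schema:

```python
import pandas as pd

# Hypothetical event log for help articles.
events = pd.DataFrame({
    "user_id":    [1, 1, 2, 3, 3, 4],
    "article_id": ["a", "a", "b", "a", "a", "b"],
    "clicked":    [True, True, True, True, True, True],
    "opened_support_ticket_within_7d": [True, False, False, True, True, False],
})

# Raw count: looks impressive, means little on its own.
raw_clicks = events.groupby("article_id")["clicked"].sum()

# Composite metric: of the clicks, how many readers still needed human help afterward?
# The denominator (clicks) and the follow-up window give the number its meaning.
resolution_rate = (
    events.groupby("article_id")["opened_support_ticket_within_7d"]
    .apply(lambda s: 1 - s.mean())
)

print(raw_clicks)
print(resolution_rate)
```

In this sketch, article "b" looks less popular than article "a" by clicks alone, but it's the one that actually resolved its readers' problems.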
There's another facet to correct measurement: what to leave out. In the course of normal human events, human bias creeps into the historical record, including into the data we use to assess and predict. This gets especially dangerous when we consider the quasi-mythological objectivity it is tempting to ascribe to machines. Bizarrely, the same data in human hands is used to demonstrate gross injustice, but when we give it to a machine, we expect it to yield an even-handed predictive engine.
Knowing what we know both about the historical record and the way people often respond to machines' assertions, data scientists need to walk a careful, responsible path in which we preserve the utility of a system while removing the landmines of human bias wherever possible.
The biggest data-science-run-amok example I can think of is the predictive engine used in some parts of the US that is supposed to predict recidivism (COMPAS). Guess what the number one indicator of likely recidivism is? If you said race, you're wrong. It's age, which was 2.5 times more likely to generate a likely-to-commit-more-crimes outcome. Race was "only" 45% more likely to generate a higher risk of recidivism. When looking at the highest-risk pool (as labeled by the algorithm), black defendants were 3 times as likely as white defendants to be mislabeled “high risk.”
When incorporating race data into an algorithm, you are inviting it to influence the algorithm as much as it influenced the training data, which is to say, a lot, because humans made that data, and they are influenced by race a lot. "So take out race as a factor," you suggest, sensibly, and I respond, "Sure, that's a great idea!" but really doing that requires more than just removing the checkboxes that are labeled "White" and "Black."
It turns out that, in large part due to redlining, which was practiced without significant legal constraint until 1977, zip code is a distressingly accurate proxy for race. In fact, the Consumer Financial Protection Bureau specifically uses a composite race prediction from last name and geography to determine whether or not lenders are abiding by the Equal Credit Opportunity Act. That seems like a kind of white-hat use of this knowledge, but it's not hard to think of an example in which the same technique is not being used for virtuous ends. When we use data to draw conclusions or make predictions, we have to think of and look for proxies, so that we're not letting terrible human choices become terrible mechanical decrees.
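One way to go hunting for those proxies is sketched below, with an invented file name and invented columns: after dropping the explicit race field from your features, check whether a simple model can still recover race from the "neutral" features that remain. If it can, those features are doing race's work for it.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline

# Hypothetical table; the file and column names are assumptions for illustration.
df = pd.read_csv("applicants.csv")

# We removed the race checkboxes from the model's inputs... but can a supposedly
# neutral feature like zip code still recover race? If so, it's acting as a proxy.
proxy_probe = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(proxy_probe, df[["zip_code"]], df["race"], cv=5)
print(f"Race recoverable from zip code alone with ~{scores.mean():.0%} accuracy")
```

It's the same intuition behind the surname-and-geography method mentioned above, just run in the other direction: if geography predicts race well enough for an auditor, it predicts race well enough to smuggle it into your model.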
In the physical sciences, there is a concept of repeatability where, using the same equipment in the same context, I should be able to see exactly what you do during an identical experiment. In data science, we're more into reproducibility, which is more like us each building our own tools, looking at the same phenomenon, and getting the same counts or concepts out of it.
Because we are often defining what is measurable and building the tools to measure it, we can use the differences between our independent methodologies to figure out where we have made different assumptions about the nature of life, the universe, and everything, and how those assumptions have affected our outcomes.
We can engage in several tiers of acceptance testing, depending on the project need. While I have used code review and spot testing, the most in-depth and important way to ensure confidence in analyses is by reproducing work.
At the outset of the project, we define the terms and the desired output. Then, once one person answers the question, someone else tries to get similar results, and we reconvene and compare notes to figure out why the numbers aren't the same (because they pretty much never are). This is a simplistic outline; in practice it can take anywhere from several days to many, depending on the complexity of what we're trying to accomplish or answer.
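The "compare notes" step can be as unglamorous as the sketch below, where two independently produced answers to the same question get lined up and every disagreement becomes a conversation about assumptions. The file and column names are invented for illustration:

```python
import pandas as pd

# Hypothetical: two analysts independently answer "active members per clinic."
mine = pd.read_csv("active_members_mine.csv", index_col="clinic_id")
theirs = pd.read_csv("active_members_theirs.csv", index_col="clinic_id")

comparison = mine.join(theirs, lsuffix="_mine", rsuffix="_theirs", how="outer")
comparison["diff"] = (
    comparison["active_members_mine"] - comparison["active_members_theirs"]
)

# The interesting rows are the ones where we disagree: each gap points at a
# differing assumption (what counts as "active"? which date range? which clinics?).
# NaN != 0, so clinics that only one of us counted show up here too.
disagreements = comparison[comparison["diff"].ne(0)]
print(disagreements.sort_values("diff", key=abs, ascending=False))
```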
Reproducing work is time- and brain-consuming, but I think it's the right way to make sure that, in the end, we have the best-explored and most rigorously defined components mapped out.
In each of the examples above, there's an implied or articulated consequence to not engaging in the most responsible process.
Generally speaking, there are two big pitfalls in giving answers from a data-rich system: mistaking the representation for the reality it abstracts, and forgetting the human bias baked into the data we inherit.
Because of the larger context in which all this work takes place, even small projects can have outsized repercussions. The world is very busy. It creates a lot of data. As a result, many people rely on the work done by others. We all stand on the shoulders of giants of course, but we also sort of... crowd surf our way to answers. Often.
There doesn't have to be anything wrong with that, but the snowball effect of one wrong conclusion rolling downhill through other people's work is the risk that needs to keep us on the straight and narrow. As more and more organizations turn to data-driven decision-making, that one wrong conclusion gains real-world consequences. Folks use code that looks like it works, or, worse yet, corpora (training data) that look good on the surface. They don't necessarily have the resources to investigate it thoroughly or build their own from scratch. That's why it's every data person's job to think critically not only about how we investigate a given problem, but about how our tools and conclusions will influence other data consumers.
So, are data people really superheroes? Maybe not Spider-Man-level superheroes, but we can all agree that Uncle Ben's advice is solid and should be taken to heart by anyone involved in the intricate, enormous task of turning data into knowledge, because that's when it becomes powerful. In health tech, and everywhere, we should consider, consume, and interpret our data carefully.
Further Reading