March 14, 2016

Dangers in Data Definition – How Data Quality Can Save Lives

Aaron Kalb


Thought leaders from Silicon Valley to Washington, DC are rightly excited by the prospect of a future in which data science, statistics, and analytics play a larger role in the shaping of public policy and discourse. But before computers can process large volumes of data to identify correlations, perform classifications, or generate predictive models, humans need to furnish them with appropriate data sets and understand the meaning of the raw data. A recent incident shows how even the most basic data-based calculation can be badly off when useful data is hard to find and hard to understand.

A Fatal Discrepancy

Amid all the recent debate regarding police, violence, and racial bias, it has emerged that national or even regional statistics on police behavior are not readily available. How often do police harm unarmed civilians? Does the race of the officer or the victim affect the outcome of an incident? Whenever a precinct, a DA’s office, or the Department of Justice is asked whether there’s evidence of systematic bias, the answer is typically “we just don’t have the data.”

This opacity around police statistics cuts the other way, as well: it’s similarly unclear how often police are injured or killed in the course of pursuing suspects.



USA Today recently found that the “FBI vastly understate[d] police deaths in chases” by a factor of 15. In other words, the FBI’s study “missed” 93% of cases. How can we account for such a large discrepancy, especially from an organization with the resources and brainpower to catch Al Capone (and Hannibal Lecter)? Reported numbers can sometimes be too high due to double-counting, while underestimates can often occur when aggregators miss an important dataset. Presumably the FBI could handle going through a checklist of all the police departments nationwide, so why the large difference?

In this case, the issue seems to be semantic, a question of how to define “death caused by car chase.” USA Today found that “only in the rare instances that a fleeing driver directly causes an officer’s death—usually by ramming a police car or forcing it off the road—does the FBI say an officer died ‘engaging in vehicle pursuit.’” That’s why the FBI reported only 24 such deaths between 1980 and 2014, whereas USA Today, using a broader definition of “death caused by car chase,” identified 371 deaths, roughly 6% of the 5,868 officer fatalities during those 35 years.
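The figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python that uses only the three counts cited in this article; the constant names and the `discrepancy_summary` helper are assumptions made for illustration, not part of any FBI or USA Today methodology.

```python
# Counts quoted in the article for officer deaths, 1980 through 2014.
FBI_CHASE_DEATHS = 24          # FBI's narrow definition
USA_TODAY_CHASE_DEATHS = 371   # USA Today's broader definition
ALL_OFFICER_FATALITIES = 5868  # all officer fatalities in the period


def discrepancy_summary(narrow, broad, total):
    """Compare two counts of the same phenomenon under different definitions.

    Hypothetical helper: the name and the returned keys are illustrative only.
    """
    return {
        "undercount_factor": broad / narrow,        # how many times larger the broad count is
        "share_missed": (broad - narrow) / broad,   # fraction of broad-definition cases not in the narrow count
        "broad_share_of_total": broad / total,      # broad count as a fraction of all fatalities
        "narrow_share_of_total": narrow / total,    # narrow count as a fraction of all fatalities
    }


summary = discrepancy_summary(
    FBI_CHASE_DEATHS, USA_TODAY_CHASE_DEATHS, ALL_OFFICER_FATALITIES
)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With these inputs the broader count is about 15.5 times the narrower one, the narrow definition leaves out about 93.5% of the broader count, and chase deaths make up about 6.3% of all officer fatalities under the broad definition versus about 0.4% under the narrow one.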

Today, law enforcement officials are debating changes to the recommendations about when and how police may pursue suspects in high-speed car chases. Suppose a more conservative policy were estimated to reduce police deaths in car chases by 50%, but would allow a certain number of dangerous criminals to run free. Would said changes be justified to spare the lives of 12 police officers (based on the FBI’s count)? How about to save 185 lives (given the more complete USA Today number)?
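To see how much the choice of definition moves the policy question, the hypothetical can be worked through directly. This is a sketch under the article's own assumption of a 50% reduction; the `deaths_avoided` function and the per-year breakdown are illustrative additions, not an estimate from any real policy study.

```python
YEARS = 35        # 1980 through 2014, the period covered by both counts
REDUCTION = 0.50  # hypothetical reduction posed in the article

# Baseline chase deaths under each definition, as cited in the article.
BASELINES = {
    "FBI (narrow definition)": 24,
    "USA Today (broad definition)": 371,
}


def deaths_avoided(baseline_deaths, reduction):
    """Deaths avoided if a policy cuts the baseline by the given fraction."""
    return baseline_deaths * reduction


results = {
    source: deaths_avoided(count, REDUCTION) for source, count in BASELINES.items()
}
for source, avoided in results.items():
    print(f"{source}: {avoided:.1f} lives over {YEARS} years "
          f"({avoided / YEARS:.2f} per year)")
```

The same 50% policy effect corresponds to 12 lives under the FBI's count and 185.5 (the article's 185) under USA Today's, or roughly one life every three years versus about five lives every year.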

Discrepancies in data definitions—and the resulting data interpretations—can have serious implications.

Better Data Definitions Yield Better Decisions

Today, many governments are focused on “open data” initiatives: uploading spreadsheets to public repositories or making numbers available via APIs. But numbers without context are meaningless—often incomprehensible or easily misunderstood. Rich metadata on the meanings and usage of the data assets is required to do accurate analysis.



In the future, government agencies and private citizens should be able to easily look up statistics for different definitions of police fatalities, racial bias, or any other phenomenon. Once the data set is fully understood, agencies and individuals can use that knowledge to make a more informed decision, or at least engage in more nuanced discourse.
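One way to picture this kind of lookup is a catalog in which every number carries its definition, source, and period alongside the value. The sketch below is purely illustrative: the `Statistic` record, the `CATALOG` dictionary, the metric key, and the `lookup` function are all hypothetical names, and the only data in it are the two counts cited in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statistic:
    """A number that travels with the context needed to interpret it."""
    value: int
    definition: str  # what exactly is being counted
    source: str
    period: str


# Hypothetical catalog: one metric, two competing definitions.
CATALOG = {
    "officer_deaths_in_chases": [
        Statistic(
            value=24,
            definition="fleeing driver directly causes the officer's death",
            source="FBI",
            period="1980-2014",
        ),
        Statistic(
            value=371,
            definition="broader definition of death caused by car chase "
                       "(details not given in this article)",
            source="USA Today",
            period="1980-2014",
        ),
    ],
}


def lookup(metric):
    """Return every recorded statistic for a metric, one per definition."""
    return CATALOG.get(metric, [])


for stat in lookup("officer_deaths_in_chases"):
    print(f"{stat.value:>4}  {stat.source} ({stat.period}): {stat.definition}")
```

Keeping the definition next to the value means a reader who finds two different numbers for the same metric can see why they differ instead of assuming one of them is wrong.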

In addition to establishing well-informed policies around police pursuit of suspects (e.g., for which crimes, at which times of day, at what speed?), open-and-understandable data might help answer other tough questions arising from today’s jarring headlines: How often do police shoot innocent civilians? How often does a suspect actually cause someone harm when an officer refrains from firing? How does the race of the officer or the victim change the odds? Are there any training programs that reliably reduce racial bias in use of force in the field?

The value of open-and-understandable data can extend far beyond the realm of law enforcement. Policy makers and voters would, for example, be able to do their own research into when and for whom charter schools produce better outcomes, or where someone could receive optimal healthcare, or whether “sharing economy” companies wind up helping or hurting low-income residents of cities, or other topics. With open-and-understandable data, everyone from data scientists to everyday citizens can leverage actual evidence to make our society measurably better.


About the author: Aaron Kalb has spent his career crafting empowering human-computer interactions, especially through natural language interfaces. After leaving Stanford with a BS and an MS in Symbolic Systems and working at Apple on iOS and Siri (doing engineering, research, and design in the Advanced Development Group), Aaron now leads the design team and guides the product vision at Alation.