Lessons Learned: What Big Data and Predictive Analytics Missed in 2016
In this era of the software-driven business, we’re told that “data is the new oil,” and that predictive analytics and machine intelligence will extract actionable insights from this valuable resource and revolutionize the world as we know it. Yet 2016 brought three highly visible failures of this predictive view of the world: the UK’s Brexit plebiscite; the Colombian referendum on the peace agreement with FARC; and finally, the U.S. presidential election. What did these scenarios have in common? They all dealt with human behavior. This got me thinking that there might be lessons here that are relevant to analytics.
Data, Hard and Soft
The fact that data can be noisy or corrupted is well known. The question is: how does the uncertainty within the data propagate through the analytics and manifest itself in the accuracy of predictions derived from this data? For the purposes of this article, the analysis can be statistical, game-theoretic, deep learning-based, or anything else.
There is also an important distinction between what I call “hard” data and “soft” data. This is not standard terminology, so let me define what I mean by these terms.
Hard data comes from observations and measurements of the macroscopic natural world: the positions of astronomical objects, the electrical impulses within the brain, or even the amounts of your credit card transactions. Typically, such data is objective. The observations are numerical, and the uncertainty is adequately characterized as an error zone around a central value. There is an (often unstated) assumption that the observation is trusted and repeatable (i.e., nature is not being adversarial and presenting the observer with misleading results).
Much effort has gone into designing measurement apparatus, calibration techniques, and experimental design to reduce the error zones. There is even the so-called “personal equation” to account for observer bias. And concepts such as error propagation and numerical stability allow numerical computing and statistics to build reliable models from data with this form of uncertainty.
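To make the idea of error propagation concrete, here is a minimal Python sketch of the standard first-order rules for independent errors: absolute uncertainties add in quadrature for a sum, and relative uncertainties add in quadrature for a product. The function names and the sample numbers are illustrative assumptions, not part of any particular library.

```python
import math


def propagate_sum(x, sigma_x, y, sigma_y):
    """Return (z, sigma_z) for z = x + y, assuming independent errors."""
    # Absolute uncertainties combine in quadrature.
    return x + y, math.sqrt(sigma_x ** 2 + sigma_y ** 2)


def propagate_product(x, sigma_x, y, sigma_y):
    """Return (z, sigma_z) for z = x * y, assuming independent errors.

    First-order approximation: valid when the relative errors are small
    and neither x nor y is zero.
    """
    z = x * y
    # Relative uncertainties combine in quadrature.
    relative = math.sqrt((sigma_x / x) ** 2 + (sigma_y / y) ** 2)
    return z, abs(z) * relative


if __name__ == "__main__":
    # Hypothetical measurements, chosen only for illustration.
    print(propagate_sum(10.0, 3.0, 20.0, 4.0))      # (30.0, 5.0)
    print(propagate_product(10.0, 0.1, 20.0, 0.2))  # 200.0 with sigma of about 2.83
```

The point of the sketch is that with hard data the uncertainty of a derived quantity can be computed from the uncertainty of its inputs. No comparably simple rule is available once the inputs are soft.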
The robustness of such hard data analytics techniques allowed Johannes Kepler to derive his laws of planetary motion in the early 1600s from Tycho Brahe’s observations and, earlier this year, allowed astrophysicists to demonstrate the presence of gravitational waves in data where the noise outweighed the signal by many orders of magnitude.
Soft data, in contrast, derives from observations of a social world and is typically subjective. Observations may be numerical (rank the following on a scale of 1-5) or categorical (classify the following as “agree,” “disagree,” or “neither agree nor disagree”) and are typically drawn from a sample of the entire population. And while human responses are definitely soft, other types of data may also have this characteristic. In fact, “hard” and “soft” are likely the end points of a spectrum, and we may even want to talk about the “hardness” of the data (just as we do for water – except here hardness is good).
Can Soft Responses Be Reliable? (Remembering Gregory House, MD)
Here’s the important question: Can a behavioral model derived from the soft responses of a population sample reliably predict the actions of the entire population? The sources of error (and uncertainty) are, to my mind, twofold:
- Is the sample representative of the population?
- Are the responses trustworthy? This breaks down further: Is the survey instrument itself trustworthy? And, can an individual response be trusted?
A true sample?
The problem of sample fidelity has been studied extensively in statistics, and some form of randomization is the usual solution to the problem. This generally works, but is not foolproof and is subject to challenges in today’s software-driven world.
When conducting an online-only or mobile phone survey, is a significant segment of the senior citizen demographic overlooked? Or an entire socio-economic sector? A mobile survey may be fine for investigating the spending patterns of buyers in a certain demographic (teenagers with smartphones), but the same approach may prove unreliable when looking at voting patterns.
With an online survey where anyone can participate, how do we know we’re getting a random sample, rather than one where a particular bloc is skewing the sample by participating in disproportionate numbers? A population segment may not have easy online access, yet still turn out to vote.
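The coverage problem can be illustrated with a small simulation. The sketch below is a toy model under loudly stated assumptions: the two segments, their population shares, their online access rates, and their voting preferences are all hypothetical numbers chosen to make the effect visible, not estimates from any real poll. It compares the true vote share with a naive online-only estimate and with an estimate reweighted by known population shares (post-stratification), which only helps if online members of a segment answer like its offline members.

```python
import random

# Hypothetical segments:
# name -> (population share, probability of being online, probability of voting "leave")
SEGMENTS = {
    "seniors": (0.30, 0.20, 0.65),
    "others": (0.70, 0.90, 0.45),
}


def simulate(n=100_000, seed=0):
    """Return (true share, naive online share, reweighted online share)."""
    rng = random.Random(seed)
    names = list(SEGMENTS)
    shares = [SEGMENTS[name][0] for name in names]
    all_votes = []                               # the whole electorate
    online_votes = {name: [] for name in names}  # only people the survey reaches

    for _ in range(n):
        segment = rng.choices(names, weights=shares)[0]
        _, p_online, p_leave = SEGMENTS[segment]
        vote = 1 if rng.random() < p_leave else 0
        all_votes.append(vote)
        if rng.random() < p_online:
            online_votes[segment].append(vote)

    true_share = sum(all_votes) / n

    # Naive estimate: pool every online response, ignoring who is missing.
    pooled = [v for name in names for v in online_votes[name]]
    naive = sum(pooled) / len(pooled)

    # Post-stratified estimate: weight each segment's online mean
    # by that segment's known share of the population.
    reweighted = sum(
        SEGMENTS[name][0] * sum(online_votes[name]) / len(online_votes[name])
        for name in names
    )
    return true_share, naive, reweighted


if __name__ == "__main__":
    true_share, naive, reweighted = simulate()
    print(f"true: {true_share:.3f}  online-only: {naive:.3f}  reweighted: {reweighted:.3f}")
```

With these made-up parameters the expected true share is 0.51, while the expected online-only estimate is about 0.47, so the survey calls the outcome for the wrong side even though every individual answer is honest.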
Can the survey be trusted?
Assuming we’re satisfied with the fidelity of the sample, the next question is how trustworthy the survey instrument is. Is it manipulating the respondent into producing a specific response, despite appearances to the contrary? The subtlety and effectiveness of such manipulation are demonstrated with devastating (and hilarious) effect by Sir Humphrey Appleby in a well-known scene from the BBC comedy Yes, Prime Minister.
Assuming the survey is not leading the witness, the question remains whether an individual’s response to a question can be trusted. Remember Dr. House’s Credo: “Everybody lies!” This may be a gross oversimplification, but so is a model of humans based on rationality. Will employees truthfully describe the realities of their work environment before it becomes intolerable?
Can peer pressure, political correctness, herd mentality, fear of institutional consequences, or the possibility of social stigma cause someone to suppress their true response? Is such behavior among a set of individuals uncorrelated (in which case, modeling it may not be important) or correlated (in which case, the individual distortions may interfere constructively to swing a decision one way or another around a decision threshold)? Are such responses inherently jittery and chaotic?
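The difference between uncorrelated and correlated misreporting can be sketched with a toy model. Everything here is an assumption made for illustration: a hypothetical true support of 52%, a 10% chance that a respondent does not answer truthfully, and two stylized behaviors. In the uncorrelated case anyone may flip their answer in either direction; in the correlated case only supporters feel the pressure, and they all err the same way.

```python
import random


def poll(n, true_support, hide_prob, one_sided, seed=0):
    """Return the reported share of "yes" answers in a simulated poll."""
    rng = random.Random(seed)
    reported_yes = 0
    for _ in range(n):
        supports = rng.random() < true_support
        misreports = rng.random() < hide_prob
        if one_sided:
            # Correlated: only supporters hide their view, always as "no".
            answer = supports and not misreports
        else:
            # Uncorrelated: any respondent may flip, in either direction.
            answer = supports != misreports
        reported_yes += answer
    return reported_yes / n


if __name__ == "__main__":
    # Hypothetical parameters, chosen only to show the effect.
    symmetric = poll(100_000, 0.52, 0.10, one_sided=False)
    shy = poll(100_000, 0.52, 0.10, one_sided=True)
    print(f"uncorrelated flips: {symmetric:.3f}  one-sided hiding: {shy:.3f}")
```

In expectation, symmetric flipping only pulls the reported share toward 50% (here to about 0.516), so it stays on the correct side of the threshold. One-sided hiding moves it to about 0.468, which crosses the threshold and reverses the predicted outcome.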
I suspect there may be knowledge from the social sciences (economics, sociology, psychology, anthropology) or modern physics (quantum superposition, anyone?) that we need to consider and incorporate into data science.
Lessons to be Learned
Where does all of this leave us? The statistician George Box famously said, “All models are wrong, but some are useful.” However, the principle of “garbage in, garbage out” also applies, since a model derived from unrepresentative data is unlikely to be predictive, as both the Brexit vote and the U.S. presidential election showed.
The results of Brexit, FARC, and the U.S. Presidential election are more than a data quality issue. There undoubtedly were weak signals that we failed to detect. Thus, the better we can understand the “soft” variety of big data, the better we can do in terms of predictive (and eventually prescriptive) analysis based on such data.
I am by no means suggesting that we fall back to the Pythia or to auspices ex avibus to predict the future. But we must recognize that our current state of mathematical modeling is not foolproof, and is still much more art than science. In a sense, this is good news for those in the field, as there is tremendous room to grow. But this should also teach us to be humble and cautious, and perhaps to treat claims of “sentiment analysis” and “polling experts” with a grain of salt.
About the author: Dr. Siddhartha (Sid) Chatterjee is the Chief Technology Officer of Persistent Systems, which he joined in 2015. Prior to Persistent, Sid was with IBM for 13 years, during which he held multiple technical, strategic, managerial, and executive positions across IBM Research, IBM Systems & Technology Group, and IBM Global Technology Services. Sid holds a B. Tech. (Honors) degree in Electronics and Electrical Communications Engineering from IIT, Kharagpur and a Ph.D. in Computer Science from Carnegie Mellon University. He has also been a visiting scientist at the Research Institute for Advanced Computer Science (RIACS) in Moffett Field, CA, an assistant and associate professor of Computer Science at the University of North Carolina, and an adjunct faculty member at the University of California, Duke University, and the University of Texas. He is an ACM Distinguished Member, an IEEE Senior Member, and a Sigma Xi member.