When Big Data Becomes Too Much Data
About 2.5 exabytes of data will be generated today, or roughly the amount of data that was generated from the dawn of time until 2004. What’s in there, and will any of it be useful? The reality is the amount of data is so vast, its quality so dubious, and our abilities so relatively weak that most of it will have no impact whatsoever.
In a perfect world, each additional byte of data we generate and absorb would shave a little bit more uncertainty away from our models and help us get closer to truth as it exists in nature. But our world is not perfect, and neither are our models. And sometimes, big data doesn’t so much inform us as it clouds our understanding and ability to make good decisions. That’s when big data morphs into too much data.
There is no doubt that organizations today are interested in big data analytics. Business leaders read big data success stories and then fire up their own projects. While there is the potential to benefit from the massive data flows and new analytics technology like Hadoop and Spark, the big data road is dotted with pitfalls. “Big data is tricky. It can help or hurt your analysis, depending on how you use it,” Philip Kegelmeyer, a senior scientist at Sandia National Labs in Livermore, California, said in a recent issue of Sandia Labs News. “…[B]ig data can be powerful, but only if you understand the inherent weaknesses and tradeoffs. You can’t just take data at face value.”
According to Kegelmeyer, one way big data can trip you up is by magnifying errors in logical reasoning. The base rate fallacy, in particular, is a trap waiting to happen. It occurs when you judge how likely something is from case-specific evidence while ignoring how common that thing is in the underlying population (the base rate). Inadvertently mixing in the wrong type of data, or at least a data set that is unrelated to the first, and then drawing conclusions from the combination makes the error easy to commit.
Humans are very prone to making base rate errors when following a line of reasoning, and big data analytics, with its heavy emphasis on mixing different data sets in pursuit of correlations, is particularly vulnerable. A potential antidote is the effective use of Bayes’ rule, a result from probability theory that updates a prior probability (the base rate) in light of new evidence, and which can help shield prospective big data analysts from making poor logical choices.
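To make the base rate fallacy concrete, here is a minimal sketch of Bayes’ rule in Python. The scenario and every number in it (a 1 percent base rate, a 95 percent detection rate, a 5 percent false alarm rate) are illustrative assumptions, as is the function name posterior; none of them come from Kegelmeyer or Sandia.

```python
def posterior(base_rate, true_positive_rate, false_positive_rate):
    """Bayes' rule: P(real | flagged) = P(flagged | real) * P(real) / P(flagged)."""
    flagged_and_real = true_positive_rate * base_rate
    flagged_and_false = false_positive_rate * (1.0 - base_rate)
    return flagged_and_real / (flagged_and_real + flagged_and_false)


# Hypothetical numbers, chosen only for illustration.
base_rate = 0.01            # 1% of records are genuinely anomalous
true_positive_rate = 0.95   # the detector flags 95% of genuine anomalies
false_positive_rate = 0.05  # it also flags 5% of normal records

p = posterior(base_rate, true_positive_rate, false_positive_rate)
print(f"Chance that a flagged record is genuinely anomalous: {p:.1%}")
```

Under these assumed numbers, a flagged record is genuine only about 16 percent of the time, even though the detector catches 95 percent of real anomalies. Ignoring the 1 percent base rate and reading the result as "95 percent likely" is the fallacy in miniature.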
More information is not always better information. In fact, the average person’s track record in making good decisions when presented with more information is alarmingly poor. Take, for example, the famous 1998 study “On the Pursuit and Misuse of Useless Information,” by Princeton and Stanford University psychologists Anthony Bastardi and Eldar Shafir.
In one of the key experiments in the study, two groups of people were asked to judge the creditworthiness of a recent college graduate who was seeking a mortgage. The applicant had a good job and a good credit score, but had not made any payments on $5,000 in credit card debt for the past three months.
The first group was presented with that information in a straightforward manner and then asked to approve or reject the loan. The second group was presented with the same information, but with one caveat: Instead of knowing the size of the debt, they were told it was either $5,000 or $25,000. This group was given three choices: approve the application, reject the application, or wait for more information on the amount of the debt. Most of the people in the second group chose to wait for more information.
When the second group was informed that the debt was, in fact, $5,000, 79 percent of them approved the loan. By contrast, only 29 percent of the first group approved the loan, even though both groups ended up with exactly the same facts. Apparently, when the second group was presented with irrelevant information (the possibility that the debt could be five times bigger than it actually was), its proximity to real data affected the group’s decision-making.
While this study is more than 15 years old, it tells us something quite pertinent about how people process information that they actively pursue. “Decision makers often pursue noninstrumental information–information that appears relevant but, if simply available, would have no impact on choice,” the researchers conclude. “Once they pursue such information, people then use it to make their decision. Consequently, the pursuit of information that would have had no impact on choice leads people to make choices they would not otherwise have made.”
We don’t have a choice in whether we deal with big data. Big data is already here, and it’s going to get bigger—much, much bigger—whether we like it or not. We are tantalized by the stories of organizations like Google, Facebook, Yahoo, and Netflix that have harnessed big data and made it work for them. And it’s normal that we emulate those examples and seek to build similar big data teams to help us tackle big data too.
But at the same time it’s important not to get too carried away by the promise of big data. While there is value within data, separating the useful signals from the surrounding noise gets harder as the data sets get bigger. To avoid the pitfalls, start small with your data analytics projects, ensuring you’re finding real value, and build from there.