Big Data’s Small Lie – The Limitation of Sampling and Approximation in Big Data Analysis
Volume is the most prominent of big data’s “3 Vs.” Yet, the “big” in big data analysis is often a misnomer. Most big data analysis doesn’t look at a complete, large dataset. Instead, it looks at a subsample and works on approximations, which prevents enterprises from getting the most valuable insight from their data.
Many companies spend vast resources to collect, transform and store the massive amounts of data that flow through their business processes. However, when it comes to doing analysis and machine learning on this data, time and compute speed gate how much data can be analyzed. As a result, organizations turn to sampling because they believe they have no other choice. This is equivalent to a person investing a large sum for retirement, only to discover when they retire and need the money that they can only access $200 at a time because the bank doesn’t have the resources to give them access to all of their funds.
Ironically, some people will even rationalize their lack of choice when it comes to using big data by convincing themselves and others that using more data doesn’t really improve model accuracy. This is an interesting point, and those who say it are both right and wrong at the same time. While it is very difficult to accomplish, there are ways to overcome the time and compute barriers and build accurate models with all of the data.
Predictive Accuracy Leads to Increased Business Value
Before we dive into why bigger data is better, it is important to understand why model accuracy matters: more accurate models drive more business value. Take fraud, for instance: if a company can better predict an incident of fraud, then it can stop more fraud. The concept is simple, but it isn’t just about detecting fraudulent transactions; it’s also about not flagging legitimate transactions as fraudulent, errors known as false positives. False positives incur costs as well, usually in customer satisfaction and/or lost opportunities.
So, the “best” model must strike a balance between maximizing event detection and minimizing false positives. The more data available to demonstrate these outcomes and their relationship to one another, the better the predictions will become. When the models are fed a significant amount of data and optimized to learn based on past outcomes, they create higher predictive power, and thus higher business value.
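This detection-versus-false-positive balance can be made concrete with a small sketch. The scores and labels below are invented for illustration and are not from any real fraud system; the point is only that lowering the alert threshold catches more fraud (higher recall) at the cost of more false positives:

```python
# Illustrative sketch with made-up scores: a fraud model must balance
# detection (recall) against false positives as its threshold changes.
def confusion(scores, labels, threshold):
    """Count true positives, false positives and false negatives."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return tp, fp, fn

# Hypothetical model scores and true labels (1 = fraud, 0 = legitimate)
scores = [0.95, 0.80, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    0,    1,    0]

for threshold in (0.7, 0.3):
    tp, fp, fn = confusion(scores, labels, threshold)
    recall = tp / (tp + fn)  # fraction of fraud caught
    print(f"threshold={threshold}: caught {tp} frauds, "
          f"{fp} false positives, recall={recall:.2f}")
```

Lowering the threshold from 0.7 to 0.3 catches an extra fraud case but also raises false alarms on legitimate transactions; the more outcome data available, the more precisely this tradeoff can be tuned.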
How to Increase Predictive Power and Thus Business Value
A recent meme, “More data, not better models,” has become popular within the data science community. But despite its popularity, it is no longer true and presents a false dichotomy: more data and better models are now possible. When building a predictive model, there are several ways to improve accuracy. Fundamentally, one can use more sophisticated models, use more data or run more experiments across a large variety of models. The reality is that you need a combination of these approaches.
A very large dataset with a simple model wastes the value of the additional data beyond a certain point. A real example of this practice is approaching fraud detection by training simple logistic regression models on the mountains of available transaction data. Throwing more data at an unsophisticated (linear) model will likely not deliver better results, so the naysayers are correct in a way.
The opposite is also true. Using a more sophisticated (non-linear) model with small data will also likely deliver sub-optimal results because these models crave data and are able to detect subtle patterns that may not be present in the sampled dataset. So, the naysayers are also incorrect. Data science on big data requires a balance of sophisticated models and as much data as possible.
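Both halves of this argument can be seen in a toy sketch on synthetic data (the curved relationship and the 1-nearest-neighbor learner are illustrative choices, not methods from the article). A linear fit to a non-linear pattern stops improving no matter how much data it sees, while a simple non-linear learner keeps improving as data grows:

```python
# Toy sketch on synthetic data: a linear model cannot capture y = x^2
# however much data it sees, while a non-linear learner (1-nearest-
# neighbor) improves as the training set grows.
import random

def linear_fit_error(xs, ys):
    # Ordinary least squares for y = a*x + b, then mean absolute error.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return sum(abs(a * x + b - y) for x, y in zip(xs, ys)) / n

def nn_fit_error(xs, ys, test_xs, test_ys):
    # Predict each held-out point from its nearest training neighbor.
    err = 0.0
    for tx, ty in zip(test_xs, test_ys):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - tx))
        err += abs(ys[nearest] - ty)
    return err / len(test_xs)

random.seed(0)
for n in (20, 2000):
    xs = [random.uniform(-1, 1) for _ in range(n)]
    ys = [x * x for x in xs]
    test_xs = [random.uniform(-1, 1) for _ in range(200)]
    test_ys = [x * x for x in test_xs]
    print(f"n={n}: linear error={linear_fit_error(xs, ys):.3f}, "
          f"1-NN error={nn_fit_error(xs, ys, test_xs, test_ys):.3f}")
```

The linear model’s error stays roughly constant as n grows, while the non-linear learner’s error shrinks: more data only pays off when the model is sophisticated enough to use it, and sophisticated models only pay off when fed enough data.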
The last approach is to run as many experiments as possible in order to get the best model. But this last component can be very difficult, as experimentation on big data requires time and computational resources (unless one can reduce the time and computational power required to train models, which is actually possible, but we’ll save that for another day).
Why is Bigger Data Better?
There are a number of common situations in which the main data points of interest constitute a tiny proportion of the entire dataset. Consequently, although the overall amount of data is very large, there may be only barely enough data to support the analysis of interest. The practice of subsampling, which is almost always uniform sampling, is highly likely to miss these critical points, effectively throwing away most, if not all, of the information of interest. Instead, one should collect the maximum amount of data possible for use in training predictive models. Situations like this include:
- Outliers or small clusters. These are data points that are unusual with respect to the rest of the data. Identifying them may be of interest, for example, in the detection of new forms of transaction fraud, new forms of cyber security intrusions or new types of customer behaviors. This can be generalized from individual points to small clusters. Outliers may also be of interest because they represent a kind of systematic error or noise in the system, which may trigger a cleaning of the data, or at least a probe to further understand the phenomenon.
- Rare events or objects. These may be high-value data points of a known type that are simply much less common than the rest of the data. For example, quasars are orders of magnitude less common than stars, yet key to understanding the earliest state of the universe. Another example is insurance claims: most people don’t make claims, but the instances of claims are exactly the data points that most inform a model of claims loss. Such situations often result in highly imbalanced classification training sets.
- Rare discrete values or classes. Due to the combinatorial nature of discrete-valued variables, rare values are felt more acutely than in numeric variables. For example, in a zip code feature, most zip codes will have very few examples, and the interpolation that a numeric variable naturally allows must be replaced, in the worst case, by pooling or other manipulation of the data. A similar issue can arise in a multi-class problem with a large number of classes, such as product type.
- Missing values. When values are not just rare but missing altogether, fewer of the combinations of values that define the manifold of the data are available for building the model. In general, values are not missing with equal probability, leaving certain parts of the space very sparsely observed, if at all. In some datasets, the total amount of non-missing data can be significantly smaller than the size of the whole dataset.
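A quick simulation shows how uniform subsampling destroys rare classes like those above. The proportions here are invented for illustration: assume fraud occurs in roughly 0.1% of transactions and a 1% uniform sample is taken for training:

```python
# Sketch with synthetic proportions: uniform subsampling of a dataset
# containing a rare class of interest. With ~0.1% fraud and a 1%
# uniform sample, only a handful of fraud cases survive the sampling,
# often too few to learn the fraud pattern from.
import random

random.seed(42)
N = 100_000
# 1 = fraud (~0.1% of rows), 0 = legitimate
labels = [1 if random.random() < 0.001 else 0 for _ in range(N)]
# A 1% uniform subsample, as commonly taken to make training tractable
sample = random.sample(labels, k=N // 100)

print(f"fraud cases in full data: {sum(labels)}")
print(f"fraud cases in 1% sample: {sum(sample)}")
```

The full dataset contains on the order of a hundred fraud cases; the uniform sample typically retains only one or two, discarding almost all of the information the model most needs.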
Business Value through Predictive Power
Seeking higher business value in a predictive problem boils down to maximizing predictive power. Bigger data increases predictive power due to foundational principles of statistics, and subsampling to obtain a smaller training dataset can be moderately to highly destructive to the results. Computation time lies at the center of maximizing predictive power, and recent computational advances have made it possible to increase predictive power, and thus business value, to a greater extent than previously possible.
About the author: Alexander Gray, Ph.D., is CTO at Skytree and associate professor in the College of Computing at Georgia Tech. His work has focused on algorithmic techniques for making machine learning tractable on massive datasets. He began working with large-scale scientific data in 1993 at NASA’s Jet Propulsion Laboratory in its Machine Learning Systems Group. He recently served on the National Academy of Sciences Committee on the analysis of massive data as a Kavli Scholar, and a Berkeley Simons Fellow, and is a frequent advisor and speaker on the topic of machine learning on big data in academia, science and industry.