July 20, 2015

Big Data’s Small Lie – The Limitation of Sampling and Approximation in Big Data Analysis

Alexander Gray

Volume is the most prominent of big data’s “3 Vs.” Yet, the “big” in big data analysis is often a misnomer. Most big data analysis doesn’t look at a complete, large dataset. Instead, it looks at a subsample and works on approximations, which prevents enterprises from getting the most valuable insight from their data.

Many companies spend vast amounts of resources to collect, transform and store the massive amounts of data that flow through their business processes. However, when it comes to doing analysis and machine learning on this data, time and compute speed gate how much data can be analyzed. As a result, organizations turn to sampling because they believe they have no other choice. This is equivalent to a person investing a large amount of money for retirement, only to discover, when they retire and need the money, that they can access only $200 at a time because the bank doesn’t have the resources to give them access to all of their funds.

Ironically, some people will even rationalize this lack of choice by convincing themselves and others that using more data doesn’t really improve model accuracy. It is an interesting claim, and those who make it are both right and wrong at the same time. While it is very difficult to accomplish, there are ways to overcome the time and compute barriers and build accurate models with all of their data.

Predictive Accuracy Leads to Increased Business Value

Before we dive into why bigger data is better, it is important to understand why model accuracy matters. It matters because more accurate models drive more business value. Take fraud, for instance: if a company can better predict an incident of fraud, then it can stop more fraud. The concept is pretty simple, but it isn’t just about detecting fraudulent transactions; it’s also about avoiding flagging legitimate transactions as fraudulent, mistakes known as false positives. These false positives incur costs as well, usually in the areas of customer satisfaction and/or lost opportunities.

So, the “best” model must strike a balance between maximizing event detection and minimizing false positives. The more data available to demonstrate these outcomes and their relationship to one another, the better the predictions will become. When the models are fed a significant amount of data and optimized to learn based on past outcomes, they create higher predictive power, and thus higher business value.
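To make that balance concrete, here is a minimal sketch, assuming scikit-learn, synthetic imbalanced data and invented per-event costs (none of which come from the article), of choosing the decision threshold that minimizes the expected cost of missed fraud versus false alarms:

```python
# Illustrative sketch only: synthetic data and assumed costs, not figures
# from the article. Picks the score threshold with the lowest expected cost.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical per-event costs: a missed fraud is assumed to be far more
# expensive than annoying a legitimate customer with a false alarm.
COST_MISSED_FRAUD = 500.0
COST_FALSE_ALARM = 25.0

X, y = make_classification(n_samples=50_000, n_features=20,
                           weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

def expected_cost(threshold):
    flagged = scores >= threshold
    false_alarms = np.sum(flagged & (y_te == 0))   # legitimate but flagged
    missed = np.sum(~flagged & (y_te == 1))        # fraud that slipped through
    return false_alarms * COST_FALSE_ALARM + missed * COST_MISSED_FRAUD

thresholds = np.linspace(0.01, 0.99, 99)
best = min(thresholds, key=expected_cost)
print(f"best threshold: {best:.2f}, expected cost: ${expected_cost(best):,.0f}")
```

The more (and more representative) data that sits behind those scores, the better estimated the cost curve around the chosen threshold becomes, which is exactly where the extra business value comes from.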

How to Increase Predictive Power and Thus Business Value

A recent meme, “More data, not better models,” has become popular within the data science community. But despite its popularity, it is no longer true and rests on a false dichotomy: more data and better models are now both possible. When building a predictive model, there are several ways to improve accuracy. Fundamentally, one can use more sophisticated models, use more data or run more experiments across a large variety of models. The reality is that you need a combination of all three approaches.

A very large dataset paired with a simple model wastes the value of the extra data after a certain point. A real example of this practice is approaching fraud detection by training simple logistic regression models on the mountains of transaction data that are available. Throwing more data at an unsophisticated (linear) model will likely not deliver better results, so the naysayers are correct in a way.

The opposite is also true. Using a more sophisticated (non-linear) model with small data will also likely deliver sub-optimal results because these models crave data and are able to detect subtle patterns that may not be present in the sampled dataset. So, the naysayers are also incorrect. Data science on big data requires a balance of sophisticated models and as much data as possible.
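As a rough illustration of this balance (a sketch on synthetic data, not an experiment from the article), the snippet below trains a linear and a non-linear scikit-learn model on increasingly large slices of the same training set; typically the linear model’s test accuracy plateaus early while the non-linear model keeps improving as it sees more data:

```python
# Illustrative sketch: compare a linear and a non-linear model as the
# amount of training data grows. Dataset and sizes are arbitrary choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100_000, n_features=30,
                           n_informative=15, n_clusters_per_class=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for n in (1_000, 10_000, 80_000):
    Xs, ys = X_tr[:n], y_tr[:n]
    linear = LogisticRegression(max_iter=1000).fit(Xs, ys)
    nonlinear = GradientBoostingClassifier(random_state=0).fit(Xs, ys)
    auc_lin = roc_auc_score(y_te, linear.predict_proba(X_te)[:, 1])
    auc_non = roc_auc_score(y_te, nonlinear.predict_proba(X_te)[:, 1])
    print(f"n={n:>6}  linear AUC={auc_lin:.3f}  non-linear AUC={auc_non:.3f}")
```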

The last approach is to run as many experiments as possible in order to get the best model. But this last component can be very difficult, as experimentation on big data requires time and computational resources (unless one can reduce the time and computational power required to train models, which is actually possible, but we’ll save that for another day).
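For completeness, a minimal sketch of that experimental loop, assuming scikit-learn, a synthetic dataset and an arbitrary hyperparameter grid, is a randomized search in which the number of candidate models tried is bounded mainly by available time and compute:

```python
# Illustrative sketch of "run many experiments": a randomized search over
# hyperparameters. Dataset, model family and grid are arbitrary assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=10_000, n_features=25, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [50, 100, 200],
        "max_depth": [None, 5, 10, 20],
        "min_samples_leaf": [1, 5, 20],
    },
    n_iter=20,           # more iterations = more experiments = more compute
    scoring="roc_auc",
    cv=3,
    n_jobs=-1,
    random_state=0,
)
search.fit(X, y)
print("best params:", search.best_params_)
print("best CV AUC:", round(search.best_score_, 3))
```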

Why is Bigger Data Better?

There are a number of common situations in which the main data points of interest may actually constitute a tiny proportion of the entire dataset. Consequently, although the overall amount of data is very large, there may be only barely enough data to support the analysis of interest. The practice of subsampling, which is almost always uniform sampling, is highly likely to miss these critical points, effectively throwing away most, if not all, of the information of interest. Instead, one should collect the maximum amount of data possible for use in training predictive models. Situations like this include the following (a short sketch of the sampling problem appears after the list):

  • Outliers or small clusters. These are data points that are unusual with respect to the rest of the data. Identifying them may be of interest, for example, in the detection of new forms of transaction fraud, new forms of cyber security intrusions or new types of customer behaviors. This can be generalized from individual points to small clusters. They may also be of interest because they ultimately represent a kind of systematic error or noise in the system, which may trigger a cleaning of the data, or at least a probe to further understand the phenomenon.
  • Rare events or objects. These may be high-value data points of a known type, which are simply much less common than the rest of the data. For example, quasars are orders of magnitude less common than stars, yet they are key to understanding the earliest state of the universe. Another example is insurance claims: most people don’t make claims, but the instances of claims are exactly the data points that most inform a model of claims loss. Such situations often result in highly imbalanced classification training sets.
  • Rare discrete values or classes. Due to the combinatorial nature of discrete-valued variables, the effect of rare values is felt more acutely than with numeric variables. For example, in a zip code feature, most zip codes will have very few examples, and the interpolative capability that a numeric variable naturally provides must, in the worst case, be replaced by pooling or other manipulation of the data. A similar issue can arise in a multi-class problem with a large number of classes, such as product type.
  • Missing values. When values are not just rare but missing altogether, fewer of the combinations of values that define the manifold of the data are available for building the model. In general, values are not missing with equal probability, making certain parts of the space very sparsely observed, if at all. In some data, the total amount of non-missing data can be significantly smaller than the size of the whole dataset.
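Here is the sampling sketch referenced above. It assumes an illustrative 0.1 percent event rate and a 1 percent uniform sample: uniform sampling keeps only about 1 percent of the rare events, while a simple stratified alternative keeps all of them at a similar overall sample size.

```python
# Illustrative sketch of how uniform subsampling starves a model of rare
# events. The event rate and sample fraction are assumed figures.
import numpy as np

rng = np.random.default_rng(0)
N, EVENT_RATE, SAMPLE_FRACTION = 10_000_000, 0.001, 0.01

labels = rng.random(N) < EVENT_RATE          # True marks a rare event
print("rare events in full data:", labels.sum())

# Uniform 1% subsample: rare events are sampled away with everything else.
uniform = rng.random(N) < SAMPLE_FRACTION
print("rare events in uniform sample:", (labels & uniform).sum())

# Stratified alternative: keep every rare event, subsample the rest.
keep = labels | (rng.random(N) < SAMPLE_FRACTION)
print("rare events in stratified sample:", (labels & keep).sum())
print("total rows in stratified sample:", keep.sum())
```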

Business Value through Predictive Power

Seeking higher business value in a predictive problem boils down to maximizing predictive power. Bigger data increases predictive power for foundational statistical reasons: more observations reduce estimation error and make rare patterns visible to the model. Subsampling to obtain a smaller training dataset, by contrast, can be moderately to highly destructive to the results. Computation time lies at the center of maximizing predictive power, and recent computational advances have made it possible to increase predictive power, and thus business value, to a greater extent than was previously possible.

 

Alex Gray

About the author: Alexander Gray, Ph.D., is CTO at Skytree and associate professor in the College of Computing at Georgia Tech. His work has focused on algorithmic techniques for making machine learning tractable on massive datasets. He began working with large-scale scientific data in 1993 at NASA’s Jet Propulsion Laboratory in its Machine Learning Systems Group. He recently served on the National Academy of Sciences Committee on the analysis of massive data as a Kavli Scholar, and a Berkeley Simons Fellow, and is a frequent advisor and speaker on the topic of machine learning on big data in academia, science and industry.

 

Related Items:

Big Data Outliers: Friend or Foe?

Big Data Outlier Detection, for Fun and Profit

Inside Sibyl, Google’s Massively Parallel Machine Learning Platform
