Solving One of ML and AI’s Biggest Challenges: Exploratory Data Analysis
As organizations invest in Artificial Intelligence (AI) and Machine Learning (ML), they inevitably run into a paradox. The automation that ML models offer can be a critical advantage for many businesses. In theory, automation should drive down labor costs, but with each additional model comes a maintenance overhead burden many organizations underestimate.
While a lot of emphasis gets put on model building, data scientists must continually explore and analyze the underlying data for shifting patterns, data quality issues, and unforeseen changes to the business, which will erode the performance of even the most sophisticated pipelines. And it’s in this exploratory data analysis that things often go wrong. The work is often done manually and must be done frequently to catch potential changes in the data, which is time-consuming and arduous.
Organizations underestimate just how much effort will need to be spent on exploratory data analysis and as such, it often simply doesn’t get done. Ultimately, most AI and ML initiatives fail because stakeholders don’t catch and remediate data issues. One survey puts the failure rate at 87%.
The Exploratory Data Analysis Problem
The prudent scientist must interrogate the data with a laundry list of statistical questions to determine the data’s fit-for-use in AI and ML projects. Are there enough data points? How many data points are good and actually usable? Are there missing values or bad values? Are there any outliers? And on and on, each question branching out into a multitude of questions that need to be answered to preempt most if not all cases that could derail the project.
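To make a few of these questions concrete, here is a minimal sketch in Python that profiles a single numeric column. The function name `profile_column`, the sample values, and the choice of the 1.5 × IQR rule for outliers are illustrative assumptions, not a method prescribed by this article.

```python
import math
import statistics


def profile_column(values):
    """Answer a few basic fit-for-use questions about one numeric column.

    Hypothetical helper for illustration only.
    """
    total = len(values)

    # Treat None and NaN as missing values.
    usable = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    missing = total - len(usable)

    # Flag outliers with the common 1.5 * IQR rule (an assumed choice).
    outliers = []
    if len(usable) >= 4:
        q1, _, q3 = statistics.quantiles(usable, n=4)
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        outliers = [v for v in usable if v < low or v > high]

    return {
        "total": total,
        "usable": len(usable),
        "missing": missing,
        "outliers": outliers,
    }


# Made-up sample data: two missing entries and one suspicious value.
sample = [10, 12, 11, None, 13, 12, float("nan"), 11, 250]
report = profile_column(sample)
print(report)
```

Even this small check has to be repeated for every column of every table, and again whenever the data changes, which is where the manual effort adds up.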
Worst of all, this process isn’t a one-and-done activity! The business processes generating the data are inherently dynamic. As a result, a significant amount of time must be spent revisiting the data pipeline to ensure the data feeding into ML models are consistent. Airbnb found that nearly 70% of the time that data scientists spend developing ML models is allocated toward this crucial and thorough but lengthy exploratory analysis. This oceanic wave of unchecked overhead can easily drown out the automation benefits we strive to achieve with our projects.
Conventional data testing can help, but the usual test coverage won’t catch important changes in the data that are intimately related to your business. For instance, tests won’t shed light on the “unknown unknowns.” If you aren’t aware that an issue could occur, you cannot write a test to catch it. As the saying goes in boxing, “it’s the punch that you don’t see coming that knocks you out.”
Throwing more data scientists at the problem is rarely the answer for most companies. It isn’t cost-effective, and data scientists, like everyone else for that matter, simply don’t like this kind of work. Few people became data scientists to perform repetitive data cleaning. On top of that, manually adhering to these basic principles in a dynamic big data world does not scale and can lead to burnout. As a lightning rod paper from Google puts it, “everyone wants to do the model work, not the data work.”
Enter Data Monitoring and Anomaly Detection
John Tukey, one of the greatest mathematical statisticians who ever lived, wrote extensively about the teaching and practice of data analysis, and his lessons still apply in the Big Data age. Companies will benefit from the spirit of admitting, “I don’t know what is in the data, and I will strive to learn from it.”
Understanding the data is the key to AI and ML success, not a vast repertoire of algorithms.
But if the work needed to understand the data is too time-consuming and costly, what can data scientists do? One solution is a combination of automated data monitoring and anomaly detection.
Data monitoring repeatedly collects various statistics from tables and their columns, creating time series for things like freshness, row count, and a bevy of other metrics that reflect issues affecting the data itself. Anomaly detection on the resulting time series identifies events that should raise suspicion, helping to find degradation or surprising changes in the data serving your AI and ML processes, which in turn helps prevent bad data-driven decisions.
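As a rough illustration of the idea, the sketch below runs a simple trailing-window z-score check over a daily row-count series. The function name `detect_anomalies`, the window of 7 observations, the threshold of 3 standard deviations, and the sample counts are all assumptions chosen for the example; production monitoring tools typically use more robust methods.

```python
import statistics


def detect_anomalies(series, window=7, threshold=3.0):
    """Return indices of points that deviate strongly from the trailing window.

    Hypothetical helper for illustration; a z-score is only one of many
    possible anomaly detection techniques.
    """
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mean = statistics.fmean(history)
        stdev = statistics.stdev(history)

        if stdev == 0:
            # A perfectly flat history: any change at all is suspicious.
            if series[i] != mean:
                anomalies.append(i)
            continue

        z_score = (series[i] - mean) / stdev
        if abs(z_score) > threshold:
            anomalies.append(i)
    return anomalies


# Made-up daily row counts for one table; day 8 shows a sudden drop.
row_counts = [1000, 1010, 990, 1005, 995, 1002, 998, 1001, 400, 1003]
flagged = detect_anomalies(row_counts)
print(flagged)
```

The same check could be pointed at any other monitored statistic, such as freshness. One limitation of this simple version is that an anomalous point stays in the trailing window and inflates the standard deviation for the following days, so a median-based variant may be preferable in practice.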
With automation doing the heavy lifting of the essential but tedious data monitoring task, the scientist no longer needs to explore the data constantly and instead has a clear picture, at all times, of whether the data is behaving as expected. When an issue emerges, the scientist must still investigate and figure out what is actually happening, but they have much more time for the modeling work. Now, they can focus on the actual business problems rather than data problems.
AI and ML Demand a Deep Understanding of the Data
The success of your AI and ML initiatives more often than not rests on the data rather than the algorithms. As the saying goes, “garbage in, garbage out.” But few organizations consider just how time-consuming this data work really is and how often it will need to be done, and as a result, AI and ML projects stall. Dynamic data monitoring and anomaly detection can remove much of this exploratory data analysis work, helping data scientists understand the data better and enabling them to spend more time on the models.
About the author: Henry Li is a data scientist who develops anomaly detection and forecasting solutions for intelligent infrastructure and high-dimensional data problems. Henry is currently a senior data scientist at Bigeye. Previously, he was a data scientist at Uber, where he focused on intelligent decision systems.