Big Data Meltdown: How Unclean, Unlabeled, and Poorly Managed Data Dooms AI
We may be living in the fourth industrial age, on the cusp of huge advances in AI-powered automation. But according to the latest data, that great future will be less rosy if enterprises don't start doing something about one thing in particular: the poor state of their data.
That’s the gist of several reports making the rounds recently, as well as interviews with industry experts. Time after time, the lack of clean, well-managed, and labeled data was cited as a major impediment to enterprises getting value out of AI.
Last month, Figure Eight (formerly CrowdFlower) released a study about the state of AI and machine learning. The company, which helps generate training data for customers, found a decided lack of data ready to be used to train machine learning algorithms.
The study found that only 21% of respondents said their data was both ready for AI (that is, organized, accessible, and annotated) and being used for that purpose. Another 15% reported their data was organized, accessible, and annotated, but was either not being utilized or was being used for other business purposes.
Alegion, which is also in the data labeling business, released its own study that came to a remarkably similar conclusion: data quality and labeling issues had negatively impacted nearly four out of five AI and machine learning projects.
“The nascency of enterprise AI has led more than half of the surveyed companies to label their training data internally or build their own data annotation tool,” the company stated. “Unfortunately, 8 out of 10 companies indicate that training AI/ML algorithms is more challenging than they expected, and nearly as many report problems with projects stalling.”
Bad data is nothing new to enterprises. For decades, “garbage in, garbage out” has been the rallying cry of IT professionals who stressed to users the importance of accurate data entry. DBAs would spend hours building referential integrity into their relational databases, which eliminated some gibberish from entering the record.
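The referential integrity that DBAs built into relational databases can be sketched in a few lines: the database itself refuses rows that reference records which don’t exist, so that particular kind of garbage never makes it in. (This is a minimal illustration using SQLite’s foreign key support; the table names are invented for the example.)

```python
import sqlite3

# In-memory database; SQLite disables foreign key checks by default,
# so they must be switched on explicitly.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id)
    )
""")

conn.execute("INSERT INTO customers VALUES (1, 'Acme Corp')")
conn.execute("INSERT INTO orders VALUES (100, 1)")   # valid: customer 1 exists

try:
    conn.execute("INSERT INTO orders VALUES (101, 42)")  # no customer 42
except sqlite3.IntegrityError as e:
    # The bad row is rejected at the door -- "garbage in" prevented.
    print("rejected:", e)
```

The point of the constraint is that data cleanliness is enforced at write time, rather than discovered (or not) at analysis time.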
Further up the stack, data warehouse architects would spend millions building elaborate data models and enforcing master data management (MDM) standards. Companies spent huge sums during the height of the data warehousing era ensuring their transactional data was clean and normalized before loading it into MPP databases for analysis.
But the modern world is a different place. Today, 12 years into the Apache Hadoop experiment and in the midst of the great cloud migration, it should come as no surprise that getting clean and well-managed data still isn’t easy. In fact, it’s even harder now that much of the data is less structured than what enterprises were dealing with at the turn of the century.
“We know that a large percentage of the world’s data are just incorrect,” says James Cotton, who is the international director of Information Builders’ Data Management Centre of Excellence. “These data quality errors come from all sorts of places. The problem is, once we apply them to AI, regardless of the ethical questions about how an AI should handle them, the AI is just going to make bad decisions at scale.”
Mucking Up the AI Gears
The level of hype surrounding AI may belie its actual utility at the moment. More than eight out of 10 big data projects are failures, Gartner analyst Nick Heudecker said last year. According to a 2018 PwC study on AI, only 3% of AI projects have actually been implemented and are generating a positive ROI. Ali Ghodsi, the CEO and co-founder of Databricks, the commercial vendor behind Apache Spark (and a 2019 Datanami Person to Watch), dubbed this “AI’s 1% problem.”
It’s easy to get sidetracked by the rapid pace of technological change, but it’s more important to focus on the data, says Chris Lynch, the CEO of AtScale, which develops an OLAP-style query engine that can deliver insight from big data repositories residing on-prem and in the cloud. “Everyone is running around [saying] ‘We’ve got this magic model, we’ve got this magical algorithm,'” Lynch tells Datanami. “Models and algorithms are the commodity. Data is what’s unique.”
Today, companies are stockpiling data in the hopes of using it for training machine learning algorithms. But the data quality issues that companies face today haven’t changed much over the past 20 years, says Trifacta CEO Adam Wilson.
“I’ve been in and around this market for the last 20 years and I’d say the data quality challenges that customers faced 15 to 20 years ago, even when the world was mostly transactional and the data was mostly structured — a lot of that is as true today as it was then,” Wilson tells Datanami.
“The rise of AI and machine learning has only highlighted the importance of this again for folks,” he continues, “because now, if you’re starting to talk about automating decision-making and you’re using AI, the last thing you want to do is automate bad decisions faster based on bad data. So it has certainly amplified the need for data quality.”
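The kind of quality gate Wilson is arguing for can be sketched simply: validate records before they reach an automated decision or training pipeline, so bad data is caught rather than amplified. The field names and rules below are illustrative assumptions, not anything from the article or from Trifacta’s product.

```python
# Hypothetical data-quality gate: every field name and rule here is an
# assumption made for illustration, not a real schema from the article.

def validate_record(record: dict) -> list:
    """Return a list of data-quality problems found in one record."""
    problems = []
    if record.get("customer_id") is None:
        problems.append("missing customer_id")
    age = record.get("age")
    if age is not None and not (0 <= age <= 120):
        problems.append("implausible age: %s" % age)
    if record.get("label") not in {"approve", "deny"}:
        problems.append("missing or unknown training label")
    return problems

records = [
    {"customer_id": 1, "age": 34, "label": "approve"},
    {"customer_id": None, "age": 250, "label": "maybe"},
]

# Only records that pass every check are allowed into the pipeline.
clean = [r for r in records if not validate_record(r)]
print("%d of %d records are fit for training" % (len(clean), len(records)))
```

The design choice is to fail records early and loudly: a rejected row is a visible, fixable event, whereas a bad row that reaches a trained model becomes a bad decision made at scale.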
The good news is that companies may be starting to get the message. Wilson says we’re in the midst of three mega-trends, all happening at the same time: the rise of ML and AI, a shift to self-service BI, and the massive cloud migration.
“Any one of those trends on their own would potentially be disruptive. The fact that they’re all happening simultaneously is really causing organizations to step back and completely rethink their strategy,” the Trifacta CEO says. “It really points to the size of the opportunity that’s ahead of us and how motivated the customers are to really make meaningful progress here, and not just say ‘Hey, let’s be more data-driven,’ but to actually get the data in the hands of the people who know best.”