May 23, 2019

Faulty Data is Stalling AI Projects

Tens of billions will be spent this year on AI development, but those efforts continue to be stymied by ratty data that has undermined model training efforts and burned through project budgets.

That’s the sobering conclusion of a vendor survey of data scientists, AI technologists and business executives that uncovered widespread problems with data quality, specifically the data labeling required to train AI models. The result is that most AI projects are stalled, with little to show for early and substantial investments.

The survey released Thursday (May 23) by AI training data specialist Alegion found that despite heavy investment in focused AI and machine learning projects (most respondents said they have four or fewer projects in development), 78 percent of those projects have slowed at some stage before deployment.

The primary reason is data quality and labeling challenges, prompting many early movers to either develop an in-house solution or outsource data labeling needed to transition machine learning projects to production.

“The nascency of enterprise AI has led more than half of the surveyed companies to label their training data internally or build their own data annotation tool,” the survey found. “Unfortunately, 8 out of 10 companies indicate that training AI/ML algorithms is more challenging than they expected, and nearly as many report problems with projects stalling.”

Hence, survey commissioner Alegion, a specialist in crowdsourcing machine learning steps like data labeling, emphasizes that 71 percent of development teams ultimately outsource those project activities.

Along with a lack of prepped data and the human resources needed to accurately label data sets, two thirds of respondents cited bias or errors in data as the biggest challenge in training their AI models.

In addition to data quality, the survey also sheds light on data quantities required to achieve confidence in AI models. On a scale of between 100,000 and more than 10 million data points, 43 percent of respondents said they require up to 1 million data items to achieve “production-level model confidence.” Meanwhile, 72 percent said model confidence would require labeled data totaling more than 100,000 items.

Coming up with that much labeled data has proven problematic, prompting 81 percent of those polled to conclude that training AI models has turned out to be more difficult than expected.

Those unforeseen consequences are spawning a cottage industry of data labeling specialists like Alegion to help fill the gaps as AI first movers struggle to get machine learning projects off the ground. The Austin-based company pitches a platform that integrates machine intelligence to scale data labeling efforts and convert raw data into “model-ready training data.”

There appears to be growing demand for such tools as projected AI spending soars. According to market tracker IDC, global AI investments are expected to more than double through 2022 to $79.2 billion. U.S. companies currently account for two thirds of the estimated $35.8 billion in AI spending this year, mostly by the banking and retail sectors.
