July 1, 2014

Survey: Variety, Not Volume, Stymies Data Scientists

The diversity of data types, not sheer data volume, is the biggest challenge facing data scientists, according to a new survey of big data practitioners.

One consequence, warns the survey by computational database specialist Paradigm4, is that data variety is causing frustrated scientists to “leave data on the table.” Of the 111 data scientists responding to the survey, fully 71 percent said big data made analytics more difficult. Data variety rather than volume was most often cited as the primary reason.

The survey also found that 36 percent of respondents said it takes too much time to glean insights from data sets that are too big to move to analytics software. “The increasing variety of data sources is forcing data scientists into shortcuts that leave data and money on the table,” Paradigm4 CEO Marilyn Matz said in a statement releasing the survey findings.

“The focus on the volume of data hides the real challenge of data analytics today. Only by addressing the challenge of utilizing diverse types of data will we be able to unlock the enormous potential of analytics,” Matz argued.

The Hadoop platform also came in for some hard knocks from data scientists. The survey found that 48 percent of respondents have used Hadoop or Spark, a faster processing engine that can run on Hadoop clusters. Of those, 76 percent said the tools were too slow, required too much effort to program, or had other limitations.

Nearly half of respondents complained that it is becoming harder to fit their data into relational database tables. “Incorporating the diverse data types into analytical workflows is a major pain point for data scientists using traditional relational database software,” the survey warned. One consequence: 39 percent of those surveyed reported more job stress.

For complex analytics, the survey found, data scientists are being forced to move large volumes of stored data to dedicated mathematical and statistical computing software. That step takes time and requires additional coding that “adds no analytical value and impedes productivity,” the survey found.
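The data-movement step the survey describes is easy to picture. As an illustration (not drawn from the survey itself), here is a minimal, self-contained Python sketch of that workflow, pulling rows out of a database into pandas for offline statistical work; the table and column names are hypothetical:

```python
# Hypothetical example of the "move stored data to stats software" step.
# The readings table and its columns are illustrative assumptions, not
# anything described in the Paradigm4 survey.
import sqlite3

import pandas as pd

# Stand-in for the stored data: an in-memory SQLite table of sensor readings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id TEXT, ts TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("a", "2014-03-01", 1.0),
        ("a", "2014-03-02", 2.0),
        ("b", "2014-03-01", 3.0),
        ("b", "2014-03-02", 5.0),
    ],
)

# Step 1: extract -- copy the stored data out of the database. At real
# scale, this transfer is the time-consuming step the survey complains about.
df = pd.read_sql_query("SELECT sensor_id, ts, value FROM readings", conn)
conn.close()

# Step 2: reshape -- glue code that adds no analytical value, only massaging
# rows into the matrix layout the statistics library expects.
matrix = df.pivot_table(index="ts", columns="sensor_id", values="value")

# Step 3: only now does the analysis itself begin, e.g. correlations
# across sensors computed outside the database.
print(matrix.corr())
```

Steps 1 and 2 are the transfer and reshaping code the survey says “adds no analytical value”; only step 3 is actual analysis.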

The survey echoes other recent reports about the “fragmentation” of data.

One proponent of “information compression” techniques recently argued that another part of the big data problem is the way “knowledge” is represented in computers.

For example, the researcher cited the long list of image formats such as GIF and JPEG. “This jumble of different formalisms and formats for knowledge is a great complication in the processing of big data,” argued data researcher Gerry Wolff of CognitiveResearch.org.

Despite these challenges, the survey did identify some positive trends. For example, 59 percent of respondents said their company was already using complex analytics to sift through big data. An additional 31 percent said they plan to do so within the next two years.

The bottom line, according to the Paradigm4 survey results, is that “the ability to effectively use diverse data sources is proving to be a competitive differentiator in many industries.”

Paradigm4’s survey of 111 data scientists was conducted by independent research firm Innovation Enterprise in March and April 2014.

Related items:

Can the ‘SP Machine’ Straighten Out Big Data

Apache Spark: 3 Real-World Use Cases
