Hellerstein: Humans are the Bottleneck
Humans are the bottleneck right now in the data space, database systems luminary Joe Hellerstein said during an interview this week at Strata 2013.
Joseph M. Hellerstein is a professor of Computer Science at the University of California, Berkeley, where he works on database systems and computer networks.
“As Moore’s law drives the cost of computing down, and as data becomes more prevalent as a result, what we see is that the remaining bottleneck in computing costs is the human factor,” says Hellerstein, one of the fathers of adaptive query processing and a half dozen other database technologies.
Hellerstein says that recent studies conducted at Stanford and Berkeley have found that 50 to 80 percent of a data analyst’s time is spent on data grunt work, with the rest left for custom coding, analysis, and other duties.
“Data prep, data wrangling, data munging are words you hear over and over,” says Hellerstein. “Even with very highly skilled professionals in the data analysis space, this is where they’re spending their time, and it really is a big bottleneck.”
Asked whether this is a problem of team size or a technological problem to be solved down the road, Hellerstein indicated that the two go hand in hand. He says that what he and fellow data programmers would ultimately like to see is more machine-driven processes, but he acknowledges that getting there involves a human side, such as how people perceive and visualize data, as well as a machine side, where scaling, statistics, and machine learning are involved.
“What we’d like to see is more and more people being able to do these things through automation, and the experts then spending their time on things that are of better use of their expertise,” commented Hellerstein.
“In the end, at some level, it’s a programming problem, and so what we really need to be thinking about is how we make people productive in this task of doing the programming around data prep, data cleaning, data wrangling, and data assessment.”