September 12, 2012

Research Aims to Automate the Impossible

Ian Armas Foster

Dr. Michael Cafarella is one of the leading experts in the big data field, having created Hadoop predecessor Nutch with Hadoop creator Doug Cutting.

These days, Cafarella is a professor at the University of Michigan, where he is free to explore his obsession with automating that which cannot be easily automated. At the heart of some of his adventures in automation is a surprisingly “simple” tool—the almighty spreadsheet.

Many businesses have been creating hundreds of spreadsheets for years. Those spreadsheets could provide a significant amount of useful data. However, spreadsheets are not easily translatable to relational databases that could make use of that information. That’s where Cafarella’s research comes in.

“When you open up a spreadsheet and start typing stuff in, you’re creating a small database, something that may last for decades. People’s spreadsheets, especially in companies, stick around a lot longer than they think they’re going to… We’re working on a mechanism that automatically transforms these spreadsheets into a more traditional database so you can get a lot more services off of it. You can ask queries, make visualizations.”
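To make the idea concrete, here is a minimal sketch of loading a tidy, spreadsheet-like sheet into a relational table so it can be queried. It is not Cafarella's system: it assumes the sheet is already a clean header-plus-rows grid exported as CSV, which is the easy case his research aims to go beyond. The sample data and the names `infer_type` and `sheet_to_table` are hypothetical, chosen only for illustration.

```python
import csv
import io
import sqlite3


def infer_type(values):
    """Guess a SQLite column type from the cell strings in one column."""
    try:
        for v in values:
            float(v)
        return "REAL"
    except ValueError:
        return "TEXT"


def sheet_to_table(conn, csv_text, table):
    """Load header-plus-rows CSV text into a new SQLite table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    columns = list(zip(*body))  # transpose rows into columns
    types = [infer_type(col) for col in columns]
    col_defs = ", ".join(f'"{name}" {typ}' for name, typ in zip(header, types))
    conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', body)
    return types


# Hypothetical spreadsheet contents, exported as CSV.
SHEET = "region,quarter,sales\nEast,Q1,120.5\nWest,Q1,98\nEast,Q2,130\n"

conn = sqlite3.connect(":memory:")
column_types = sheet_to_table(conn, SHEET, "sales")

# Once the data is relational, ordinary SQL queries work on it.
totals = conn.execute(
    "SELECT region, SUM(sales) FROM sales GROUP BY region ORDER BY region"
).fetchall()
```

Real spreadsheets rarely look this clean. Merged cells, stacked headers, and embedded subtotals are what make the automatic transformation a research problem.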

He lays out his most recent automation projects in the video below, which was produced by the University of Michigan. While informative on its own, it also serves as an introduction for the school’s computer science and engineering students.

His research caught the attention of Dow Chemical, which awarded Cafarella a portion of a $250 million grant to advance his work on incorporating spreadsheets into big data platforms, according to the University of Michigan.

Along with receiving a significant one-year grant from Dow, Cafarella has also received funding from Yahoo (a heavy user of Hadoop), General Electric, Google, and the National Science Foundation. The latter awarded him an NSF CAREER award last year, according to his bio.

Another of his projects involves using text-based analytics from blog posts, social media, and other sources to determine the state and path of the national economy. The goal is to develop a system that surpasses the government’s efficiency with a fraction of the resources.

“The federal government,” Cafarella said, “spends a lot of money and time collecting data about things like unemployment. We’re trying to develop a mechanism that looks at what people are saying in order to reproduce or add on to those statistics, a lot cheaper than their techniques. It comes out a lot quicker.”

While news articles, polls, and statistics can be informative indicators of the economy, those large-scale sources are subject to reporting bias. Blog posts and social media, on the other hand, carry exactly the bias Cafarella is looking for: the answer to the question, “How is the economy treating this individual?”

“If someone writes in a blog post, ‘I really need to find a job,’” Cafarella explains, “that’s a little bit of evidence that the economy is getting a little bit worse. Our project tries to look at a lot of sources like that and tries to make a prediction about what the economy is going to do.”
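A toy version of that idea, offered only as a sketch, counts the share of posts that contain a job-seeking phrase. The phrase list, the sample posts, and the name `job_seeking_share` are all hypothetical, and simple pattern matching is a stand-in assumption here, not the method Cafarella's project uses.

```python
import re

# Hypothetical phrases treated as weak evidence of job seeking.
JOB_SEEKING_PATTERNS = [
    re.compile(r"\bneed to find a job\b", re.IGNORECASE),
    re.compile(r"\blost my job\b", re.IGNORECASE),
    re.compile(r"\bgot laid off\b", re.IGNORECASE),
]


def job_seeking_share(posts):
    """Return the fraction of posts matching at least one pattern."""
    if not posts:
        return 0.0
    hits = sum(
        1 for post in posts
        if any(p.search(post) for p in JOB_SEEKING_PATTERNS)
    )
    return hits / len(posts)


# Hypothetical sample of posts collected over one period.
posts = [
    "I really need to find a job.",
    "Great weather for a hike today!",
    "Just got laid off, updating my resume.",
    "New phone arrived.",
]
share = job_seeking_share(posts)
```

Tracking a share like this over time, and comparing it against official unemployment figures, is the kind of signal the project describes. Fixed phrases miss sarcasm, negation, and paraphrase, which is where the harder semantic work comes in.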

Of course, the sheer volume of social media makes it difficult to analyze, even before taking into account the notorious effort required to teach computers semantics.

Either way, Cafarella appears to be making progress. He is using this background in text-based analytics to develop a platform called RecordBreaker for Cloudera. According to the Cloudera website, “The RecordBreaker project aims to automatically generate structure for text-embedded data.” Cloudera is a significant Hadoop vendor, and its choice of Cafarella for this work suggests that his research has advanced considerably.

Cafarella’s work, already instrumental in developing Hadoop, could serve to advance the automation of processes that are rather difficult to automate.
