September 12, 2012

Research Aims to Automate the Impossible


Dr. Michael Cafarella is one of the leading experts in the big data field, having created Hadoop predecessor Nutch with Hadoop creator Doug Cutting.

These days, Cafarella is a professor at the University of Michigan, where he is free to explore his obsession with automating that which cannot be easily automated. At the heart of some of his adventures in automation is a surprisingly “simple” tool—the almighty spreadsheet.

Many businesses have been creating hundreds of spreadsheets for years. Those spreadsheets could provide a significant amount of useful data. However, spreadsheets are not easily translatable to relational databases that could make use of that information. That’s where Cafarella’s research comes in.

“When you open up a spreadsheet and start typing stuff in, you’re creating a small database, something that may last for decades. People’s spreadsheets, especially in companies, stick around a lot longer than they think they’re going to… We’re working on a mechanism that automatically transforms these spreadsheets into a more traditional database so you can get a lot more services off of it. You can ask queries, make visualizations.”
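To make the idea concrete, here is a minimal sketch of what "spreadsheet to database" can look like in the simplest case: a clean sheet with one header row. It is not Cafarella's mechanism, which targets far messier spreadsheets. The sheet contents, the column names, and the helper names `coerce` and `load_sheet` are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical spreadsheet exported as CSV; the columns are illustrative only.
SHEET = """region,quarter,sales
North,Q1,120
North,Q2,135
South,Q1,90
"""

def coerce(cell):
    """Turn numeric-looking cells into int or float; leave other text as-is."""
    for cast in (int, float):
        try:
            return cast(cell)
        except ValueError:
            pass
    return cell

def load_sheet(csv_text, table="sheet"):
    """Load header-plus-rows CSV text into an in-memory SQLite table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f'"{name}"' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    marks = ", ".join("?" for _ in header)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({marks})',
        [[coerce(cell) for cell in row] for row in data],
    )
    return conn

conn = load_sheet(SHEET)
# Once the data is relational, it can be queried like any other table.
total_by_region = conn.execute(
    "SELECT region, SUM(sales) FROM sheet GROUP BY region ORDER BY region"
).fetchall()
```

Real spreadsheets rarely have a single tidy header row, which is presumably where the research effort lies. The sketch only shows the payoff Cafarella describes: once the data is in a database, you can ask queries of it.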

He lays out his most recent automation projects in a video produced by the University of Michigan. While informative on its own, the video also serves as a sort of introduction for the school’s computer science and engineering students.

His research caught the attention of Dow Chemical, which awarded Cafarella a portion of a $250 million grant to advance his work on incorporating spreadsheets into big data platforms, according to the University of Michigan.

Along with receiving a significant one-year grant from Dow, Cafarella has also received funding from Yahoo (a heavy user of Hadoop), General Electric, Google, and the National Science Foundation. The latter awarded him an NSF CAREER award last year, according to his bio.

Another of his projects involves using text-based analytics from blog posts, social media, and other sources to determine the state and path of the national economy. The goal is to develop a system that surpasses the government’s efficiency with a fraction of the resources.

“The federal government,” Cafarella said, “spends a lot of money and time collecting data about things like unemployment. We’re trying to develop a mechanism that looks at what people are saying in order to reproduce or add on to those statistics, a lot cheaper than their techniques. It comes out a lot quicker.”

While news articles, polls, and statistics can be informative indicators of the economy, those large-scale sources are subject to reporting bias. Blog posts and social media, on the other hand, harbor the exact bias Cafarella is looking for: the answer to the question, “How is the economy treating this individual?”

“If someone writes in a blog post, ‘I really need to find a job,’” Cafarella explains, “that’s a little bit of evidence that the economy is getting a little bit worse. Our project tries to look at a lot of sources like that and tries to make a prediction about what the economy is going to do.”
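As a toy illustration of that kind of evidence gathering, the sketch below measures the share of posts that contain a job-seeking phrase. This is an assumption-laden simplification, not the project's actual method: the phrase list, the sample posts, and the function name `job_seeking_share` are all hypothetical.

```python
import re

# Hypothetical phrases chosen for illustration; a real system would need far more.
JOB_SEEKING = [r"need to find a job", r"lost my job", r"looking for work"]
PATTERN = re.compile("|".join(JOB_SEEKING), re.IGNORECASE)

def job_seeking_share(posts):
    """Return the fraction of posts containing at least one job-seeking phrase."""
    if not posts:
        return 0.0
    hits = sum(1 for text in posts if PATTERN.search(text))
    return hits / len(posts)

# Made-up sample posts.
posts = [
    "I really need to find a job",
    "Great weather today",
    "Lost my job last week, looking for work",
    "New phone arrived",
]
share = job_seeking_share(posts)
```

Tracking a share like this over time would give one crude signal. Plain phrase matching ignores context and sarcasm, so it says nothing about how such a signal would be validated against official statistics.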

Of course, the sheer volume of social media makes it difficult to analyze, even before taking into account the notorious effort required to teach computers semantics.

Either way, Cafarella appears to be making progress. He is using this background in text-based analytics to develop a platform called RecordBreaker for Cloudera. According to the Cloudera website, “The RecordBreaker project aims to automatically generate structure for text-embedded data.” Cloudera is a significant Hadoop vendor, and the fact that it tapped Cafarella for this work suggests he has made real headway.

Cafarella’s work, already instrumental in developing Hadoop, could serve to advance the automation of processes that are rather difficult to automate.
