September 12, 2012

Research Aims to Automate the Impossible

Ian Armas Foster

Dr. Michael Cafarella is one of the leading experts in the big data field, having created Hadoop predecessor Nutch with Hadoop creator Doug Cutting.

These days, Cafarella is a professor at the University of Michigan, where he is free to explore his obsession with automating that which cannot be easily automated. At the heart of some of his adventures in automation is a surprisingly “simple” tool—the almighty spreadsheet.

Many businesses have been creating hundreds of spreadsheets for years. Those spreadsheets could provide a significant amount of useful data. However, spreadsheets are not easily translatable to relational databases that could make use of that information. That’s where Cafarella’s research comes in.

“When you open up a spreadsheet and start typing stuff in, you’re creating a small database, something that may last for decades. People’s spreadsheets, especially in companies, stick around a lot longer than they think they’re going to… We’re working on a mechanism that automatically transforms these spreadsheets into a more traditional database so you can get a lot more services off of it. You can ask queries, make visualizations.”
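To make the payoff concrete, here is a minimal sketch of what becomes possible once a spreadsheet lives in a relational database. It is not Cafarella’s system: his research targets messy, real-world spreadsheets, while this example assumes a clean sheet with a single header row. The sheet contents, the `sales` table name, and the helper functions are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical spreadsheet contents exported as CSV.
# Assumption: the sheet is clean, with exactly one header row.
SHEET = """region,quarter,revenue
East,Q1,120
East,Q2,135
West,Q1,90
West,Q2,110
"""


def load_sheet(conn, table, text):
    """Create `table` from CSV text and return the number of rows inserted."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    # Quote identifiers so header names cannot break the SQL statement.
    cols = ", ".join('"{}"'.format(h.replace('"', '""')) for h in header)
    marks = ", ".join("?" for _ in header)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, cols))
    conn.executemany('INSERT INTO "{}" VALUES ({})'.format(table, marks), data)
    return len(data)


def revenue_by_region(conn):
    """Run an aggregate query against the hypothetical `sales` table."""
    cur = conn.execute(
        "SELECT region, SUM(CAST(revenue AS INTEGER)) FROM sales "
        "GROUP BY region ORDER BY region"
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
inserted = load_sheet(conn, "sales", SHEET)
totals = revenue_by_region(conn)
print(inserted, totals)
```

The hard part of the research is everything this sketch assumes away: working out where the headers are, which cells are data, and how the columns relate.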

He lays out his most recent automation projects in a video produced by the University of Michigan. The video is informative on its own and also serves as a sort of introduction for the school’s computer science and engineering students.

His research caught the attention of Dow Chemical, which awarded Cafarella a portion of a $250 million grant to advance his work on incorporating spreadsheets into big data platforms, according to the University of Michigan.

Along with receiving a significant one-year grant from Dow, Cafarella has also received funding from Yahoo (a heavy user of Hadoop), General Electric, Google, and the National Science Foundation. The NSF awarded him a CAREER award last year, according to his bio.

Another of his projects involves using text-based analytics from blog posts, social media, and other sources to determine the state and path of the national economy. The goal is to develop a system that surpasses the government’s efficiency with a fraction of the resources.

“The federal government,” Cafarella said, “spends a lot of money and time collecting data about things like unemployment. We’re trying to develop a mechanism that looks at what people are saying in order to reproduce or add on to those statistics, a lot cheaper than their techniques. It comes out a lot quicker.”

While news articles, polls, and statistics can be informative indicators of the economy, those large-scale sources are subject to reporting bias. Blog posts and social media, on the other hand, harbor the exact bias Cafarella is looking for: the answer to the question, “How is the economy treating this individual?”

“If someone writes in a blog post, ‘I really need to find a job,’” Cafarella explains, “that’s a little bit of evidence that the economy is getting a little bit worse. Our project tries to look at a lot of sources like that and tries to make a prediction about what the economy is going to do.”
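The idea of treating individual posts as small pieces of evidence can be illustrated with a deliberately crude sketch. It is not the project’s actual method, which the article does not describe. The phrase list, the sample posts, and the function names are hypothetical, and simple pattern matching stands in for real text analytics.

```python
import re

# Hypothetical phrases treated as weak evidence of job-market stress.
JOB_SEEKING_PATTERNS = [
    re.compile(r"\bneed(?:s)? to find a job\b", re.IGNORECASE),
    re.compile(r"\blost my job\b", re.IGNORECASE),
    re.compile(r"\blooking for work\b", re.IGNORECASE),
]


def mentions_job_seeking(post):
    """Return True if the post matches at least one pattern."""
    return any(p.search(post) for p in JOB_SEEKING_PATTERNS)


def job_seeking_share(posts):
    """Fraction of posts that match a pattern; 0.0 for an empty list."""
    if not posts:
        return 0.0
    hits = sum(1 for post in posts if mentions_job_seeking(post))
    return hits / len(posts)


# Made-up sample posts, for illustration only.
posts = [
    "I really need to find a job.",
    "Great weather at the lake today!",
    "Lost my job last week, looking for work now.",
    "New phone arrived.",
]
share = job_seeking_share(posts)
print(share)
```

Tracked over time and across many sources, a share like this could serve as one rough input to a forecast. On its own it says nothing about sarcasm, context, or who is posting.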

Of course, the sheer volume of social media makes it difficult to analyze, even before taking into account the notorious effort required to teach computers semantics.

Either way, Cafarella appears to be making progress. He is using this background in text-based analytics to develop a platform called RecordBreaker for Cloudera. According to the Cloudera website, “The RecordBreaker project aims to automatically generate structure for text-embedded data.” Cloudera is a significant Hadoop vendor, and the fact that it tapped Cafarella for this work suggests he has made real headway.
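For a rough sense of what generating structure for text-embedded data can mean, consider the toy sketch below. It does not use RecordBreaker or reflect how it works. It simply assumes whitespace-separated log lines with a fixed number of columns and guesses a type for each one. The sample log and the function names are hypothetical.

```python
def infer_type(token):
    """Guess the narrowest type name that can parse the token."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            caster(token)
            return name
        except ValueError:
            pass
    return "str"


def infer_schema(lines):
    """Return one type name per whitespace-separated column.

    Each column gets the most general type seen in it: str > float > int.
    """
    rank = {"int": 0, "float": 1, "str": 2}
    schema = []
    for line in lines:
        types = [infer_type(t) for t in line.split()]
        if not schema:
            schema = types
        elif len(types) != len(schema):
            raise ValueError("lines have differing column counts")
        else:
            schema = [a if rank[a] >= rank[b] else b
                      for a, b in zip(schema, types)]
    return schema


# Made-up log lines, for illustration only.
log = [
    "2012-09-12 GET 200 0.35",
    "2012-09-12 POST 500 2",
]
schema = infer_schema(log)
print(schema)
```

Real text-embedded data rarely splits this cleanly, which is why automating the step is a research problem rather than a few lines of code.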

Cafarella’s work, already instrumental in developing Hadoop, could serve to advance the automation of processes that are rather difficult to automate.


Applications: Research Analytics

Sectors: Academia

Tags: automation, cafarella, cloudera, data, michigan, nutch, spreadsheets
