October 24, 2012

Greenplum, Kaggle Team to Prospect Data Scientists

Ian Armas Foster

With the zettabytes of data available to the world and businesses looking to mine that data for insight, many executives look at the technical landscape and say, “If only we had more qualified data scientists.” The problem is that data science is a relatively new discipline, and universities do not yet offer degrees in it. As a result, according to the McKinsey Global Institute, businesses in the United States alone will be short 140,000 to 190,000 data scientists by the year 2018.

Greenplum intends to help solve this problem with a complete open sourcing of their Chorus platform and the resulting partnership with Kaggle, a website which fosters growth in the data science community by hosting data mining competitions among its 57,000 participants. We caught up with Greenplum’s Senior Director of Corporate Development and Strategy, Michael Maxey, and Kaggle CEO Anthony Goldbloom last week ahead of Hadoop World. They discussed the partnership between Greenplum and Kaggle, its impact on Greenplum’s Chorus as well as the implications for data science as a whole.

“We’ve sort of branded Chorus as a collaborative platform for data science,” said Maxey. “Fundamentally, Chorus came out of our experience working with our customers as well as working with our own data science team and with organizations like Kaggle.”

Kaggle itself is a fascinating and effective method for identifying the world’s top-notch data scientists. The discipline is relatively new, which makes sense because the problem it addresses is new as well: companies are overwhelmed by the world’s data, 90% of which was created in the last two years. As such, it is difficult to tell what on a resume qualifies someone to run complex algorithms over large datasets.

Per Goldbloom, “Their C.V. might show that they have a computer science background but a company might go, ‘well, how do they know enough statistics?’ So their C.V. might show they have a statistics background, the company might wonder, ‘can they wrangle large datasets?’ which requires quite a bit of computing skills. What Kaggle does is it allows people to prove they have the skills instead of having to rely on a C.V. that may not answer all the questions.” Indeed, he goes on to note that those atop the Kaggle leaderboards frequently have backgrounds in physics and electrical engineering, where problem solving is at a premium.

They prove those skills over various competitions, where people from all over the world form teams to solve problems that companies bring to Kaggle. The team with the best algorithm wins the competition and the resulting prize money. Because the competitions are ongoing, the best engineers could hypothetically make a living off of Kaggle. Currently, the Heritage Health Prize, where teams attempt to “identify patients who will be admitted to a hospital within the next year using historical claims data,” is being contested among over 1,400 teams for a purse of $3 million.

Integrating with Kaggle was important to Greenplum in order to expand its clients’ ability to get in contact with those floating in the data science free agent pool. Further, Greenplum’s backing could lend enhanced stability to an important environment for fostering data science skills.

“There are 57,000 folks in this community that, prior to the invention of Kaggle, couldn’t really communicate,” said Maxey. “As an enterprise, when you’re trying to solve a problem around customer churn, or fraud, or something to that effect, you have your in-house resources…For us, we feel access to this community helps drive the ability to connect to that.”

Greenplum added to the sentiment that Chorus bridges the gap between the data science community and companies, stating, “What the Chorus platform gives us is a convenient way for our data scientists to start working with companies.”

According to Maxey, Greenplum had assembled a decent team of data scientists to get Greenplum’s clients started with projects on Chorus. For example, they work with a large East Coast insurance company that needs to solve problems over hundreds of workspaces, and Chorus provides an interface across which the analysts can collaborate. However, Maxey noted that Greenplum was lucky to put together such a team. Others might not be so fortunate. The hope is that Kaggle could help fill holes that Chorus users come across. “We feel that opening up this community to the enterprise customers through EMC Greenplum and through the open sourcing of Chorus, whether Greenplum is involved or not, is really going to help drive a lot of innovation in the analytics space,” Maxey said.

While the partnership with Kaggle was the centerpiece of Greenplum’s announcement that it is open sourcing Chorus, there are other benefits. Maxey likened it to Google’s Android phone platform. “One analogy we use behind the open sourcing is the concept of Android, which is the open source phone platform that Google has created. Clearly a big chunk of the Android development is done inside Google and I think Chorus will follow a similar path.” Greenplum also announced integrations with Gnip, an aggregator of social media, and Tableau, the visual analytics company that has had quite the busy week here at Strata Hadoop World.

Data science as a discipline is young: even its greatest masters have had at most two years to work with data at the scales now being thrown at companies. If that scale is only going to grow, so too must the skills of those working with the data. Together, Greenplum and Kaggle hope to cultivate that growth.

