

September 09, 2013

Putting the "R" Into Hadoop


With the amount of data being fed into Hadoop these days, it’s natural for customers to want to run statistical analysis on it. Revolution Analytics, which distributes a commercial version of the R statistical environment, says it has lowered the barrier to entry for R programmers who want to work with Hadoop by parallelizing many of the R algorithms.

There is currently strong demand for senior data scientists who can write parallel programs for Hadoop. That’s lucrative for somebody with the right skills, but an expensive proposition for an organization looking to parse its big data in new and profitable ways.

Bill Jacobs, director of product marketing at Revolution Analytics, has heard some horror stories about the “blood money” companies are paying for top Hadoop skills, particularly for people who also know statistics.

Jacobs says he heard from a CIO who was caught between a rock and a hard place. “He had to go out and pay $300,000, an ungodly amount of money, to hire a data scientist because he needed someone who could understand heavy statistics, Hadoop modeling, and all the stuff that’s easily done in R,” he says in an interview. “But he needed someone who could do it in Java on Hadoop, and that is a very, very sought after resource.”

With the launch of Revolution R Enterprise 7.0, expected later this year, Revolution Analytics will have parallelized many of the algorithms in the R library and made it easier to run them under the Hadoop distributions from Hortonworks and Cloudera.

This will open up new big data opportunities for the population of 2 million R programmers around the world, and allow them to use their skills against Hadoop-based data, without needing to know Java, Python, MapReduce, or how to write parallel algorithms. (They will, of course, need the package from Revolution Analytics to enable this.)

“We bring the corporation that's going into the Hadoop world a chance to tap a huge--and, particularly, modestly priced--talent base as opposed to a Ph.D Stanford bioinformatics statistician. Those are $300,000 per year resources,” Jacobs says. “Java is a lovely thing if you're a Java programmer. But you're a statistician. You didn’t learn Java in school. You’re an R programmer.”

Revolution Analytics isn't giving the Hadoop treatment to all of the 4,700 or so algorithms in the CRAN library. But a good number of them are getting it, Jacobs says, including generalized linear models, logistic regression, linear regression, stepwise linear regression, and k-means clustering.

This is not the first time that Revolution Analytics has tackled the big elephant in the room. About nine months ago it released a package that enabled the R language to run against Hadoop file systems. However, it left a lot to be desired in the ease-of-use department, and required programmers to have strong Hadoop skills and to write algorithms in ways that would tolerate parallelism.
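To give a sense of what writing an algorithm "in ways that would tolerate parallelism" can involve, here is a minimal sketch in Python of a simple linear regression restructured as a map step and a reduce step. It is purely illustrative and is not Revolution Analytics' implementation or API: the function names `map_chunk`, `combine`, and `fit_line` are hypothetical, and the "cluster" is just a list of in-memory chunks. The idea is that each chunk of data is summarized independently as a handful of partial sums, and those sums can be added together in any order before the final fit.

```python
from functools import reduce

def map_chunk(chunk):
    """Summarize one chunk of (x, y) pairs as five partial sums."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in chunk:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def combine(a, b):
    """Merge two summaries; addition is order-independent, so chunks
    can be processed on different workers and merged in any order."""
    return tuple(p + q for p, q in zip(a, b))

def fit_line(chunks):
    """Least-squares slope and intercept from the merged summaries."""
    n, sx, sy, sxx, sxy = reduce(combine, map(map_chunk, chunks))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Hypothetical data: points on y = 2x + 1, split across three "nodes".
chunks = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
    [(4.0, 9.0)],
]
slope, intercept = fit_line(chunks)
print(slope, intercept)
```

The statistician's usual one-line call to a regression function hides none of this bookkeeping from a programmer who has to write it by hand, which is the kind of work the earlier package left to the user.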

That approach doesn’t fly for organizations that eschew the “build it yourself” approach in favor of shrink-wrapped packages. Revolution Analytics had, in fact, already addressed those parallelism challenges in an HPC context, when it released versions of its package optimized for IBM’s Platform LSF and Microsoft’s HPC Server.

As Hadoop gained in popularity, it made sense for the company to take what it learned in the past with HPC clusters, and “morph it just enough so it fits as a rational well-behaved citizen within the Hadoop infrastructure,” Jacobs says, “and yet preserves the value that built up around R with our Revolution R Analytics platform.”

Supporting the Hadoop distributions from Hortonworks and Cloudera will certainly give R a better big data story. But the company is cognizant of the pace of change occurring in the data analytics industry right now, and is hoping to insulate itself to some extent against disruptive new technologies that will inevitably rear their heads. To that end, expect some big news in the next week or so along the lines of the Hadoop support unveiled in late August (although it sounds like it won’t be Hadoop oriented).

“Hadoop is not the end all. We’ve seen Gen 2. What's Gen 3?” Jacobs asks. “Being able to abstract the analytics away from the platform provides a long term investment protection for organizations that are building not only big data applications but big data skills bases. Because Hadoop isn’t going to stay still. I’ve talked to CIOs who are worried about what’s next after Hadoop as we know it and love it today.”

Technologies like YARN, MPI, Storm, and Spark are already promising to build on the big data foundation that Hadoop laid so well. But if anything remains the same, it’s the fact that things will change in ways that cannot be foreseen today.

