September 9, 2013

Putting the “R” Into Hadoop

Alex Woodie

With the amount of data being fed into Hadoop these days, it’s natural for customers to want to do some statistical analysis upon it. Revolution Analytics, which distributes a commercial version of the R statistical environment, says it has lowered the bar of entry for R programmers to work with Hadoop by parallelizing many of the R algorithms.

Demand is currently high for senior data scientists who can write parallel programs for Hadoop. That’s lucrative for somebody with the right skills, but an expensive proposition for an organization looking to parse its big data in new and profitable ways.

Bill Jacobs, director of product marketing at Revolution Analytics, has heard some horror stories about the “blood money” companies are paying for top Hadoop skills, particularly for people who also know statistics.

Jacobs says he heard from a CIO who was caught between a rock and a hard place. “He had to go out and pay $300,000, an ungodly amount of money, to hire a data scientist because he needed someone who could understand heavy statistics, Hadoop modeling, and all the stuff that’s easily done in R,” he says in an interview. “But he needed someone who could do it in Java on Hadoop, and that is a very, very sought after resource.”

With the launch of Revolution R Enterprise 7.0, expected later this year, Revolution Analytics will have parallelized many of the algorithms in the R library, and otherwise streamlined the capability for them to run under the Hadoop distributions from Hortonworks and Cloudera.

This will open up new big data opportunities for the population of 2 million R programmers around the world, and allow them to use their skills against Hadoop-based data, without needing to know Java, Python, MapReduce, or how to write parallel algorithms. (They will, of course, need the package from Revolution Analytics to enable this.)

“We bring the corporation that’s going into the Hadoop world a chance to tap a huge–and, particularly, modestly priced–talent base as opposed to a Ph.D Stanford bioinformatics statistician. Those are $300,000 per year resources,” Jacobs says. “Java is a lovely thing if you’re a Java programmer. But you’re a statistician. You didn’t learn Java in school. You’re an R programmer.”

Revolution Analytics isn’t giving the Hadoop treatment to all of the 4,700 or so algorithms in the CRAN library. But a good number of them will get it, Jacobs says, including algorithms for generalized linear models, logistic regression, linear regression, stepwise linear regression, and k-means clustering.

This is not the first time that Revolution Analytics has tackled the big elephant in the room. About nine months ago it released a package that enabled the R language to run against Hadoop file systems. However, it left a lot to be desired in the ease-of-use department, and required programmers to have strong Hadoop skills and to write algorithms in ways that would tolerate parallelism.
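To give a sense of what writing an algorithm "in ways that would tolerate parallelism" involves, here is a minimal sketch in Python. It is not Revolution Analytics' implementation, and the function names (`map_chunk`, `combine`, `fit_line`) are hypothetical. It assumes the simplest possible case, a one-predictor linear regression, and shows the general idea: each chunk of data is summarized independently into a few running sums, the summaries are added together, and the model is solved once from the totals.

```python
from functools import reduce


def map_chunk(chunk):
    """Map step: summarize one chunk of (x, y) pairs as five running sums."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Reduce step: summaries from two chunks simply add element-wise."""
    return tuple(p + q for p, q in zip(a, b))


def fit_line(chunks):
    """Fit y = slope * x + intercept from per-chunk summaries only."""
    n, sx, sy, sxx, sxy = reduce(combine, map(map_chunk, chunks))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Toy data on the line y = 2x + 1, split into three uneven chunks
# to stand in for blocks of a file spread across a cluster.
data = [(x, 2 * x + 1) for x in range(10)]
chunks = [data[0:3], data[3:7], data[7:10]]
slope, intercept = fit_line(chunks)
print(slope, intercept)
```

Because the per-chunk sums can be computed in any order and on any machine, the result is the same whether the data sits in one piece or many. Algorithms that need repeated passes over the data, such as k-means clustering, take more work to restructure this way.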

That approach doesn’t fly for organizations that eschew the “build it yourself” approach in favor of shrink-wrapped packages. Revolution Analytics had, in fact, already addressed those parallelism challenges in an HPC context, when it released versions of its package optimized for IBM’s Platform LSF and Microsoft’s HPC Server.

As Hadoop gained in popularity, it made sense for the company to take what it learned in the past with HPC clusters, and “morph it just enough so it fits as a rational well-behaved citizen within the Hadoop infrastructure,” Jacobs says, “and yet preserves the value that built up around R with our Revolution R Analytics platform.”

Supporting the Hadoop distributions from Hortonworks and Cloudera will certainly give R a better big data story. But the company is cognizant of the pace of change occurring in the data analytics industry right now, and is hoping to insulate itself to some extent against disruptive new technologies that will inevitably rear their heads. To that end, expect some big news in the next week or so along the lines of the Hadoop support unveiled in late August (although it sounds like it won’t be Hadoop-oriented).

“Hadoop is not the end all. We’ve seen Gen 2. What’s Gen 3?” Jacobs asks. “Being able to abstract the analytics away from the platform provides a long term investment protection for organizations that are building not only big data applications but big data skills bases. Because Hadoop isn’t going to stay still. I’ve talked to CIOs who are worried about what’s next after Hadoop as we know it and love it today.”

Technologies like YARN, MPI, Storm, and Spark are already promising to build on the big data foundation that Hadoop laid so well. But if anything remains the same, it’s the fact that things will change in ways that cannot be foreseen today.
