September 9, 2013

Putting the “R” Into Hadoop

Alex Woodie

With the amount of data being fed into Hadoop these days, it’s natural for customers to want to run statistical analysis on it. Revolution Analytics, which distributes a commercial version of the R statistical environment, says it has lowered the barrier to entry for R programmers to work with Hadoop by parallelizing many of the R algorithms.

There is currently strong demand for senior data scientists who can write parallel programs for Hadoop. That’s a lucrative thing for somebody with the right skills, but an expensive proposition for an organization looking to parse its big data in new and profitable ways.

Bill Jacobs, director of product marketing at Revolution Analytics, has heard some horror stories about the “blood money” companies are paying for top Hadoop talent, particularly people who also know statistics.

Jacobs says he heard from a CIO who was caught between a rock and a hard place. “He had to go out and pay $300,000, an ungodly amount of money, to hire a data scientist because he needed someone who could understand heavy statistics, Hadoop modeling, and all the stuff that’s easily done in R,” he says in an interview. “But he needed someone who could do it in Java on Hadoop, and that is a very, very sought after resource.”

With the launch of Revolution R Enterprise 7.0, expected later this year, Revolution Analytics will have parallelized many of the algorithms in the R library, and otherwise streamlined the capability for them to run under the Hadoop distributions from Hortonworks and Cloudera.

This will open up new big data opportunities for the population of 2 million R programmers around the world, and allow them to use their skills against Hadoop-based data, without needing to know Java, Python, MapReduce, or how to write parallel algorithms. (They will, of course, need the package from Revolution Analytics to enable this.)

“We bring the corporation that’s going into the Hadoop world a chance to tap a huge–and, particularly, modestly priced–talent base as opposed to a Ph.D Stanford bioinformatics statistician. Those are $300,000 per year resources,” Jacobs says. “Java is a lovely thing if you’re a Java programmer. But you’re a statistician. You didn’t learn Java in school. You’re an R programmer.”

Revolution Analytics isn’t giving the Hadoop treatment to all of the 4,700 or so algorithms in the CRAN library. But a good number of them are getting it, Jacobs says, including algorithms for generalized linear models, logistic regression, linear regression, stepwise linear regression, and k-means clustering.

This is not the first time that Revolution Analytics has tackled the big elephant in the room. About nine months ago it released a package that enabled the R language to run against Hadoop file systems. However, it left a lot to be desired in the ease-of-use department, and required programmers to have strong Hadoop skills and to write algorithms in ways that would tolerate parallelism.
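To give a sense of what writing an algorithm “in ways that would tolerate parallelism” involves, here is a minimal sketch of a one-variable linear regression restructured so that each chunk of data is summarized independently and the summaries are then added together. It is an illustration only, not Revolution Analytics’ implementation: the article does not describe the company’s code, Python is used purely for demonstration, and the function names (`map_chunk`, `combine`, `fit_line`) and the sample data are hypothetical.

```python
from functools import reduce

# Hypothetical illustration: least-squares fit of y = slope * x + intercept,
# organized so every chunk can be processed on its own (the "map" step)
# and the partial results merged in any order (the "reduce" step).

def map_chunk(chunk):
    """Summarize one chunk of (x, y) pairs as five partial sums."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)

def combine(a, b):
    """Merge two summaries by element-wise addition."""
    return tuple(p + q for p, q in zip(a, b))

def fit_line(chunks):
    """Combine all chunk summaries, then solve for slope and intercept."""
    n, sx, sy, sxx, sxy = reduce(combine, map(map_chunk, chunks))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Made-up sample data, split into three chunks as if stored on three nodes.
chunks = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]
slope, intercept = fit_line(chunks)
print(slope, intercept)
```

Because the summaries are simple sums, the fitted line does not depend on how the rows are split across chunks, and that property is what lets the work be spread over a cluster. Rewriting every algorithm by hand into this shape is the kind of effort the earlier package left to the programmer.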

That approach doesn’t fly for organizations that eschew the “build it yourself” approach in favor of shrink-wrapped packages. Revolution Analytics had, in fact, already addressed those parallelism challenges in an HPC context, when it released versions of its package optimized for IBM’s Platform LSF and Microsoft’s HPC Server.

As Hadoop gained in popularity, it made sense for the company to take what it learned in the past with HPC clusters, and “morph it just enough so it fits as a rational well-behaved citizen within the Hadoop infrastructure,” Jacobs says, “and yet preserves the value that built up around R with our Revolution R Analytics platform.”

Supporting the Hadoop distributions from Hortonworks and Cloudera will certainly give R a better big data story. But the company is cognizant of the pace of change occurring in the data analytics industry right now, and is hoping to insulate itself to some extent against disruptive new technologies that will inevitably rear their heads. To that end, expect some big news in the next week or so along the lines of the Hadoop support unveiled in late August (although it sounds like it won’t be Hadoop oriented).

“Hadoop is not the end all. We’ve seen Gen 2. What’s Gen 3?” Jacobs asks. “Being able to abstract the analytics away from the platform provides a long term investment protection for organizations that are building not only big data applications but big data skills bases. Because Hadoop isn’t going to stay still. I’ve talked to CIOs who are worried about what’s next after Hadoop as we know it and love it today.”

Technologies like YARN, MPI, Storm, and Spark are already promising to build on the big data foundation that Hadoop laid so well. But if anything remains the same, it’s the fact that things will change in ways that cannot be foreseen today.

