August 6, 2020

R Works Its Way Into Qubole’s Data Lake


Python may be the most popular language for data science today, but R isn’t far behind. In fact, R is experiencing a bit of a resurgence at the moment. That’s part of the reason that Qubole today announced that it has expanded its data lake offering with commercial and open source R software from RStudio.

Companies signing up for the hosted Qubole data lake offering will now get access to R products from RStudio, including RStudio Server Pro, the commercial integrated development environment (IDE) for developing R applications, as well as sparklyr (pronounced “sparkly R”), the open source R package for working with data in Apache Spark.

Qubole and RStudio have done the hard work to integrate the R offerings into Qubole’s data lake, which brings a range of managed compute engines for data engineering, data science, and advanced analytic use cases running atop cloud object stores on AWS, Google Cloud, Microsoft Azure, and Oracle Cloud.

The idea is to make it simple for data scientists to get up and running with fast and secure R environments in Qubole, without requiring them to get their hands dirty with the technical details, says Mohit Bhatnagar, SVP of products at Qubole.

RStudio Pro is an IDE for developing data analysis scripts, reports, graphs, and interactive Web applications using R and Python

“We not only did a first-class partnership, but we’re also providing R as a single-button access,” Bhatnagar says. “The ease of use is the key differentiator.”

The partnership with Qubole gives RStudio a place for R users to run R workloads in a scalable big data environment, says Lou Bajuk, the director of product marketing with RStudio.

“A lot of the challenge that we help customers tackle are how do you take this small project, a small effort you’ve done internally, and scale it out,” he tells Datanami. “You see that in terms of the Qubole integration. They start analyzing data in R. They have this massive data they want to access in the Qubole data lake. We worked with Qubole to make that easy.”

The two companies worked out what amounts to an OEM deal that gives Qubole users direct access to RStudio Server Pro within the Qubole environment, at no additional cost. Bajuk says it’s “super easy” to spin up RStudio Server Pro and start working with the R IDE from a Web browser.

On the open source side, RStudio initially developed the sparklyr package about five years ago to enable R to work with data stored in Spark DataFrames. The project has been successful, but that doesn’t necessarily mean that getting the environment set up is easy.

RStudio converted into a public benefit corporation in January 2020

“Sometimes [R] jobs are pretty easy to parallelize, and the hard part is accessing the cluster,” Bajuk says. “That goes back to our work on Kubernetes and Slurm [and] how do we make it super easy for a data scientist to be able [to] access all these computational resources.”

R is experiencing a resurgence at the moment following a drop in popularity in 2018 and 2019. That’s according to metrics like the TIOBE Index, which last month announced that R jumped from being the 20th most popular language in July 2019 to number 8 on the list in July 2020, a staggering leap of 12 places.

Python clearly has become the dominant language for data science. But according to the folks at RStudio, that popularity hasn’t necessarily come at the expense of R, which for years was the most popular open source language for statistical computing before Python began to pull away from the pack about five years ago.

Interestingly, the folks at Qubole began looking into support for R near the end of 2019, which arguably is before R’s current resurgence began (some attribute the sudden growth in R’s popularity to COVID-19 and the spike in demand for an open statistical language to investigate data related to the novel coronavirus and its impact on society).

“I did not want to join the dichotomy of Python vs R. We were already supporting Python. Everyone is using Python,” Qubole’s Bhatnagar says. “What happens is, when we went and talked to a bunch of our customers–we interviewed 46 people–it became obvious that they’re going to continue to use it [R]. And in fact, the dominant use case for this customer base was Python plus R.”

Companies that use R (image courtesy Qubole)

The survey concluded that many major companies continue to use R, which some consider a more mature language for statistical work than Python. Some of these R users were concerned that they would be forced to move their R work into Python, which would entail a lot of work. That concern motivated Qubole to ensure that R would be a first-class citizen in its cloud.

But there are some sticking points around R, including the difficulty of configuring the environment. There was also a concern about what happens to R work when a machine goes down, which Qubole addressed by building automatic persistence into the offering.

With the news of R’s demise swirling about, Qubole kept its eye on the ball and followed through with its commitment.

“We did hear news about R. But then we went and did an analysis, which 9 months back, led the boards of companies to say, we are actually going to embark upon this project,” he says. “The fact that there’s a resurgence [in R] — we like it, and in that sense it validates the decision we made.”
