Cloudera Gives Data Scientists More Options for ML
Cloudera unleashed a collection of new software today that’s aimed at accelerating the development and deployment of machine learning programs. In addition to a new release of its Data Science Workbench that lets data scientists deploy ML models as APIs with the push of a button, it released a new iteration of its enterprise suite of software based around Apache Hadoop, Cloudera Enterprise 6.0, that offers first-class support for GPUs, among other new features.
Machine learning has always been one of the use cases that Cloudera supports with its open source software, Cloudera Distribution of Hadoop (CDH), as well as its flagship offering, Cloudera Enterprise, which now includes Apache Spark. The capability to detect patterns and anomalies in large data sets and to build business processes that operationalize them is the defining feature of “big data.”
But as much as machine learning has always been a “thing” in the Hadoop world, something changed in Cloudera’s customer base recently that’s resulted in a sudden surge in interest in machine learning. That’s according to Matt Brandwein, a product manager with the Palo Alto, California company.
“We’ve seen a dramatic uptick in data science interest over the last 18 months,” Brandwein tells Datanami. “There’s been a latent demand that we’ve, I think, finally tapped into.”
About a year ago, Cloudera launched Data Science Workbench, which gave data scientists a Web-based data science notebook for building Python and R-based machine learning models that could utilize the data and processing resources of CDH or Cloudera Enterprise clusters. The product was an instant hit, and has become Cloudera’s top-selling piece of software with hundreds of paying customers, Brandwein says.
With version 1.4, the company is adding two new features to Data Science Workbench, dubbed experiments and models, that take the product to the next level.
The experiments feature enables data scientists to try many different combinations of variables, including data sets, data features, model libraries, algorithms, hyperparameter settings, and processor type. Keeping track of all of these changes would be very difficult, especially if the data scientist is working within a larger team. Data Science Workbench makes this task easier by automatically logging all the changes and maintaining them as a knowledge base, which helps with model lineage, auditability, and collaboration.
The new model feature, meanwhile, allows a data scientist to easily deploy her experiments as APIs, without involving any data engineering resources to make that happen. This is an important point, Brandwein says, as the typical machine learning workflow involves a model being developed by the data scientist, who then hands the model to a data engineer, who then re-writes the model to make it suitable for at-scale deployments. The problem is that this process takes time, is expensive, and introduces unwanted variables into the equation.
Now Cloudera is streamlining that whole process, Brandwein says. “Anything you can express as a function, you can deploy as a service,” he says. “So whether your model is expressed as Tensorflow or PySpark or Scikit learn or pick your favorite R package — as long as you can express the scoring function as a function, we now have a button you can click in the Data Science Workbench that will expose that function as a REST endpoint.”
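As a rough illustration of what "expose a function as a REST endpoint" means, the sketch below wraps a scoring function in a small JSON-over-HTTP handler using only Python's standard library. It does not show how Data Science Workbench does this internally. The `predict` function, its hand-picked weights, the feature names, and the port are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(args):
    """Hypothetical scoring function: a hand-written linear model."""
    weights = {"petal_length": 0.5, "petal_width": 1.5}
    score = sum(weights[name] * float(args[name]) for name in weights)
    return {"score": score}


def handle_json(body, scoring_fn=predict):
    """Decode a JSON request body, call the scoring function, encode the reply."""
    try:
        result = scoring_fn(json.loads(body))
        return 200, json.dumps(result).encode("utf-8")
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON or missing/invalid features become a 400 response.
        return 400, json.dumps({"error": repr(exc)}).encode("utf-8")


class ScoringHandler(BaseHTTPRequestHandler):
    """Answers POST requests by passing the JSON body to handle_json."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_json(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Start the endpoint on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), ScoringHandler).serve_forever()
```

After calling `serve()`, a client could POST a JSON body such as `{"petal_length": 2.0, "petal_width": 1.0}` to `http://127.0.0.1:8080/` and receive a JSON score back. Keeping the scoring function separate from the HTTP plumbing mirrors the point in the quote: the data scientist writes only the function, and the wrapper is generic.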
That doesn’t mean that data scientists will never want to run their models through the data engineering team to re-write them as extensible Java code or performant C code. Some models will need that treatment to make them bullet-proof. But that doesn’t take away from the benefits of being able to quickly deploy a machine learning model as a REST API and get it out in the real world.
It’s all about giving data scientists more freedom to innovate and create, Brandwein says. “We’re trying to make it so the data scientist can more rapidly deploy more models without the time-intensive, error-prone re-coding that typically goes into the process,” he says. “We’re trying to encourage experimentation and rapid prototyping and that is what this gives data scientists: a push-button way to expose their models to business partners, application developers, dashboard builders and the like.”
The addition of the API deployment model gives the product a fourth main deployment pattern for machine learning models built using Data Science Workbench. It previously supported three: batch, interactive, and streaming environments. From the sound of it, Cloudera plans to deliver a fifth deployment pattern — delivery to edge devices — in a future release.
The company also today unveiled Cloudera Enterprise 6, a new release of its flagship distribution of software based around Apache Hadoop. This release brings Apache Hadoop version 3 into the paying Cloudera customer base, which means features like erasure coding and support for GPUs with YARN are now part of the package. It also updates all of the accompanying big data projects that make up Cloudera’s distribution, including Solr 7, which Cloudera customers have been looking forward to, Brandwein says.
GPU support will bolster machine learning workloads, because many of the ML libraries that customers are using, such as TensorFlow, run faster on GPUs. “YARN can now schedule GPUs in a heterogeneous compute cluster, so we can now start to do distributed training outside of the Data Science Workbench environment, actually in the Cloudera cluster itself,” Brandwein says. Customers will be able to train machine learning models five to 10 times faster with GPUs, Cloudera claims. This release also makes Apache Spark and Apache Kafka standard components of the suite, according to Cloudera’s blog.
Lastly, Cloudera made a major improvement to Altus Data Engineering, the cloud-based offering that’s geared toward helping data engineers build data pipelines and automate data transformation tasks. This offering now runs on Microsoft Azure, in addition to the AWS cloud.
The company made these announcements at its annual Strata Data Conference in London, UK.