Cloudera Gives a Peek at Future ML Platform
Cloudera continued its evolution away from Hadoop today by announcing a technical preview for Cloudera Machine Learning, its new data science and data engineering platform that’s based on Kubernetes, which enables it to run in the cloud and on premises.
Hadoop’s influence has waned, in many respects, in direct proportion to the rise of public cloud platforms. Instead of taking the time to build and manage Hadoop clusters to store big data and run analytics on it, companies are turning to cloud providers like Amazon Web Services, which offer cheap object storage and scalable compute resources.
This changing market dynamic has helped to drive 46% year-over-year revenue growth for AWS, and Microsoft Azure and Google Cloud Platform are growing even faster. In many ways, the October merger of Cloudera and Hortonworks was a response to this dynamic. But Cloudera thinks it has an advantage that the cloud vendors can’t touch: the capability to support multi-cloud and on-premises computing in a hybrid manner.
That’s the background behind today’s announcement of Cloudera Machine Learning, which will combine data engineering and data science capabilities in a cloud-friendly package. As the company points out, the new offering will work “on any data, anywhere.”
Thanks to Kubernetes, Cloudera Machine Learning can run on Hadoop clusters that companies have installed in their own data centers. But it can also run on public and private clouds – provided they offer Kubernetes (which most do). The new offering supports data stored in HDFS and cloud object stores offered by AWS, Azure, and GCP.
Cloudera Machine Learning is separate from the company’s Hadoop distribution, CDH, but it can utilize the services of CDH in an on-premises or cloud environment, Cloudera says.
“Cloudera Machine Learning is self-contained and manages its own distributed compute, natively running workloads – including but not limited to Apache Spark – in containers on Kubernetes,” the company says. “It can also connect to an existing CDH cluster, on-premises or in the cloud, to leverage its distributed compute (e.g. Spark-on-YARN, Impala), data, or Shared Data Experience (SDX) metadata (Kerberos, HMS, Sentry, Navigator) for full enterprise security, governance, and management.”
As Hilary Mason, the general manager of machine learning for Cloudera, recently told Datanami, Cloudera is executing on a strategy to build a new platform around data science and machine learning. The strategy began with Cloudera Data Science Workbench (CDSW), which is designed to help data scientists and data engineers collaborate on the building of machine learning models. With Cloudera Machine Learning, the company gains more capability to push the learning and inference out to clusters besides Hadoop.
“We really see this as building a platform on a platform,” Mason said in an interview last week. “Specifically we see a suite of products in data science [and] machine learning…that are cloud native that are based on Kubernetes and container-based capabilities that allow for data engineering and data ingest, and that allows for data science modeling and machine learning model management.”
Cloudera has built and solidified (to some extent) the data management pieces that enterprises demand with CDH. While the cloud looms large, the company doesn’t foresee a mass exodus to the cloud for many enterprises, which are constrained by regulations to keep on-premises data centers.
With that data management piece in place, Cloudera is now turning to higher-level applications – specifically the data science and machine learning capabilities that started with CDSW, continued with today’s introduction of Cloudera Machine Learning, and will continue with as-yet unidentified products.
The Palo Alto, California company is counting on Mason – an accomplished data scientist and co-founder of Fast Forward Labs, which Cloudera acquired last year – to lead its data science strategy and the development of machine learning products. It’s a tough job, as the target is constantly moving, but right now the focus is on automating much of the grunt work that’s keeping data scientists from moving faster.
“Right now, most of our customers that we’re dealing with have perhaps tens of models and it’s a rare customer that’s dealing with more than that,” Mason says. “But if you follow where people want to go and where their investments are going, it implies that in a year or two, most customers will have tens or low hundreds of models, and some have thousands and I can imagine even tens of thousands.”
Just as DevOps emerged to streamline the management of general-purpose applications and the servers that run them, the world of machine learning will benefit from a similar level of automation in the near future.
“I think of it a lot like managing servers, and the evolution of DevOps as a discipline, in that when you have tens of servers, you can sort of do it with cobbled together, home-grown scripts and keeping track of things in spreadsheets, which is what most people do today for model management,” Mason says. “But eventually, when you hit hundreds of servers, you can no longer do that. You need tooling that’s robust that is capable of taking care of those operational needs. So I see the exact same trajectory of machine learning model management.”
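To make Mason’s analogy concrete, here is a minimal sketch in Python of the kind of bookkeeping that a spreadsheet handles at tens of models and that dedicated tooling would take over at hundreds. It is purely illustrative and does not represent Cloudera’s software or any real product; the names `ModelRecord` and `ModelRegistry`, and the fields they track, are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ModelRecord:
    """One row of the 'spreadsheet': a single registered model version."""
    name: str
    version: int
    framework: str  # e.g. "scikit-learn" or "TensorFlow"
    owner: str
    registered_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class ModelRegistry:
    """Hypothetical in-memory registry that tracks versions per model name."""

    def __init__(self):
        self._records = {}  # model name -> list of ModelRecord, oldest first

    def register(self, name, framework, owner):
        # Each new registration under the same name gets the next version number.
        versions = self._records.setdefault(name, [])
        record = ModelRecord(name, len(versions) + 1, framework, owner)
        versions.append(record)
        return record

    def latest(self, name):
        # Most recently registered version of a model.
        return self._records[name][-1]

    def count(self):
        # Number of distinct models being tracked.
        return len(self._records)
```

Even this toy version automates version numbering and lookup, two chores that tend to go wrong first when model tracking is done by hand.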
Cloudera aims to keep its future machine learning products open, in that customers will be able to obtain and run algorithms from whomever they want, including popular frameworks like scikit-learn and TensorFlow, as well as proprietary packages from third-party vendors.
And because the software runs on Kubernetes, Cloudera customers will be able to run it wherever they want, Mason points out.
“The thing that distinguishes us from the cloud vendors is our software will run anywhere,” Mason says. “It’s one set of capabilities that also make the IT group happy because it allows for compliance and governance and security, and it’s easily multi-cloud, and allows you to run on your own clusters as well, which is something you don’t currently get from the large cloud providers, so our customers are not locked in the same sense.”
Cloudera Machine Learning is a technical preview at the moment. Companies can request access to the program at this webpage. The software is expected to ship in 2019.