August 28, 2018

Cloudera Pivots from Zoo Animals to Data Warehousing

As an amalgamation of various open source projects, Hadoop has always been a little wild, which kept followers on the cutting edge, for better and for worse. With today’s launch of Cloudera Data Warehouse, Cloudera is taking a step away from the raucous zoo and toward the more button-down and shrink-wrapped world of data warehousing.

“A common complaint we’ve heard from our customers is they have grown beyond the zoo animals,” says Anupam Singh, GM of analytics at Cloudera. “Customers start with 100 or 200 [nodes] and they want to go to 500, and they want all of these components to work seamlessly. And we are acknowledging the reality that they are treating us more and more as a data warehouse and less and less as a set of collection of open source projects that they have to keep track of.”

Today’s announcement brings Cloudera’s strategy more closely in line with the data warehousing world. There are three main components of the announcement.

First, the company just introduced something called Cloudera Data Warehouse, which is basically a pre-canned version of the company’s Hadoop distribution configured specifically to excel at data warehousing workloads. Cloudera says it’s a new product that replaces Cloudera Analytic DB, which has been used by more than 800 customers. The company also sells pre-configured SKUs under its Enterprise Data Hub for data science and engineering and operational databases.

Secondly, Cloudera is launching a companion product for the Cloudera Data Warehouse called Workload XM that’s designed to help companies better manage their data warehouse. Short for “experience management,” Workload XM will help with troubleshooting and debugging performance issues, managing applications, and capacity planning, according to Singh. Although the product will be sold as an optional product under the Cloudera Data Warehouse umbrella, it will deliver insight into any YARN-compatible workload, including Spark, HBase, MapReduce, and other Hadoop engines.

Thirdly, the company announced the general availability of Altus Data Warehouse, which is the cloud version of that data warehousing product. The company is providing all the SQL processing goodness delivered by Impala, Kudu and Hive as a managed service on the AWS and Microsoft clouds. Altus Data Warehouse is the new name for the product that was previously called Altus Analytic DB, which has been in beta since Cloudera announced it in the fall of 2017.

While machine learning, search, and streaming analytics are all important workloads for Cloudera, nothing moves the needle for Cloudera like data warehousing. According to Singh, the workload with the biggest number of jobs on Cloudera’s data hub is SQL. “Because data warehousing is such a big part of our business, we want to address that user base first, then we will widen the scope,” he says.

As SQL-based data warehousing workloads have grown, Cloudera has changed its course somewhat over the past year and a half to adapt to the needs of its customers. “That’s one big area of investment that’s different than let’s say 18 months ago,” Singh says. “Eighteen months ago, we realized a lot of customers had trouble managing workloads.”

That spurred Cloudera to create Workload Experience Management. According to Singh, the WEM product will help administrators and developers get more out of their Cloudera cluster by giving them better insight into how Hadoop jobs ran in the past, how they are running in the present, and how they are expected to run in the future.

These insights will make it easier for Cloudera customers to run their Hadoop cluster at scale without becoming experts in all the gory technical details that accompany big and complex Hadoop implementations, Singh says.

“Over the first few years of Hadoop you had these really uber technical people who would follow every patch and everything that happened in the cluster,” Singh tells Datanami. “Early adopters are like that.”

As customers find success with their first use cases, it’s not uncommon to see Hadoop clusters that are around 100 nodes, have perhaps 100 users, and run maybe 100,000 jobs. “The next level is a massive step function, meaning now you have 500 users, 500,000 jobs and 500 to 1,000 nodes,” he continues.

“In that environment, users started complaining to us in the last 12 to 18 months saying we have to wade through logs, we have to figure out which release works with which platform,” he says. “With Workload Experience Management, an administrator … can identify the problem and fix it within 15 minutes, compared to 16 to 20 hours of log fishing, if you will.”

The challenge for Cloudera is to tighten up the loose parts of the Hadoop ecosystem that give people the most headaches while retaining the flexibility that made Hadoop so attractive for multiple big data workloads. That goes for configuring clusters for specific workloads, as well as for tuning clusters to deliver certain performance characteristics.

“There are a spectrum of users on our platform,” Singh says. “Some of them are extremely aggressive. They write jobs that could go for hours and consume all the CPU, memory, and I/O of an 800-node cluster. And then you have other users who would rather have their queries come back in half a second because [they are] writing simple queries.”

Meanwhile, the delivery of Altus Data Warehouse on August 30 should make life easier for those who don’t want the hassle of managing their own Hadoop cluster, whether on-premises or in the cloud. Singh highlighted the fact that Altus Data Warehouse customers running on AWS can directly query their S3 data, which will eliminate the need to first bring it into HDFS.

What’s more, Cloudera is delivering a hybrid solution that allows customers to move their SQL data warehousing workloads back and forth from on-prem to the cloud. Singh says Cloudera supports the movement of metadata between on-prem and Altus cloud clusters through its Backup and Disaster Recovery offering.

“You can use the BDR product to move data from on-prem, push it to the cloud, and then be able to query it through S3,” he says. “The feedback we’ve gotten is people like the story of hybrid cloud and multi-cloud. Our customers are very mature enterprise, some of them are 100 years old. For them, there is no one moment where they say, ‘Oh, I’m only going to use public cloud from now on and it will be AWS.’”

Cloudera is working on Altus for Google Cloud, but has not yet announced when it will be available. As for Microsoft’s cloud, Singh assures us that the company is investing a lot of development dollars in Azure.

Editor’s note: This story has been updated to reflect Cloudera’s current product names.