November 19, 2015

Cloudera Targets Hadoop SQL Workloads with CDH 5.5

Cloudera is aiming to improve how SQL workloads run on Hadoop with today’s release of Cloudera Enterprise 5.5, which brings support for Spark SQL, support for JSON data types in Impala, better security on Impala and Hive, and the beta of a new SQL workload optimization tool.

SQL has been the lingua franca for accessing and manipulating data within databases for decades, so it should come as no surprise that SQL is big on Hadoop, even though Hadoop is not a database, per se. The prevalence of SQL skills and SQL interfaces in existing products makes it a logical choice to use in Hadoop, even if it can’t do everything. That’s why you have machine learning and graph analytics engines, too.

Cloudera has put a lot of time and money into developing Impala, which essentially provides a Hadoop-based implementation of the type of powerful SQL engines found in massively parallel processing (MPP) databases from the likes of Teradata, Greenplum, and Netezza. The fact that Cloudera is now contributing Impala to the Apache Software Foundation (ASF) shows that it’s serious about driving adoption.

The importance of Impala to Cloudera is clear, which is why you see Cloudera adding support for nested data types, such as JSON, to Impala. But not all SQL engines are equal, which is why Cloudera is also bringing support for Spark SQL to the platform, along with Spark’s MLlib machine learning library. That’s also why you see the company continuing to improve on Hive, the Hadoop project’s original SQL engine. (Hive and Impala both get column-level access controls with CDH 5.5.)

With the launch of Cloudera Enterprise 5.5 (which includes CDH 5.5, Cloudera Manager 5.5, and Cloudera Navigator 2.4), Cloudera is making an extra effort to help users understand which tools are best suited for which workloads. “Hadoop doesn’t need to limit users to one tool that does everything,” says Cloudera product marketing manager Alexandra Gutow. “In fact we discourage that. One tool is never going to do everything well.”

Having so many SQL tools makes figuring out which one to use somewhat difficult. With today’s release of Cloudera Enterprise 5.5, Cloudera is introducing a beta of a new service, called Cloudera Navigator Optimizer, that’s designed to help customers gain a greater understanding of SQL workloads running on other systems, and of which SQL engine to use if they move those workloads to Hadoop.

Gutow describes Navigator Optimizer as a cloud-based service that generates optimization strategies for Hadoop. Users upload SQL logs from other applications into the tool, and the software, which is based on technology Cloudera obtained in its acquisition of Xplain.io, identifies inefficiencies in the workloads.

Several CDH customers participated in a closed alpha of Navigator Optimizer that generated some interesting usage patterns. For instance, ETL workloads tended to run in the wee morning hours, followed by traditional BI queries. After noon, the pattern featured heavy ad-hoc queries by data analysts and data scientists against EDWs, while complex hand-written queries often dominated the hours before midnight.

“We have a lot of customers who are looking to get started with Hadoop and right now there’s not a lot of visibility into what the existing workloads are in the system,” Gutow says. “This has been what’s driving the Navigator Optimizer tool to build workload optimization strategies to address this.”

While it’s not a “query optimizer” in the classic sense, Navigator Optimizer can help a company devise a plan for migrating some workloads, such as ETL and ad-hoc query processing, off “legacy” systems and onto new Hadoop clusters.

“It will identify where the complexities may lie…and ultimately provide the recommendations for which of the workloads are going to run the best, consume the least development time, and going to give you the best results for Hadoop,” Gutow says.
