SQL-on-Hadoop Test: Each Engine Has ‘Sweet Spots’
Business intelligence has emerged as the top workload for Hadoop, ahead of data science and ETL. That has prompted benchmarkers to zero in on the performance of leading SQL-on-Hadoop engines for BI use cases.
AtScale, a BI-on-Hadoop specialist, released new benchmark results this week for leading SQL-on-Hadoop engines, claiming its results reveal which engines are best suited to particular BI scenarios.
Overall, the benchmark survey found that the leading engines varied “depending on the type of query, size of data and other factors.” Each engine has its own “sweet spot,” and “enterprises will find that a blended usage of all engines might fit their company’s goals best,” according to AtScale, which is based in San Mateo, Calif.
The benchmark tested workloads running on Hive, Cloudera Impala and Spark. No single engine came out ahead across the board; results depended on the scenario. For example, AtScale said Hive, which is widely viewed as the default for SQL on Hadoop, did not by itself provide the fastest performance in all scenarios.
Meanwhile, the benchmark provided another boost for Spark, which has been making significant inroads in the enterprise. AtScale found that recent upgrades to the cluster computing engine boosted performance on smaller datasets. “We were surprised to find significant performance improvements between Spark 1.5 and 1.6,” AtScale said.
Industry analysts note that increased Hadoop adoption is focused on storage and scale-out capabilities. A shift to analytical workloads on Hadoop requires a deeper understanding of SQL-on-Hadoop tools, they add, particularly as Hadoop is used to tackle BI workloads.
The benchmark tests found that each of the SQL-on-Hadoop engines is sufficiently stable to support BI workloads. Performance results varied depending on the size of datasets and the number of concurrent users.
Spark SQL and Impala performed best on smaller datasets consisting of tables with as many as several million rows of data. Meanwhile, Impala outperformed Hive and Spark SQL in concurrent user testing. Hence, AtScale said enterprises planning to connect large numbers of business intelligence users to their Hadoop platforms should consider Impala as the primary processing engine.
The benchmarker attributed the growing ability of SQL-on-Hadoop engines to handle BI workloads to flourishing open-source innovation. That level of innovation will likely grow as companies like Cloudera make good on plans to donate projects such as Impala to the Apache Software Foundation. Impala is currently listed as an Apache incubator project.
In a blog post earlier this month, Cloudera said its Impala team has boosted the engine's scale and stability, enabling deployment of Impala clusters with hundreds of nodes that run millions of queries while pushing “concurrency to thousands of users.” The team also introduced new features such as nested data types and tighter security.
Cloudera engineers also confirmed AtScale’s assertion that one engine does not fit all analytics scenarios. “Despite Impala’s significant performance lead as an analytic database, Hive and Spark SQL continue to provide important capabilities for other use cases and users alongside Impala,” Cloudera acknowledged.
The Hadoop benchmark study is available here.