September 6, 2013

Stinger Looking to Tez to Cross 100x Performance Line for Hive

Isaac Lopez

Looking to put some real-time sting into Apache Hive, a coalition of developers announced project “Stinger” earlier this year – an effort aimed at a 100x increase in Hive’s performance. Now in phase two, the group says it has made significant progress, with more on the way.

Earlier this year, we reported on the preliminary results of the project, staffed by collaborating developers from SAP, Yahoo!, Microsoft, Twitter, Facebook and Hortonworks. At the time, the group said they had achieved 35x-45x performance improvements for common analytical queries using Hive. In a recent article on the Hortonworks web site, developer Carter Shanklin gave an update on the project, explaining that they are nearing the end of phase two and are preparing for the Tez-aided push that they expect will take them over the 100x mark.

When Hortonworks launched its HDP 2.0 beta this summer, YARN got all of the accolades and attention. However, hiding in the release behind all the YARN hoopla was a second round of Stinger improvements, which in some ways are more notable than the YARN stuff, given the pervasiveness of the Hive querying tool.

Among these additions was a preview of a new vectorized query engine, which Shanklin says makes the map stages far more efficient, boosting performance by another 5x-10x. According to Shanklin, using TPC-DS Query 95, a complex query that includes a 3-way fact table join, they were able to achieve a 60% speedup going from Hive 10 to Hive 11 – with a further 4x speedup from there in HDP 2.0 on 200 GB of data. Not bad, but it’s still far from the 100x that they’re promising. That bump, says Shanklin, will come by way of Apache Tez.

While they’ve made great progress on their initiative, Shanklin indicated that a key milestone the group is aiming for is the integration of Hive with Apache Tez. Launched into incubation at the same time as the Stinger initiative, Apache Tez is an application framework built on YARN that allows the execution of directed acyclic graphs (DAGs) of tasks. As developer Arun Murthy explained, through DAGs, Tez generalizes the MapReduce paradigm into a more powerful framework, enabling projects such as Apache Hive, Pig, and Cascading to meet requirements for human-interactive response times and extreme throughput at the petabyte scale.
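To make the DAG idea concrete, here is a minimal sketch in Python of running tasks in dependency order. It is not Tez code and does not use any Tez or Hive API; the task names (scan_sales, scan_items, join, aggregate), the sample data, and the run_dag helper are all hypothetical, chosen only to show how a query plan can be an arbitrary graph of stages rather than a fixed map-then-reduce pair.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, deps):
    """Run every task after all of its upstream tasks have finished.

    tasks: dict mapping task name -> callable taking a dict of upstream results
    deps:  dict mapping task name -> set of upstream task names
    """
    results = {}
    order = []
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        upstream = {d: results[d] for d in deps.get(name, ())}
        results[name] = tasks[name](upstream)
        order.append(name)
    return results, order


def aggregate(upstream):
    """Sum quantities per item name (hypothetical final stage)."""
    totals = {}
    for item_name, qty in upstream["join"]:
        totals[item_name] = totals.get(item_name, 0) + qty
    return totals


# Hypothetical four-stage plan: two scans feed a join, which feeds an aggregate.
TASKS = {
    "scan_sales": lambda up: [("a", 2), ("b", 3), ("a", 5)],
    "scan_items": lambda up: {"a": "apple", "b": "bread"},
    "join": lambda up: [(up["scan_items"][k], q) for k, q in up["scan_sales"]],
    "aggregate": aggregate,
}
DEPS = {
    "scan_sales": set(),
    "scan_items": set(),
    "join": {"scan_sales", "scan_items"},
    "aggregate": {"join"},
}

results, order = run_dag(TASKS, DEPS)
print(order)
print(results["aggregate"])
```

In this toy version the stages run one after another in a single process; the point is only that the whole plan is expressed as one graph, so intermediate results pass directly from stage to stage instead of being chained through separate jobs.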

Shanklin says that Tez is where they believe the threshold of the 100x performance improvement for Hive will ultimately be crossed, turning Hive into a query framework that responds more in line with “human time” (i.e., queries in the 5-30 second range) without needing to change the HiveQL interface.

While the Tez integration is not yet ready for prime time, Shanklin says that they are inching closer and expect to release this next phase of the project in beta form soon, which, of course, is welcome news for developers stuck waiting for their queries to come through while their list of discovery questions piles up.

In the meantime, we’ll continue to follow the progress being made, and look forward to hearing about how these performance improvements make a difference in future applications.

Related items:

Putting Some Real Time Sting into Hive 

Hortonworks Proposes New Hadoop Incubation Projects 

Hortonworks Levels Up With $50 Million Haul