The Future of Hadoop Runs on Tez, Hortonworks Says
The Hadoop community has spent much energy over the past two years trying to make Hadoop faster, simpler to program, and easier to extend to other systems. While the introduction of YARN in Hadoop version 2 helped to unhook the framework from its MapReduce roots, the folks at Hortonworks say the next step of the Hadoop journey will ride atop the Apache Tez engine.
Apache Tez is composed of a data processing engine that sits atop YARN and a library of APIs for developers to tap into. The open source software was originally designed as part of the Hortonworks-backed Stinger Initiative to improve the SQL performance of Apache Hive by unhooking it from MapReduce and thereby freeing it from the dependencies and other restrictions that MapReduce imposed on Hive.
Bikas Saha, a Hortonworks engineer and a member of the Apache Tez incubator project, explained to Datanami some of the technical problems that relate to developing and running complex queries for Hive and Pig within the current MapReduce paradigm.
“There’s an impedance mismatch between expressing what you want to compute and what expressive capabilities MapReduce gives you, so developers had to change their query plans and change the way they execute so that it fits the MapReduce model,” Saha explains. “And that was a source of performance loss because they were not executing the queries in the most natural and elegant manner possible.”
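To make the mismatch Saha describes concrete, here is a minimal sketch under simplifying assumptions. It models a query plan as a linear list of stages and splits it into MapReduce jobs, assuming each job can hold at most one shuffle (one map phase followed by one reduce phase). The `Stage` class, the `mapreduce_jobs` helper, and the sample plan are hypothetical illustrations, not part of Hive, MapReduce, or the Tez API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    needs_shuffle: bool  # True if the stage must regroup data by key first


def mapreduce_jobs(plan):
    """Split a linear plan into jobs that each contain at most one shuffle."""
    jobs, current = [], []
    for stage in plan:
        # A second shuffle cannot fit into the same map-then-reduce job,
        # so it forces a new job to start.
        if stage.needs_shuffle and any(s.needs_shuffle for s in current):
            jobs.append(current)
            current = []
        current.append(stage)
    if current:
        jobs.append(current)
    return jobs


# Hypothetical plan: a scan followed by three operators that each need a shuffle.
plan = [
    Stage("scan", False),
    Stage("join", True),
    Stage("group_by", True),
    Stage("order_by", True),
]

jobs = mapreduce_jobs(plan)
# Each boundary between jobs means the intermediate result is written out
# to the distributed file system and read back by the next job.
intermediate_writes = len(jobs) - 1
print(len(jobs), intermediate_writes)
```

In this toy model, the four-stage plan becomes three separate MapReduce jobs with two intermediate writes between them. An engine that runs the plan as a single graph of connected stages, which is the approach Tez takes, would not need those job boundaries.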
The second problem relates to the extensibility of MapReduce and the capability to build plug-ins. “Given that MapReduce was an application by itself and not designed to be a framework, making performance improvements in MapReduce and changing how MapReduce interacts with the rest of the data is hard,” Saha says. “Say if you want to create an RDMA transport layer, which is much more efficient than reading data over from the disk. In order to make those kinds of improvements and those kinds of plugins, you had to go and change a lot of MapReduce code. Because MapReduce was an end-to-end monolith which knew everything about itself, plugging in stuff was really difficult.”
The third problem relates to optimizing code. “MapReduce was a very optimized solution for what it did–mapping, then reducing,” Saha says. “It was very optimized to the extent that it was very hard to add any other optimizations. Things that people were struggling with included, ‘How can I figure out the correct number of mappers based on the size of the cluster?’ or ‘How do I figure out the correct number of reducers based on the actual data I am moving, not on what I assumed would be there?’”
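The reducer question Saha raises can be illustrated with a small sketch. It picks a reducer count from the amount of data actually observed rather than from a planning-time estimate. The function name, the 256 MB-per-reducer target, and the cap of 1,000 reducers are assumptions chosen for illustration; they are not Tez or MapReduce defaults.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def choose_reducers(observed_bytes, bytes_per_reducer=256 * MB, max_reducers=1000):
    """Pick a reducer count from the data volume actually seen (hypothetical helper)."""
    if observed_bytes <= 0:
        return 1
    wanted = math.ceil(observed_bytes / bytes_per_reducer)
    # Keep the result between 1 and the configured ceiling.
    return max(1, min(max_reducers, wanted))


# Planning-time guess: the query will shuffle about 100 GB.
planned = choose_reducers(100 * GB)
# At run time, filters turn out to be selective and only 3 GB is shuffled.
actual = choose_reducers(3 * GB)
print(planned, actual)
```

With these assumed numbers, the planning-time estimate would ask for 400 reducers, while the observed data volume justifies only 12. Deciding at run time avoids launching hundreds of nearly idle tasks.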
Instead of requiring developers to expend a lot of effort hand-coding these changes into MapReduce, the Tez framework automates this work. “They should be able to plug in whatever kind of data movement or transfer methodologies that they have without having to change the internals of the system,” Saha says. “And the system by itself will provide a lot of these capabilities to plug in advanced optimizations that you can perform at runtime, so that your queries or data processing work as efficiently and as fast as possible.”
For a detailed technical description of Tez, see Saha’s blog post.
Bringing Tez to Market
When Hortonworks first started talking about Tez in early 2013, there was an expectation that Tez would improve Hive’s SQL performance on petabyte-size databases by a factor of 100. That estimate still stands, says vice president of corporate strategy Shaun Connolly.
“Our approach when we started the Stinger Initiative was, we didn’t believe a net new system had to be created from scratch to deliver high-performing interactive SQL,” he tells Datanami, clearly referring to Cloudera’s Impala. “We firmly believe that we could innovate at the lowest level, a la Tez, as well as optimize Hive and its storage format, a la the ORC, the optimized RC file work that’s happened in Hive.”
Just before Christmas, Hortonworks released some test results as part of its Stinger Phase 3 tech preview, which included the latest release of Tez, version 0.2.0, as well as the latest release of Hive, version 0.12. The tests showed Tez-powered Hive running queries in the 8-10 second range that previously took 1,400 seconds to complete with MapReduce-powered Hive.
[Figure: Within YARN, Tez provides a more fine-grained approach to task management than traditional MapReduce]
The Apache Pig team has started work on Tez, and is seeing similar benefits. “When they ran it on Tez…they got a significant performance improvement” without doing any tuning, Saha says. Those Tez-powered Pig results are expected to be shared at the next meeting, he says.
Thanks to its design, Tez could flourish as a general-purpose execution layer, explains Saha. “Tez is exposing a new set of APIs and libraries and an execution framework that a bunch of different higher-level applications can use to improve the performance as well as the efficiency of the domain that they’re trying to solve,” he says. “Instead of Tez being a solution for a particular thing, it’s the next generation data processing layer on top of which people will build domain specific applications.”
The Cascading group is also exploring Tez, Saha says. But in the end, there will be multiple applications running atop Tez, he says. “Everybody should be able to benefit from the same shared APIs and shared engine that Tez has, with the added advantage that Tez has been designed from the ground up to also solve the problem that Hadoop is hard to configure.”
Connolly predicts Tez–which is expected to ship later this quarter in Hortonworks Data Platform (HDP) version 2.1–will have an impact on the entire Hadoop community. “It’s landing and it’s real. It’s not just roadmap vision,” Connolly says. “We expect Tez to become a common core component across all the vendors. It’s going to be that important, we feel.”