YARN to Spin Hadoop into Big Data Operating System
Hadoop is about to see a fundamental reset in its base functionality, according to Arun Murthy, architect with Hortonworks and the Apache Software Foundation, who says that SQL in Hadoop via YARN is part of the core of this metamorphosis.
While Hadoop has been garnering plenty of attention for its potential around the enterprise, one of its chief weaknesses has been that it was originally designed as a single-application system, namely the batch-oriented MapReduce. For a system that was developed and grown specifically for web-scale data by the likes of Yahoo! and Facebook, this made sense at one point in time. However, new trends and enterprise demands are emerging that are changing the paradigm.
One of these fundamental trends is that enterprises now view “big data” as “all their data,” not just specific, narrow aspects of it. Firms are looking for ways to break down the data silos in their organizations and bring the data together in one central place where it can be accessed. Centrally storing large amounts of data is, of course, something that Hadoop is strong at. Once the data is there, however, bottlenecks can crop up as business analysts compete against each other for cluster resources.
Tools and other capabilities have been designed and implemented to address these potential limitations of Hadoop, including vendor tools such as Platfora, as well as well-known projects such as Hive, Pig, and HBase. However, says Murthy, the YARN project is about opening up the entire framework for use cases that were previously not possible.
“When we set out to build Hadoop 2.0, we wanted to fundamentally re-architect Hadoop to be able to run multiple applications against relevant data sets,” writes Murthy. “And do so in a way where multiple types of applications can operate efficiently and predictably within the same cluster – this is really the reason behind Apache YARN, which is foundational to Hadoop 2.0. By managing the resource requests across a cluster, YARN turns Hadoop from a single application system to a multi-application operating system.”
Earlier this year, Hortonworks CTO Eric Baldeschwieler echoed this sentiment, telling an audience that extensibility is a chief focus of the Hadoop 2.0 initiative and referencing YARN as a key foundation of the reworked framework.
According to Cloudera, YARN, which the company says is an acronym for “Yet Another Resource Negotiator,” is a framework that facilitates the writing of arbitrary distributed processing frameworks and applications.
“YARN provides the daemons and APIs necessary to develop generic distributed applications of any kind, handles and schedules resource requests (such as memory and CPU) from such applications, and supervises their execution,” says Harsh Chouraria with Cloudera, who adds that this means YARN can run applications that do not follow the MapReduce model.
This opens up Hadoop to a whole new paradigm of usage. Where before it could be considered a central storage place where you could run batch analytics, YARN essentially opens the framework up to being a big data operating system of sorts, where multiple applications can be running simultaneously. That covers everything from machine learning to real-time event processing, data modeling, and more.
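To make the idea of negotiating resources among several applications concrete, here is a minimal toy sketch in Python. It is not YARN's actual API or scheduling algorithm: the class names (`Node`, `Request`, `ToyNegotiator`), the first-fit placement rule, and the node and application figures are all assumptions made up for illustration. It only shows the general shape of the problem, in which different kinds of applications ask for memory and CPU and a central scheduler decides which requests the cluster can satisfy.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical cluster node with free memory (MB) and CPU cores."""
    name: str
    free_mb: int
    free_vcores: int


@dataclass
class Request:
    """A hypothetical resource request made by one application."""
    app: str
    mb: int
    vcores: int


@dataclass
class ToyNegotiator:
    """Toy central scheduler: first-fit placement, purely illustrative."""
    nodes: list
    pending: list = field(default_factory=list)
    granted: list = field(default_factory=list)

    def submit(self, request):
        # Queue the request until the next scheduling pass.
        self.pending.append(request)

    def schedule(self):
        still_waiting = []
        for req in self.pending:
            # Pick the first node with enough free memory and cores.
            node = next(
                (n for n in self.nodes
                 if n.free_mb >= req.mb and n.free_vcores >= req.vcores),
                None,
            )
            if node is None:
                # Nothing fits right now, so the request keeps waiting.
                still_waiting.append(req)
                continue
            node.free_mb -= req.mb
            node.free_vcores -= req.vcores
            self.granted.append((req.app, node.name))
        self.pending = still_waiting
        return self.granted


# Three different kinds of applications share one small, made-up cluster.
negotiator = ToyNegotiator([Node("node-1", 4096, 4), Node("node-2", 2048, 2)])
negotiator.submit(Request("mapreduce-batch", 3072, 2))
negotiator.submit(Request("stream-processor", 2048, 2))
negotiator.submit(Request("ml-training", 4096, 4))
placements = negotiator.schedule()
waiting = [r.app for r in negotiator.pending]
```

In this made-up run, the batch and streaming jobs are placed on different nodes, while the machine learning job waits because no single node has enough free capacity left. A real resource manager is far more involved, but the sketch shows why a shared scheduler matters once the cluster is no longer running a single kind of job.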
So while Hadoop has been virtually synonymous with MapReduce, it’s about to see what promises to be a fundamentally game-changing shift. These new capabilities are due for release this summer, says Murthy, as part of the Hadoop 2.0 roll-out.