The Art of Scheduling in Big Data Infrastructures Today
Arun Murthy, an architect at Hortonworks, noted in a recent article that the Hadoop community wanted to “fundamentally re-architect Hadoop … in a way where multiple types of applications can operate efficiently and predictable within the same cluster”. The starting point for doing this is YARN (Yet Another Resource Negotiator), which has the potential to “turn Hadoop from a single application system to a multi-application operating system”.
This contrasts with today’s reality of the batch-mode MapReduce for which Hadoop became popular. To save you from holding your breath until the end of this article: I believe that YARN, and similar solutions such as Mesos, run the risk of reinventing the wheel. More mature, readily available solutions already exist that address the scheduling requirements of the Hadoop ecosystem. It will take the likes of YARN or Mesos years to provide the same features and reach the same level of scalability and maturity.
Before we dive deeper into such alternatives, let’s first acknowledge that it is encouraging to see the Big Data community recognize that, while the MapReduce paradigm enables businesses to analyze Big Data effectively, it is too restrictive for a wider variety of application scenarios. This is no surprise to observers with a scientific and technical computing background, as dozens of algorithmic patterns have evolved over decades to provide optimal performance for any given type of problem and data structure. MapReduce, although it has gained popularity through applications in social media contexts, occupies a niche of the algorithmic spectrum rooted in functional programming. It is clearly not an adequate solution for every type of data processing or computing problem.
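To make the functional-programming lineage concrete, the classic word-count problem can be sketched as a map phase and a reduce phase in a few lines of plain Python. This is an illustration of the paradigm only, not Hadoop’s actual API; the function names are mine.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    # The "map" step: emit a (word, 1) pair for every word.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Stand-in for the shuffle/sort step: group pairs by key.
    pairs = sorted(pairs, key=itemgetter(0))
    # The "reduce" step: sum the counts for each key.
    return {key: sum(count for _, count in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

docs = ["big data big insight", "big cluster"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(pairs)
# counts == {'big': 3, 'cluster': 1, 'data': 1, 'insight': 1}
```

The point of the sketch is how narrow the contract is: every computation must be expressed as stateless per-record mapping followed by per-key aggregation, which is exactly why the pattern fits some problems superbly and others not at all.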
Additionally, enterprises have existing application stacks for processing data. Even if part of such a stack is well suited for MapReduce, the rest of the workflow needs to be integrated as well. Moving vast amounts of data in and out when switching application paradigms is not an option: the time required to transfer the data can easily consume all of the benefit users were hoping to gain from speeding up analytics with Hadoop.
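The data-movement argument is easy to quantify with a back-of-envelope calculation. The figures below (a 100 TB dataset, a dedicated 10 Gbit/s link) are illustrative assumptions, not numbers from any particular deployment:

```python
dataset_bytes = 100 * 10**12      # 100 TB dataset (illustrative assumption)
link_bits_per_s = 10 * 10**9      # dedicated 10 Gbit/s link (illustrative)

transfer_s = dataset_bytes * 8 / link_bits_per_s
transfer_hours = transfer_s / 3600
# roughly 22 hours just to copy the data one way,
# before any analysis has even started
```

Nearly a full day of pure transfer time, in each direction, for every paradigm switch: it is easy to see how that erases the speed-up the analytics step was supposed to deliver.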
We are looking at a situation where a pool of resources, in essence a cluster, grid, or cloud, is being utilized by different workloads, such as MapReduce applications and other, more generic applications. Those workloads, and any corresponding workflow sequence, need to be orchestrated and their resource consumption needs to be managed in line with the business goals of the organization operating the cluster. Requirements like these are well understood and are the domain of workload and resource management systems like Univa Grid Engine.
Univa Grid Engine and similar workload management tools have matured over the past two decades in technical and scientific computing, as well as in HPC, and have evolved into an essential cog in Big Data infrastructures. Today Univa Grid Engine supports an impressive and essentially open-ended spectrum of use cases. Such tools are deployed across all market sectors, typically in business- and performance-critical situations. Univa Grid Engine, in particular, supports scales exceeding 100,000 cores with massive throughput of jobs of any size. Scale and throughput are essential in Big Data environments because Big Data workloads (also known as ‘jobs’) often have comparatively short runtimes, while the amount of data to be analyzed is massive. The result is a large number of jobs to be processed on clusters that keep growing as data volumes and time-to-result requirements increase.
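The combination of short runtimes and large clusters translates directly into scheduler pressure. Taking the 100,000-core figure above and assuming, purely for illustration, an average job runtime of one minute:

```python
cores = 100_000            # cluster scale cited above
avg_job_runtime_s = 60     # illustrative short Big Data job

jobs_per_hour = cores * 3600 // avg_job_runtime_s
# 6,000,000 job completions per hour at full utilization --
# i.e. millions of scheduling decisions the workload manager
# must make every hour
```

This is the throughput regime that workload managers from the HPC world were built for, and it is why raw dispatch rate matters at least as much as any individual scheduling feature.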
You might ask, “Are those technical computing workload and resource managers suited for Hadoop?” The answer is a simple ‘yes’. Univa Grid Engine already integrates with Hadoop, is Cloudera certified, and is used in production by commercial customers such as the healthcare modeling company Archimedes. What Univa provides is exactly the functionality that YARN and Mesos are attempting to add to Hadoop.
Univa Grid Engine allows arbitrary numbers of concurrent Hadoop and non-Hadoop workloads to share the same pool of resources without conflict. It lets the various workloads be combined into workflows, while access to cluster resources is governed by automated, sophisticated, practice-derived policies and access controls: prioritizing workloads, capping resource consumption, or responding to dynamically changing demand. All this comes with proven scalability to cluster sizes far beyond what a typical Hadoop installation utilizes in the marketplace today.
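To give a feel for what policy-driven scheduling means in practice, here is a deliberately tiny sketch of a dispatcher that honors two such policies at once: job priority and a per-project resource cap. This is a toy written for this article, not Grid Engine’s actual policy engine; all names and numbers are invented.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    priority: int                       # lower value = dispatched first
    name: str = field(compare=False)
    project: str = field(compare=False)
    cores: int = field(compare=False)

def schedule(jobs, free_cores, project_caps):
    """Dispatch jobs by priority while enforcing per-project core caps."""
    heap = list(jobs)
    heapq.heapify(heap)                 # priority queue over pending jobs
    usage = {p: 0 for p in project_caps}
    dispatched = []
    while heap and free_cores > 0:
        job = heapq.heappop(heap)
        fits = job.cores <= free_cores
        under_cap = usage[job.project] + job.cores <= project_caps[job.project]
        if fits and under_cap:
            free_cores -= job.cores
            usage[job.project] += job.cores
            dispatched.append(job.name)
    return dispatched

jobs = [Job(0, "hadoop-etl", "analytics", 8),
        Job(1, "mapreduce-1", "analytics", 8),
        Job(2, "simulation", "research", 4)]
ran = schedule(jobs, free_cores=16,
               project_caps={"analytics": 8, "research": 8})
# ran == ['hadoop-etl', 'simulation']: the second analytics job is
# held back by its project's cap even though free cores remain
```

Real workload managers layer many more policies on top (fair share, reservations, preemption), but the principle is the same: resource consumption is arbitrated by rules, not by whichever workload asks first.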
It is pointless to speculate why the Hadoop community is attempting to reinvent the wheel by developing a workload and resource management module, i.e. YARN or Mesos, as part of the Hadoop architecture when solutions already exist that have been refined over the past twenty years. Users adopting these new approaches will be exposed to a poorer experience and deprived of the progress made in two decades of workload and resource management development. Workload and resource management is a state-of-the-art technology and need not be reinvented today, especially when it is already fully suited to serve the requirements of Big Data environments such as Hadoop.
This is not to say that there is no work left to be done or that no challenges remain. Data awareness in scheduling decisions is one requirement, for instance, and so is integration with the other pieces of the Hadoop ecosystem to maximize the benefit. But responding to such challenges from a mature, well-advanced starting point is far easier than recreating hyper-complex middleware from scratch.
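Data awareness, for example, amounts to one extra scoring input for a scheduler that already exists: prefer the node that holds the most of a job’s input blocks. A toy sketch (invented for this article; not a real HDFS or Grid Engine API):

```python
def pick_node(job_blocks, nodes):
    """Choose the node holding the most of the job's input blocks.

    job_blocks: set of block ids the job reads.
    nodes: mapping of node name -> set of block ids stored locally.
    """
    # Score each node by how many input blocks it already has on disk.
    return max(nodes, key=lambda name: len(job_blocks & nodes[name]))

nodes = {"node-a": {1, 2, 3}, "node-b": {3, 4}, "node-c": set()}
best = pick_node({3, 4}, nodes)
# best == 'node-b': both input blocks are local, so no network copy
```

Bolting a locality score like this onto a mature scheduler is an incremental feature; building the scheduler itself around it is the multi-year project.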
There is a reason why systems like Univa Grid Engine have made advances for two decades and why they remain at the core of the most challenging computing environments on the planet. Such systems have continued to evolve to meet the needs of the modern data infrastructure, so the wheel has already been invented and refined for the Big Data environments that will power the innovations of the future. Hadoop itself will continue to grow its footprint in enterprise computing and will help its users extract knowledge that was previously not attainable.
About Fritz Ferstl, CTO and Business Development, EMEA
Fritz Ferstl brings 20 years of grid and cloud computing experience to Univa, and as the Chief Technology Officer he will help set technical vision while spearheading strategic alliances. Fritz, long regarded as the father of Grid Engine and its forerunners Codine and GRD, ran the Grid Engine business from within Sun Microsystems and Oracle for the past 10 years, taking it from an upstart technology to the most widely deployed workload management solution in some of the most challenging data center environments on the planet. Under Fritz’s leadership Grid Engine was open sourced and has grown a vibrant community.