Managing MapReduce Applications in a Shared Infrastructure
An organization’s Big Data is key to customer insight and product design superiority. The growth of Big Data, and the ability to crunch vast amounts of it, has been well covered in the media over the past couple of years. Apache Hadoop in particular has been positioned as the premier technology for helping organizations create that insight and product superiority. Hadoop is a popular open-source framework for developing and running large, scalable data analytics applications such as data mining or parameter studies. It implements the MapReduce programming model and includes HDFS (the Hadoop Distributed File System) for high-throughput access to distributed data.
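For readers unfamiliar with the programming model, the following is a minimal sketch of the canonical word-count job written against the standard Hadoop Java MapReduce API. It is illustrative only; the class names and the input/output paths passed on the command line are placeholders.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every word in the input split.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        ctx.write(word, ONE);
                    }
                }
            }
        }

        // Reduce phase: sum the counts collected for each word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Input and output live in HDFS; the two paths are supplied as arguments.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }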
What has been less visible is the convergence of Big Compute and Big Data infrastructure. For close to 20 years, various architectures have been applied to High Performance Computing, which is well established in many leading product companies and research organizations as the key to timely product innovation, development and delivery.
Common uses of Big Data tools like Hadoop include the processing of massive volumes of unstructured data, often associated with social media companies like Yahoo!, Twitter and Facebook. This includes MapReduce applications for large-scale clickstream analysis, advertising optimization and click-fraud prevention. However, its use in replacing or augmenting bottlenecked steps in existing pipelines and workflows, such as ETL processes, is increasing, particularly within enterprises.
As Hadoop expands its penetration of the enterprise data center, certain requirements will apply, including the need to co-exist with decades of development in existing, proven workflows and infrastructure. In organizations that have embraced high performance computing (HPC) and deployed technical computing clusters, this makes integration a necessity. The days of building application silos on over-provisioned infrastructure are mostly gone. With the rapid growth of virtualization underpinning server consolidation, and the more recent advent of private clouds enabling both sharing and timely access, it is no wonder that many organizations are seeking ways to tie Big Data applications to their existing Big Compute (HPC) clusters.
Since Big Compute and Big Data share common architectural constructs – for example, both commonly use commodity hardware tied together in a cluster and shared among users – integrating the two environments is not only possible, it is becoming commonplace. Further, avoiding the expense of procuring a stand-alone Hadoop cluster saves considerable money and can accelerate the progression of new MapReduce applications from dev/test to production.
Univa Grid Engine – A Shared Infrastructure for Applications including MapReduce
Grid Engine is the industry-leading distributed resource management (DRM) system used by thousands of organizations worldwide to build large compute cluster infrastructures for processing massive volumes of workload. For those not familiar with Grid Engine, it manages policy-based access to a large set of shared servers (a cluster). End users – typically engineers, scientists or researchers – submit their workload (application runs) to Grid Engine, which determines the best server on which to execute each job and runs it under Grid Engine’s control. Typical applications include crash-test automation, analytics, simulation and regression analysis across broad industries like Automotive Manufacturing, Life Sciences, Semiconductor Design and Oil & Gas. A highly scalable and reliable DRM system like Grid Engine enables companies to produce higher-quality products, reduce time to market, and simplify the computing environment.
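Grid Engine can also be driven programmatically through the DRMAA standard API. The sketch below is a minimal illustration of the submit-and-dispatch model, assuming the DRMAA Java binding shipped with Grid Engine; the script path is a placeholder, not a real application.

    import org.ggf.drmaa.DrmaaException;
    import org.ggf.drmaa.JobInfo;
    import org.ggf.drmaa.JobTemplate;
    import org.ggf.drmaa.Session;
    import org.ggf.drmaa.SessionFactory;

    public class SubmitJob {
        public static void main(String[] args) throws DrmaaException {
            Session session = SessionFactory.getFactory().getSession();
            session.init("");                       // connect to the local Grid Engine cell

            JobTemplate jt = session.createJobTemplate();
            jt.setRemoteCommand("/home/user/simulate.sh");   // placeholder application script

            String jobId = session.runJob(jt);      // Grid Engine picks the best server for the job
            System.out.println("Submitted job " + jobId);

            // Block until the job finishes, then report its exit status.
            JobInfo info = session.wait(jobId, Session.TIMEOUT_WAIT_FOREVER);
            System.out.println("Job " + jobId + " finished with exit status " + info.getExitStatus());

            session.deleteJobTemplate(jt);
            session.exit();
        }
    }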
Univa Grid Engine is the successor product to Sun Grid Engine and comes complete with a global customer program that goes beyond support alone. Univa Grid Engine is a drop-in replacement for Sun and Oracle Grid Engine users and includes technical support from an experienced team, with predetermined response times as low as 4 hours (and often much faster), as well as access to continued product development and add-on products.
Univa Grid Engine Integration for Hadoop
Sun Microsystems created an integration for Hadoop MapReduce with Sun Grid Engine 6.2U5 in January 2010; however, Univa and many users found that integration exceedingly difficult to install and configure. Working with several customers, we created a new and improved tight integration that is streamlined, simpler to implement and more akin to a supportable product. This integration unifies Big Compute and Big Data workload in a single cluster in which Univa Grid Engine treats MapReduce jobs as tightly integrated parallel workload. The instantiation of the distributed Hadoop MapReduce engine in the Grid Engine cluster is fully automated.
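The mechanics of the integration are covered in the technical whitepaper referenced at the end of this document. Purely to illustrate the idea of treating a MapReduce run as a tightly integrated parallel job, the hypothetical DRMAA sketch below requests a block of slots from a parallel environment; the parallel environment name "hadoop", the slot count and the wrapper script are assumptions for illustration, not the integration’s actual interface.

    import org.ggf.drmaa.DrmaaException;
    import org.ggf.drmaa.JobTemplate;
    import org.ggf.drmaa.Session;
    import org.ggf.drmaa.SessionFactory;

    public class SubmitHadoopJob {
        public static void main(String[] args) throws DrmaaException {
            Session session = SessionFactory.getFactory().getSession();
            session.init("");

            JobTemplate jt = session.createJobTemplate();
            // Hypothetical wrapper that brings up the Hadoop MapReduce engine on the
            // granted hosts and then runs the user's MapReduce job (e.g. WordCount above).
            jt.setRemoteCommand("/opt/hadoop-integration/run-mapreduce.sh");
            // Request 16 slots from a parallel environment; "hadoop" is a placeholder
            // name for whatever PE the integration configures in the cluster.
            jt.setNativeSpecification("-pe hadoop 16");

            String jobId = session.runJob(jt);
            System.out.println("MapReduce job dispatched under Grid Engine control: " + jobId);

            session.deleteJobTemplate(jt);
            session.exit();
        }
    }

In the actual integration the setup of the MapReduce engine is automated by Univa Grid Engine, so end users do not write this plumbing themselves.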
Benefits of Managing MapReduce Applications With Univa Grid Engine
Managing MapReduce applications by integrating with Univa Grid Engine adds enterprise-level features and capabilities that simply do not otherwise exist.
First, the integration creates a shared resource pool that can be used for Hadoop deployments as well as any other workload an organization chooses to run in the shared environment. By default, Hadoop assumes that all hosts are under its control and will not recognize that other workloads may already be executing on a server where it places a job (in other words, by default it doesn’t share). A standalone Hadoop cluster therefore leads to higher costs and low utilization rates, with valuable (and expensive) servers sitting idle. Second, the integration abstracts placement away from the user, so she does not need to be concerned with where an application runs or whether a job might conflict with other workloads.
“[Univa] Grid Engine certainly allows us to go live with a new component called Aggregator on Hadoop without a major investment. If we didn’t have [Univa] Grid Engine, it would be a major investment to go live on Hadoop because we would have to build a new cluster and incur all the costs around that.” Katrina Montinola, VP Engineering, Archimedes Model
The Univa Grid Engine Integration for Hadoop enables Hadoop and other applications to share a single infrastructure. While this capability is on the Apache Hadoop roadmap, a comparable solution will take considerable time to develop and bring to market. With the Univa Grid Engine Integration for Hadoop, the following capabilities are available today:
- MapReduce applications inherit the rich accounting and reporting features available with UniSight, an analytics and reporting product included with Univa Grid Engine.
- MapReduce applications can be tracked by resource usage over time, with insight into individual jobs, users, user groups and projects.
- Univa Grid Engine conducts aggregated accounting across the shared resource pool for all applications.
- Workload can be scheduled with fine-grained control. Even a cluster dedicated to MapReduce applications benefits from Univa Grid Engine, since Hadoop has limited scheduling features: it supports prioritization and a fair-share policy, but it does not allow control of resource consumption per user, per group or per project. Univa Grid Engine has very rich scheduling features to fully optimize resources.
- Univa Grid Engine also operates while embedded in a private, hybrid or public cloud framework created by Univa Grid Engine and UniCloud and, by extension, extends the ability to burst into clouds to all suitable applications.
- Since the integration treats MapReduce applications as ‘tightly integrated’ parallel jobs, Univa Grid Engine has full control over all aspects of the Hadoop MapReduce engine. This ensures clean termination of all processes, including workload spawned by MapReduce applications.
Where to Start
Today, Univa is the key that unlocks Big Data automation and agility in HPC-based product development. For our customers, Univa enables timely, innovative and cost-effective product design, development and delivery. The Univa Grid Engine Integration for Hadoop automates the Big Data backend of product development, helping some of the world’s most pre-eminent enterprise and research organizations bring their best innovations to market faster. It is currently in production at several Univa Grid Engine customer sites. There are clear motivations for managing MapReduce applications with Univa Grid Engine: lower cost, faster time to delivery and improved management top the list, but they only scratch the surface.
To learn more, or to read our technical “how-to” whitepaper on Managing MapReduce Applications with Univa Grid Engine, please contact Univa directly or visit our website at http://www.univa.com/resources/white-papers/integration-for-hadoop