Univa


September 24, 2012

Managing MapReduce Applications in a Shared Infrastructure


An organization’s Big Data is key to customer insight and product design superiority. The growth of Big Data, and the crunching of vast amounts of data it enables, has been well covered in the media over the past couple of years. Apache Hadoop in particular has been positioned as the premier technology solution to help organizations create this insight and product superiority. Hadoop is an innovative and popular framework for developing and running large, scalable data analytics applications such as data mining or parameter studies. It is an open source implementation of the MapReduce programming model and includes HDFS (Hadoop Distributed File System) for high-throughput access to distributed data.
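For readers new to the MapReduce model mentioned above, the following is a minimal, single-process sketch of its three phases (map, shuffle, reduce) applied to a word count. It is purely illustrative and does not use Hadoop or its API; the function names map_phase, shuffle, reduce_phase and word_count are assumptions made for this example. In a real Hadoop job the map and reduce steps run in parallel across many servers and the framework performs the shuffle.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence (illustrative, not distributed)."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big compute", "big cluster"]))
```

Because each map call looks at one document only and each reduce call looks at one key only, the work can be spread across many servers, which is what makes the model attractive for cluster environments.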

What has been less visible for some time is the convergence of Big Compute and Big Data infrastructure. For close to 20 years various architectures have been applied to High Performance Computing, which is well established in many leading product companies or research organizations as the key to timely product innovation, development and delivery.

Common uses of Big Data tools like Hadoop include the processing of massive volumes of unstructured data, often associated with social media companies like Yahoo!, Twitter and Facebook. This includes MapReduce applications for large-scale clickstream analytics, advertising optimization and click-fraud prevention. However, the use of Hadoop to replace or augment specific bottlenecked steps in existing pipelines and workflows, such as ETL processes, is increasing, particularly within enterprises.

As Hadoop expands its penetration of the enterprise data center, certain requirements will be applied, including the need to embrace and co-exist with decades of development in existing and proven workflows and infrastructure. In organizations that have embraced high performance computing (HPC) and deployed technical computing clusters, this makes integration with those clusters a necessity. The days of building application silos using over-provisioned infrastructure are mostly long gone. With the rapid growth of virtualization underpinning server consolidation, and the more recent advent of private clouds enabling both sharing and broad ease of timely access, it is no wonder that many organizations are seeking ways to effectively tie Big Data applications to their existing Big Compute (HPC) clusters.

Since Big Compute and Big Data share common architectural constructs (for example, both commonly use commodity hardware that is tied together in a cluster and shared among users), integrating the two environments is not only possible, it is becoming commonplace. Further, avoiding the expense of procuring a stand-alone Hadoop cluster saves considerable money and can accelerate the progression of new MapReduce applications from dev/test to production.

Univa Grid Engine – A Shared Infrastructure for Applications including MapReduce

Grid Engine is the industry-leading distributed resource management (DRM) system used by thousands of organizations worldwide to build large compute cluster infrastructures for processing massive volumes of workload. For those not familiar with Grid Engine, it plays a vital role in managing policy-based access to a large set of shared servers (a cluster). End users, typically engineers, scientists or researchers, schedule application runs (workload) by submitting a request to Grid Engine, which then determines the best server for the job (application) and sends it there for execution, where it runs under Grid Engine's control. Typical applications include crash-test automation, analytics, simulation and regression analysis across broad industries like Automotive Manufacturing, Life Sciences, Semiconductor Design and Oil & Gas. Using a highly scalable and reliable DRM system like Grid Engine enables companies to produce higher-quality products, reduce time to market, and simplify the computing environment.
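The submit-and-dispatch cycle described above can be pictured with a toy model. The sketch below is not Grid Engine's scheduling algorithm, which weighs policies, priorities and many resource types; it only assumes a hypothetical Host record with a slot count and a "most free slots wins" rule, to show the idea of a central manager choosing a server on the user's behalf.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical record for one server in the shared cluster."""
    name: str
    total_slots: int
    used_slots: int = 0

    @property
    def free_slots(self):
        return self.total_slots - self.used_slots


def pick_host(hosts, slots_needed):
    """Return the host with the most free slots that can fit the job, or None."""
    candidates = [h for h in hosts if h.free_slots >= slots_needed]
    if not candidates:
        return None
    return max(candidates, key=lambda h: h.free_slots)


def dispatch(hosts, slots_needed):
    """Place a job: reserve slots on the chosen host and return its name.

    Returns None when no host can run the job right now; a real resource
    manager would keep such a job queued until capacity frees up.
    """
    host = pick_host(hosts, slots_needed)
    if host is None:
        return None
    host.used_slots += slots_needed
    return host.name


if __name__ == "__main__":
    cluster = [Host("node1", 8, used_slots=6), Host("node2", 8, used_slots=2)]
    print(dispatch(cluster, 4))  # placed on the less loaded host
    print(dispatch(cluster, 4))  # no host has 4 free slots left
```

The point of the model is that the user never names a server: the request states what the job needs, and the manager decides where it runs.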

Univa Grid Engine is the successor product to Sun Grid Engine and comes complete with a global customer program that includes more than support. Univa Grid Engine is a drop-in replacement for Sun Grid Engine and Oracle Grid Engine. In addition, it includes technical support from an experienced team with predetermined response times as low as 4 hours (and often much faster), as well as access to continued product development and add-on products.

Univa Grid Engine Integration for Hadoop

Sun Microsystems created an integration for Hadoop MapReduce with Sun Grid Engine 6.2U5 back in January 2010. However, Univa and many users found that integration exceedingly difficult to install and configure. Working with several customers, we decided to create a new and improved tight integration that is streamlined, simpler to implement and more akin to a supportable product. This new integration unifies Big Compute and Big Data workload in a single cluster in which Univa Grid Engine is configured to treat MapReduce jobs as tightly integrated parallel workload. The instantiation of the distributed Hadoop MapReduce engine in the Grid Engine cluster is fully automated.

Benefits of Managing MapReduce Applications With Univa Grid Engine

Managing MapReduce applications by integrating with Univa Grid Engine adds enterprise-level features and capabilities that simply do not otherwise exist.

First, the integration creates a shared resource pool that can be used for Hadoop deployments as well as any other workload an organization chooses to run in the shared environment. By default, Hadoop assumes that all hosts are under its control and will not recognize that other workloads may be executing on the same server where it chooses to place a job (that is, by default it doesn’t share). A standalone Hadoop cluster leads to higher costs and low utilization rates, which means valuable (and expensive) servers sit idle. Second, the integration hides placement details from the user, so that she does not need to be concerned with where an application might run or whether a job might conflict with other workloads.

“[Univa] Grid Engine certainly allows us to go live with a new component called Aggregator on Hadoop without a major investment. If we didn't have [Univa] Grid Engine, it would be a major investment to go live on Hadoop because we would have to build a new cluster and incur all the costs around that.” Katrina Montinola, VP Engineering, Archimedes Model

The Univa Grid Engine Integration for Hadoop enables the sharing of a single infrastructure with Hadoop and other applications. While this capability is on the Apache Hadoop roadmap, it will take considerable time to develop and bring to market a solution that is comparable. With the Univa Grid Engine Integration for Hadoop the following capabilities are available today:

  • MapReduce applications inherit the rich accounting and reporting features available with UniSight, an analytics and reporting product included with Univa Grid Engine.
    • MapReduce applications are tracked by resource usage over periods of time, with insight into individual jobs, users, user groups and projects.
    • Univa Grid Engine conducts aggregated accounting across the shared resource pool for all applications.
  • Workload can be scheduled with fine-grained control. Even a cluster dedicated to MapReduce applications would benefit from Univa Grid Engine since Hadoop has limited scheduling features.
    • For example, Hadoop supports only prioritization and a fair-share policy; it does not allow the control of resource consumption per user, per group or per project. Univa Grid Engine has very rich scheduling features to fully optimize resources.
  • Univa Grid Engine also operates while embedded in a private, hybrid or public cloud framework created by Univa Grid Engine and UniCloud, and by extension it extends the ability to burst into clouds to all suitable applications.
  • Since the integration treats MapReduce applications as 'tightly integrated' parallel jobs, Univa Grid Engine has full control over all aspects of the Hadoop MapReduce engine.
    • This ensures clean termination of all processes, including workload spawned by MapReduce applications.
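To make the idea of controlling resource consumption per user and per project concrete, here is a small illustrative sketch. It is not Univa Grid Engine code or configuration; the SlotQuota class, its limits and the user and project names are all hypothetical. It only shows the kind of admission check a resource manager can apply before a job, MapReduce or otherwise, is allowed to take slots from the shared pool.

```python
from collections import Counter


class SlotQuota:
    """Hypothetical per-user and per-project slot limits for a shared pool."""

    def __init__(self, per_user_limit, per_project_limit):
        self.per_user_limit = per_user_limit
        self.per_project_limit = per_project_limit
        self.user_usage = Counter()
        self.project_usage = Counter()

    def try_admit(self, user, project, slots):
        """Admit the job only if neither the user nor the project limit is exceeded."""
        if self.user_usage[user] + slots > self.per_user_limit:
            return False
        if self.project_usage[project] + slots > self.per_project_limit:
            return False
        self.user_usage[user] += slots
        self.project_usage[project] += slots
        return True

    def release(self, user, project, slots):
        """Give slots back to the pool when a job finishes."""
        self.user_usage[user] -= slots
        self.project_usage[project] -= slots


if __name__ == "__main__":
    quota = SlotQuota(per_user_limit=4, per_project_limit=6)
    print(quota.try_admit("alice", "etl", 3))  # within both limits
    print(quota.try_admit("alice", "etl", 2))  # would exceed alice's limit
    print(quota.try_admit("bob", "etl", 3))    # fills the project limit
```

Because the same check applies to every job regardless of its type, limits of this kind cover MapReduce and traditional HPC workload alike once both run through one resource manager.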

Where to Start

Today, Univa is the key that unlocks Big Data automation and agility in HPC-based product development. For our customers, Univa enables timely, innovative and cost effective product design, development and delivery. Univa Grid Engine Integration for Hadoop automates the Big Data backend of product development to speed some of the world’s most pre-eminent enterprise and research organizations’ best innovations to market. It is currently in production at several Univa Grid Engine customers. There are clear motivations for managing MapReduce Applications with Univa Grid Engine. Lower cost, time to delivery advantages and improved management top the list but really only scratch the surface.

To learn more or to read our technical “how-to” whitepaper on Managing MapReduce Applications with Univa Grid Engine, please contact Univa directly or visit our website at http://www.univa.com/resources/white-papers/integration-for-hadoop

 
