April 1, 2013

Shared Infrastructure: Using Proven HPC Products for Big Data

An organization’s Big Data is key to customer insight and superior product design. The growth of Big Data, and its promise of crunching vast amounts of information, has been well covered in the media, and Apache Hadoop in particular is positioned as the premier technology for turning that data into insight and product advantage. Hadoop is a popular framework for developing and running large, scalable data analytics applications such as data mining or parameter studies. It is an open-source implementation of the MapReduce programming model and includes HDFS (the Hadoop Distributed File System) for high-throughput access to distributed data.
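For readers who have not seen the programming model, the following sketch is the canonical word-count example written against Hadoop’s standard Java MapReduce API. It is only an illustration of what a MapReduce application looks like; the input and output paths are placeholders supplied on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // The classic word-count job: the mapper emits (word, 1) pairs and the
    // reducer sums the counts for each word. HDFS provides the input and output.
    public class WordCount {

      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The framework takes care of splitting the input across HDFS blocks, running mappers close to the data, shuffling intermediate pairs to reducers and retrying failed tasks.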

Big Data tools like Hadoop are commonly used to process the massive volumes of unstructured data associated with social media companies such as Yahoo!, Twitter and Facebook. Typical workloads include MapReduce applications for large-scale clickstream analysis, advertising optimization and click-fraud prevention that run, at scale, around the clock.

However, a typical enterprise use case may not need to run around the clock, nor at a comparable scale. As Hadoop penetrates enterprise data centers, its ability to co-exist with other workloads becomes increasingly important. The days of building application silos on over-provisioned infrastructure are largely gone, thanks to the rapid growth of virtualization underpinning server consolidation and the more recent acceptance of private clouds enabling distribution and sharing. It is no wonder that many organizations are seeking ways to make Big Data workloads share infrastructure effectively.

Big Compute (HPC) and Big Data share common architectural constructs – for example, both typically use commodity hardware tied together in a cluster and shared among users. Leveraging capabilities proven in HPC to create a shared infrastructure is not only possible, it is becoming a recognized need. Avoiding the procurement of an expensive stand-alone Hadoop cluster saves considerable money and can accelerate the progression of new MapReduce applications from dev/test to production.

Univa Grid Engine – A Shared Infrastructure for Applications including Hadoop

Grid Engine is the industry-leading distributed resource management (DRM) system used by thousands of organizations worldwide to build large shared cluster infrastructures for processing massive volumes of workload and data. For those not familiar with Grid Engine, it manages policy-based access to a large set of shared servers (a cluster). End users – typically engineers, scientists or researchers – submit workload requests to Grid Engine, which determines the best server on which to execute each job (application) and runs it under its control.
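To make that submission model concrete, the sketch below hands a simple batch job to the cluster through DRMAA, the standard distributed-resource-management API for which Grid Engine ships a Java binding. In practice most users submit work with the qsub command line instead; /bin/sleep stands in here for a real application.

    import java.util.Collections;

    import org.ggf.drmaa.DrmaaException;
    import org.ggf.drmaa.JobInfo;
    import org.ggf.drmaa.JobTemplate;
    import org.ggf.drmaa.Session;
    import org.ggf.drmaa.SessionFactory;

    // Minimal DRMAA sketch: hand a job to Grid Engine and let it pick the host.
    public class SubmitExample {
      public static void main(String[] args) throws DrmaaException {
        Session session = SessionFactory.getFactory().getSession();
        session.init("");                                  // connect to the default cluster

        JobTemplate jt = session.createJobTemplate();
        jt.setRemoteCommand("/bin/sleep");                 // the application to run
        jt.setArgs(Collections.singletonList("60"));       // its arguments

        String jobId = session.runJob(jt);                 // Grid Engine chooses the server
        System.out.println("Submitted job " + jobId);

        JobInfo info = session.wait(jobId, Session.TIMEOUT_WAIT_FOREVER);
        if (info.hasExited()) {
          System.out.println("Job finished with exit status " + info.getExitStatus());
        }

        session.deleteJobTemplate(jt);
        session.exit();
      }
    }

The point of the model is that the submitter never names a host; Grid Engine applies its scheduling policies to choose one, which is the property the Hadoop integration described below builds on.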

Using the highly scalable and reliable Univa Grid Engine enables organizations to lower costs, produce higher-quality results, reduce time to market, and simplify the computing environment.

Univa Grid Engine Integration for Hadoop

This integration supports shared, automated Big Data workloads in a single cluster: Univa Grid Engine manages the creation (setup and tear-down) of multiple Hadoop clusters to run jobs. The instantiation of the distributed Hadoop MapReduce engine within the Grid Engine cluster is fully managed, enabling multiple applications to run in the cluster at the same time.

Benefits of Managing Hadoop With Univa Grid Engine

Managing a shared Hadoop cluster with Univa Grid Engine adds enterprise-level features and capabilities that do not otherwise exist today and would take years to develop. Doing more with less drives up utilization and efficiency, which means a lower total cost of ownership.

First, the integration creates a shared resource pool that can be used for Hadoop deployments as well as any other workload an organization chooses to run in the shared environment. By default, Hadoop assumes that all hosts are under its control and will not recognize that other workloads may already be executing on a server where it chooses to place a job (in other words, by default it does not share). A standalone Hadoop cluster therefore leads to higher costs and low utilization rates, which means valuable (and expensive) servers sit idle. Second, the integration abstracts these details from the user, so she does not need to be concerned with where an application runs or whether a job might conflict with other workloads.

“[Univa] Grid Engine certainly allows us to go live with a new component called Aggregator on Hadoop without a major investment. If we didn’t have [Univa] Grid Engine, it would be a major investment to go live on Hadoop because we would have to build a new cluster and incur all the costs around that.”

Katrina Montinola, VP Engineering, Archimedes Model

The Univa Grid Engine Integration for Hadoop enables Hadoop and many other applications to share a single cluster. While this capability is on the Apache Hadoop roadmap, it will take years to develop and bring to market a comparable, proven solution. With the Univa Grid Engine Integration for Hadoop, the following capabilities are available today:

  • Hadoop applications inherit the rich accounting and reporting features available with UniSight, an analytics and reporting product included with Univa Grid Engine.
    • MapReduce applications are tracked by resource usage over time, with insight into individual jobs, users, user groups and projects.
    • Univa Grid Engine conducts aggregated accounting across the shared resource pool for all applications.
  • Workload can be scheduled with fine-grained control. Even a cluster dedicated to Hadoop applications would benefit from Univa Grid Engine, since Hadoop’s own scheduling features are fairly primitive.
    • For example, Hadoop supports only prioritization and a fair-share policy; it does not allow the control of resource consumption per user, per group or per project. Univa Grid Engine has very rich scheduling features to fully optimize resources (see the sketch after this list).
  • Univa Grid Engine also operates while embedded in a private, hybrid or public cloud framework created by Univa Grid Engine and UniCloud, and by extension extends the ability to burst into clouds to all suitable applications.
  • Because the integration treats Hadoop applications as ‘tightly integrated’ parallel jobs, Univa Grid Engine has full control over all aspects of the Hadoop MapReduce engine.
  • The integration ensures clean termination of all processes, including workload spawned by MapReduce applications.
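As a rough illustration of the fine-grained control mentioned in the scheduling bullet above, a DRMAA job template can carry Grid Engine scheduling directives through its native specification. The driver script, the "analytics" project and the "hadoop" parallel environment name below are hypothetical stand-ins; -P, -l and -pe are standard Grid Engine submission options, and h_vmem is its usual per-slot memory limit.

    import org.ggf.drmaa.DrmaaException;
    import org.ggf.drmaa.JobTemplate;
    import org.ggf.drmaa.Session;
    import org.ggf.drmaa.SessionFactory;

    // Sketch: submit a job whose resource usage is accounted to a project and
    // bounded by a memory limit, so it can be scheduled fairly in a shared cluster.
    public class ManagedSubmit {
      public static void main(String[] args) throws DrmaaException {
        Session session = SessionFactory.getFactory().getSession();
        session.init("");

        JobTemplate jt = session.createJobTemplate();
        jt.setRemoteCommand("/opt/jobs/run-mapreduce.sh");  // hypothetical driver script
        // -P charges the job to a project, -l requests a per-slot memory limit,
        // -pe asks for 16 slots in the (hypothetical) "hadoop" parallel environment.
        jt.setNativeSpecification("-P analytics -l h_vmem=4G -pe hadoop 16");

        String jobId = session.runJob(jt);
        System.out.println("Submitted managed MapReduce job " + jobId);

        session.deleteJobTemplate(jt);
        session.exit();
      }
    }

Because the job carries a project and explicit resource requests, UniSight can attribute its usage and the scheduler can enforce per-user, per-group or per-project limits across the shared pool.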

Univa Grid Engine comes with a global customer program offering the enterprise-grade support that many of the largest clusters in the world already rely on: technical support from an experienced team with predetermined response times as low as four hours (and often much faster), plus access to continued product development and add-on products.

Where to Start

Today, Univa is the key that unlocks Big Data automation and agility in HPC-based product development. For our customers, Univa enables timely, innovative and cost-effective product design, development and delivery. The Univa Grid Engine Integration for Hadoop automates the Big Data back end of product development, helping some of the world’s most pre-eminent enterprise and research organizations speed their best innovations to market. It is currently in production at several Univa Grid Engine customers, managing thousands of cores and dozens of applications. There are clear motivations for managing Hadoop applications with Univa Grid Engine: lower cost, faster time to delivery and improved management top the list, but they only scratch the surface.

To learn more, read our technical “how-to” whitepaper, Managing MapReduce Applications with Univa Grid Engine, contact Univa directly, or visit our website at www.univa.com.

And read our case study, Reduce Hadoop Operational Cost by 50 Percent, about how Archimedes put Hadoop in production for less with Grid Engine.

Related Article:

On Demand Webinar: Have you cracked the genetic code to sharing Big Data? Watch Archimedes Inc and Univa as they discuss shared infrastructure.
