Follow Datanami:
March 2, 2015

Accelerating Hadoop® Workflows to Yield Greater Application Efficiency

As enterprise-critical decision support fully embraces big data, confusion has grown on how to best satisfy increasing demand for ever larger data analytics. Some have questioned whether Hadoop will continue to reliably scale and serve as the primary workhorse for enterprise production level data analytics. Rising to satisfy the need for more scale, truly break-through technologies have recently removed any question mark on how to extend the useful life and scale of enterprise-critical Hadoop applications.

How is this possible? By bypassing the Hadoop Distributed File System (HDFS) and replacing it with Lustre®, the open source file system, know for providing the highest scalable I/O and powering approximately 70% of the worlds high performance computing facilities.

As many organizations utilize Hadoop and big data analytics to make informed business decisions it is important to understand the environment where HDFS is suitable and where deficiencies exists. HDFS was designed to create data locality as is documented in the HDFS Architecture Guide with the statement “Moving Computation is Cheaper than Moving Data.” This approach was optimal at the time when network bandwidth was much lower (typically 1 Gbit/sec Ethernet (GbE)) than it is now (10 GbE, 40 GbE, or even faster with InfiniBand). Another issue is that HDFS is not POSIX-compliant, which makes processing data by non-Hadoop applications typically impossible without prohibitive code re-write. This effectively eliminates the ability of HDFS to support a vast majority of existing HPC applications. However, since Lustre is POSIX-compliant, Lustre is able to provide compatibility support for both Hadoop application environments and HPC application environments accessing data on the same network and storage infrastructure.

The real problem with HDFS is not theoretical scale, it is pragmatic usable scale in time. This is due to the unacceptable amount of time required to move (ingest) data into and out of traditional HDFS-based Hadoop clusters, especially when using low-bandwidth networks. This is a form of practical diminishing returns: the larger the data set, the longer the wait before processing can begin, which inhibits the overall workflow throughput as problem size increases. In fact, the very premise of HDFS “not moving data” is somewhat of a contradiction, since large data sets must be moved into and out of the HDFS based Hadoop cluster before and after each processing run.

Lustre – High Performance Data Repository

Borrowing from the best available high performance computing (HPC) capabilities, Lustre provides the means to store, process and manage all of your enterprise critical big data within one high performance data repository that linearly scales in performance throughput directly in proportion to your problem size measured in data capacity. The key capability that Lustre provides is the means for thousands to tens of thousands of compute clients to directly access data in parallel, over low latency high speed data networks, with data sets as large as tens of petabytes (PB).

Rather than the HDFS design assumption to minimize network traffic during processing, Lustre takes the opposite approach leveraging proven state of the art supercomputing technologies that represent a true break-through on how to extend the useful life and scale of enterprise critical Hadoop applications.

Figure one shows the impact associated to the long transfer of data into HDFS (ingest) before Hadoop processing begins. Figure one also shows Lustre enables Hadoop processing to begin immediately since data is directly accessed by compute clients over low latency high speed networks.


The net-result: this means your enterprise-critical big data stays in one place within the high performance data repository, ready for use immediately with no delay and no more prolonged data movement (ingest) before and after processing. In addition to streamlined workflow and faster results, this reduces the number of secondary storage systems required to manage data transfer into and out of your Hadoop cluster, thus reducing capital cost, operating cost and management complexity.

This article is the first in a series which highlights Hadoop with Lustre, which enables nearly unlimited Hadoop scale, while providing the means to attain high reliability, sustained production levels, and compatibility with your Hadoop data analytics environment.

To learn more on how to make these gains a reality, check out Seagate’s Hadoop Workflow Accelerator which streamlines and accelerates Hadoop application efficiency and results, as well as significantly enhances flexibility and productivity of analytics workload processing and big data centralized repository management.