April 12, 2016

LinkedIn Diagnostics Help Tune Hadoop Jobs

An open source tool recently released by LinkedIn developers is intended to help Hadoop and Spark users analyze, tune and improve the performance of their workflows.

The self-service performance-tuning tool for Hadoop dubbed “Dr. Elephant” is designed to automatically gather performance metrics. It then runs an analysis on these data and presents them in a simple format. The goal is to improve developer productivity while increasing cluster efficiency by making it easier to tune Hadoop jobs, LinkedIn (NYSE: LNKD) said.

Dr. Elephant is predicated on the notion that while components such as the underlying hardware, networking, operating system and other software can be optimized, “only users have control over optimizing the jobs that run on the cluster,” the company said.

The first hurdle is figuring out how jobs are executed and what resources they consume. Complicating matters, the information required to assess a job is distributed across many systems.

LinkedIn stressed that it has a large Hadoop cluster as well as a growing user community. It runs about 100,000 Hadoop and Spark jobs each day. As data volumes grow, more build metrics and analytics are run on both platforms. Analysts handle business reporting on a variety of frameworks that include Hive, MapReduce, Pig and others.

“It is more important to optimize for the time of these people than just for the hardware resources they use,” LinkedIn noted.

Early attempts to train employees using different frameworks to run their Hadoop jobs proved inefficient and did not scale as the number of Hadoop users grew. Hence, there was a growing need to standardize and automate the process of training users to optimize their jobs.

The result was Dr. Elephant, LinkedIn’s attempt to optimize both Hadoop developer and user time. The tool automatically gathers the metrics, runs the related analyses and presents the results in an easy-to-understand format. The goal was “to improve developer productivity and increase cluster efficiency by making it easier to tune the jobs,” the company said.

The tool works by pulling together a list of all recent failed and successful applications via a YARN resource manager. Once all metadata is collected, Dr. Elephant runs a set of heuristic techniques on the performance data and generates a diagnostic report describing overall job performance. Potential performance problems also are tagged.
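The heuristic step described above can be illustrated with a minimal Python sketch. It is not Dr. Elephant's code: the names (JobMetrics, data_skew, long_tasks, diagnose), the severity scale and every threshold value are assumptions made up for illustration, and the metrics are supplied by hand rather than fetched from a YARN resource manager.

```python
from dataclasses import dataclass
from enum import IntEnum
from statistics import mean, median


class Severity(IntEnum):
    """Hypothetical severity scale used to tag potential problems."""
    NONE = 0
    LOW = 1
    MODERATE = 2
    SEVERE = 3
    CRITICAL = 4


@dataclass
class JobMetrics:
    """Hypothetical per-job metrics; a real tool would collect these from the cluster."""
    job_id: str
    task_input_bytes: list[int]
    task_runtimes_sec: list[float]


def severity_for(value, thresholds):
    """Map a value to the highest severity whose (ascending) threshold it reaches."""
    level = Severity.NONE
    levels = (Severity.LOW, Severity.MODERATE, Severity.SEVERE, Severity.CRITICAL)
    for candidate, limit in zip(levels, thresholds):
        if value >= limit:
            level = candidate
    return level


def data_skew(job, thresholds=(2.0, 4.0, 8.0, 16.0)):
    """Flag jobs where one task reads far more input than the median task."""
    typical = median(job.task_input_bytes)
    ratio = max(job.task_input_bytes) / typical if typical > 0 else 0.0
    return severity_for(ratio, thresholds)


def long_tasks(job, thresholds=(300, 600, 1200, 2400)):
    """Flag jobs whose average task runtime (seconds) is high."""
    return severity_for(mean(job.task_runtimes_sec), thresholds)


# Rules are looked up by name, so a rule can be added or retuned in one place.
HEURISTICS = {"data_skew": data_skew, "long_tasks": long_tasks}


def diagnose(job):
    """Run every heuristic and build a small report with tagged problems."""
    results = {name: rule(job) for name, rule in HEURISTICS.items()}
    flagged = {name: sev.name for name, sev in results.items() if sev > Severity.NONE}
    return {
        "job_id": job.job_id,
        "overall": max(results.values()).name,
        "flags": flagged,
    }


if __name__ == "__main__":
    job = JobMetrics("job_001", [100, 100, 100, 1000], [60.0, 60.0, 60.0, 60.0])
    print(diagnose(job))
```

Taking the worst individual result as the overall rating is one simple design choice; it means a single badly skewed task is enough to flag the whole job for review.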

LinkedIn said a batch of new features has been added since the tool’s inception in 2014, based on suggestions from Hadoop experts and experienced users. Among them are configurable rule-based heuristics for diagnosing jobs along with diagnostic heuristics for MapReduce and Spark. A REST API also has been added to fetch all information on Hadoop jobs.

LinkedIn said it uses Dr. Elephant to monitor how a workflow is performing on a cluster, to understand why a flow is running slowly and to troubleshoot and tune a flow to improve performance. It has also made the tool compulsory for developers, who must get a “green signal” from Dr. Elephant before a flow can run in production.

LinkedIn claimed Dr. Elephant solves about 80 percent of Hadoop workflow problems through “simple diagnosis.”
