Follow Datanami:
July 23, 2014

Enforcing Hadoop SLAs in a Big YARN World

The Apache Hadoop community has done a truly amazing job developing a scalable and versatile platform for big data analytic workloads. And with the recent introduction of YARN in Hadoop 2, we’re now able to run multiple analytic engines on our clusters simultaneously. Unfortunately, the prospect for resource contention has also gone up, and that will likely increase demand for service level agreement (SLA) enforcement.

YARN made its big introduction just as companies started to move their Hadoop deployments out of the science-project phase and into production workloads. When companies start to actually use new applications “in anger,” as it were, a funny thing happens: They start demanding that certain enterprise features be present. Security has been one of Hadoop’s biggest bugaboos up to this point, which is driving the community to develop open source security projects, and leading distributors to acquire Hadoop security startups, as Cloudera and Hortonworks have.

Another enterprise feature that companies will demand will be the capability to enforce SLAs on individual jobs on a granular basis, just as they do with transactional systems. The various Hadoop distros do offer some SLA monitoring features. But according to the folks at Hadoop SLA startup Pepperdata, they don’t go far enough in managing the hardware resources to meet SLAs.pepperdata_logo

Pepperdata’s eponymous product is like a strict traffic cop enforcing stringent SLAs on the Hadoop highway. It works by monitoring every Hadoop job to see how much CPU, memory, disk, and network capacity it uses. If a low priority job is consuming more than its fair share of resources–as defined by policies the user sets up–the software automatically throttles back the offending job, thereby leaving more resources to high priority jobs.

“We’re not doing the same thing distros do,” says Sean Suchter, the CEO and co-founder of Pepperdata. “They do a great job getting the Hadoop clusters up and running and getting it installed on all your nodes. We’re more focused on the performance of Hadoop, and whether it can be used in cases where you have SLAs and you’re really counting on your cluster to run an ETL job every 15 minutes, no matter what else is happening.”

Pepperdata’s software works automatically and in real-time. If it detects a resource contention issue, it can pull certain strings at the kernel or JVM levels to alleviate that specific contention, whether its disk I/O or network I/O or what have you.

Nothing gets by Pepperdata. “Every attempt from everybody on the cluster to use any bit of hardware” is monitored and potentially managed, Suchter says. “We’re aware of all this second by second, which of course the traditional Hadoop system doesn’t do at all. And when we see the conflicts, where a high priority guy who needs an SLA is getting stomped on by somebody with a low priority, like an ad hoc workload, we reach into the ad hoc workload and slow it down just enough that the SLA can be met.”

This fine-grained management capability brings about a 1 percent overhead to the Hadoop cluster. That’s fine, since Pepperdata typically improves capacity utilization by about 60 percent, Suchter says. The software can do that because it ensures different workloads can co-exist peacefully and use the cluster resources without too much yelling and fighting.

pepperdata_screens

Pepperdata monitors a variety of aspects of Hadoop cluster performance

There are other approaches to meeting SLAs. One way is to optimize, tweak, and tune your most important Hbase, MapReduce, Hive, or Spark jobs so they run better and faster on the Hadoop cluster and finish in a shorter amount of time. There are various tools available on the market to help automate the tuning of Hadoop jobs. But that gets you only so far on a busy cluster.

The most common way to meet stringent SLAs is to build a dedicated cluster for the most important jobs. “That’s a way to get SLA guarantees, by completely isolating the hardware,” Suchter says. “The downside to that is it’s ridiculously inefficient, it’s hard to manage, and you lose a heck of a lot of capacity.”

Pepperdata can also assist with capacity planning issues, says Chad Carson, co-founder and vice president of product for the company. “You don’t get out of capacity planning with us because get an instant capacity boost,” he says. “But when you do need an increase, [Pepperdata can tell you] I do or don’t need to buy buffer network, or more spindles or more disk.”

The advent of YARN is a double-edge sword that will inevitably be good for software company’s in the SLA enforcement business. The good part about YARN is that it allows multiple engines to run on Hadoop simultaneously. So we’ll have MapReduce and Hive jobs running at the same time Spark and HBase jobs are running. The downside is that, while YARN ensures that each job has what it needs to start running on the cluster, it doesn’t control how the job will finish.

“We’re really excited. We want everybody to get onto YARN because the faster people get onto YARN, the more likely they’re going to see more things,” Suchter says. “But they still have challenges putting them all on one cluster, because they can still stop on each other at runtime, and they do.”

Pepperdata has been around for a couple of years now, and its product has been available since 2013. About 25 customers are using the software, including some big 1,000-node plus accounts in the social media space, as well as customers in retail and finance. In April it raised $5 million in Series A funding led by Signia Venture Partners and Webb Investment Network, to go along with other funding by Yahoo co-founder Jerry Yang and Ed Zander, the former CEO at Motorola, among others.

The company is in its early stages of growth, and has the potential to grow its business as Hadoop systems move into production “There’s plenty of market we’re able to sell to,” Suchter says. “This is a very viable business and growing rapidly.”

Related Items:

Pepperdata Raises $5M in Funding

Shining a Light on Hadoop’s ‘Black Box’ Runtime

Keeping Tabs on Amazon EMR Performance

 

Datanami