June 11, 2012

Cloudera Plots Enterprise Invasion

Nicole Hemsoth

The race for market share among those pitching supported distributions of Hadoop is at its height. Major players in the distro game, including Hortonworks and MapR, are pushing forward with minor tweaks and approaches that tilt the scale, shaking up the still-burgeoning ecosystem around the open source platform.

Among the top runners is Cloudera, which has managed to diversify its offerings with an extensive Hadoop training and certification racket–not to mention some solid use cases for its platform. The company plans to widen the distro-difference gap with a CDH update they claim is enterprise-steeled with enhanced availability, security and integration features.

If the response to Hadoop has been tepid for enterprise users in the past, Cloudera thinks the major concerns, including availability, security and monitoring capabilities, have been put to rest enough that they’re willing to stake the success of their latest platform release (CDH4) on the enterprise-ready assertion.

For a Hadoop distribution to be deemed truly “enterprise-ready”, it needs to go far beyond offering a mere collection of components, says Omer Trajman, Cloudera’s VP of Technology Solutions. He says that an enterprise Hadoop distribution requires cohesion—reliability, extensibility, integration with updates, backwards compatibility, and beefed-up support.

As the Cloudera technology solutions lead told us during an extended conversation in the wake of the CDH4 release, when it comes to enterprise-ready Hadoop, customers are looking for a complete solution that closely couples all the elements—the integration and access, the storage, the compute, and of course, the management capabilities.

To these ends, the company announced Cloudera Manager, which they claim will propel enterprises forward in their ability to seamlessly integrate, manage and run away with Hadoop.

Trajman insists that with CDH4 and the new management layer, there are no untested waters. He told us there are no alterations to the core; it’s all Apache under the hood. What it does bring to the table, he argues, is a robust management layer that his company hopes will simplify Hadoop deployment, configurations, management and monitoring.

Cloudera focused on the management layer with the two top user complaints in the forefront: configuration and monitoring. Trajman said that when it came to configuration, these users didn't want to dig through the code base to sort out a smattering of configuration issues.

Secondly, he says that before, systems would operate like a loosely coupled federation of components. “Even though there’s a tight relationship between MapReduce and HDFS, they can run independently. HBase runs on HDFS but it’s not tightly coupled with it, so if I’m running a MR job with Hive or Pig, they’re all compatible and work well together, but the user couldn’t see all this easily.” He says these monitoring issues have been worked out via a cohesive monitoring view.

The real star of the CDH4 show, however, is the touchy matter of high availability—a topic that plagues Hadoop conversations and is the basis for some initial trepidation about the platform among mission-critical enterprise users.

Trajman boasts about the high availability focus, stating that when it comes to downtime, users don’t care why their systems are down (planned or unplanned)—they just want them back up as soon as possible. He says that data nodes have always been very reliable and available, but historically you’ve had to pay special attention to the name node and make sure that it has a reliable system under the covers.

As he told us, “if that name node fails, it takes some serious time to come back, which meant that operators had to carry around a pager and be on alert all the time—just in case.” He says that with CDH4, if the active name node fails, the backup takes over and becomes the active. Then when users restore the other one, it becomes the new backup name node (and so on). That closes the era of always-on operation and pager-and-cell-phone availability.
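The active/standby handoff Trajman describes can be sketched as a toy state machine. This is purely illustrative—the node names are made up, and it is nothing like Cloudera's actual implementation, which coordinates real name-node processes rather than labels in memory:

```python
# Toy sketch of active/standby name-node failover as described in the article.
# Node names ("nn1", "nn2") are hypothetical, for illustration only.

class NameNodePair:
    def __init__(self, active, standby):
        self.active = active
        self.standby = standby

    def fail_active(self):
        """The active name node dies: the standby is promoted immediately."""
        failed = self.active
        self.active = self.standby
        self.standby = None          # no backup until the failed node returns
        return failed

    def restore(self, node):
        """A repaired node rejoins the pair as the new backup."""
        self.standby = node

pair = NameNodePair(active="nn1", standby="nn2")
failed = pair.fail_active()          # nn2 takes over as the active
pair.restore(failed)                 # nn1 comes back as the new standby
print(pair.active, pair.standby)     # nn2 nn1
```

The point of the design is that the roles rotate: whichever node comes back last is simply the next backup, so no single machine needs babysitting.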

He also points out that when it comes to planned downtime, CDH has added support for heterogeneous clusters, which makes it possible to roll out a new version piecemeal without bringing down the entire cluster.
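That piecemeal rollout is what's usually called a rolling upgrade, and the mechanics can be sketched in a few lines. Nothing here is Cloudera-specific—the node list and version strings are invented for illustration:

```python
# Hedged sketch of a rolling upgrade on a version-heterogeneous cluster:
# nodes are taken down, upgraded, and returned one at a time, so the rest
# of the cluster keeps serving throughout. Node/version names are made up.

def rolling_upgrade(cluster, new_version):
    for node in cluster:
        node["up"] = False                       # take just this node offline
        serving = sum(1 for n in cluster if n["up"])
        assert serving == len(cluster) - 1       # everyone else still serving
        node["version"] = new_version            # mixed versions are tolerated
        node["up"] = True                        # back online before the next

cluster = [{"name": f"node{i}", "version": "cdh3", "up": True} for i in range(4)]
rolling_upgrade(cluster, "cdh4")
print({n["version"] for n in cluster})           # {'cdh4'}
```

The heterogeneous-cluster support is what makes the middle step legal: old and new versions coexist while the upgrade walks across the nodes.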

So the question for many outside of the Hadoop loop is why there’s so much money being tossed around for code that’s completely free from Apache. Trajman says that while its distribution CDH is entirely open source, “the real value here isn’t the code, it’s the specific combination of what [Cloudera] chooses to gather and bring in and how well it gets integrated.”

For instance, he points back to the statement about the hassle of working with a collection of components, each with their own needs versus a cohesive platform. He says that without the integration, you’re collecting data, say with Flume, then maybe you’ll Flume that over to HBase, then process it with Pig, then snap it into a Hive table, then export it with something else… and so on.
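The hand-to-hand pipeline he describes can be caricatured as a chain of stand-in functions. None of these are real Flume, HBase, Pig, or Hive APIs—each function is a hypothetical placeholder for a separate tool the operator would otherwise have to wire up by hand:

```python
# Caricature of the unintegrated pipeline described above. Each function is a
# made-up stand-in for a separate tool; the point is counting the hand-offs.

def collect_with_flume(events):          # stand-in for Flume ingestion
    return [e.strip() for e in events]

def load_into_hbase(rows):               # stand-in for an HBase sink
    return {i: row for i, row in enumerate(rows)}

def process_with_pig(table):             # stand-in for a Pig transformation
    return [v.upper() for v in table.values()]

def publish_as_hive_table(rows):         # stand-in for a Hive export
    return {"hive_table": rows}

raw = [" click ", " view "]
result = publish_as_hive_table(
    process_with_pig(load_into_hbase(collect_with_flume(raw))))
print(result)                            # {'hive_table': ['CLICK', 'VIEW']}
```

Four tools, three hand-offs, and each boundary is a place where formats, versions, or permissions can disagree—which is exactly the glue work an integrated distribution claims to absorb.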

While all of this is possible with enough patience and skill, it is not how enterprise Hadoop operators would want to spend their valuable (and well compensated) time.

On that note, operators don’t want to spend time worrying about security issues either, says Trajman, who claims that elements of Hadoop are still something of a “free for all.” HBase, for instance, has been outfitted with both column and table-level permissions while other pieces of the Hadoop pie, including the scheduler, have been dosed with an extra spoonful of security in a multi-tenant world.

As Trajman said, “The fair scheduler feature, which decides who gets to run jobs on the cluster and where the jobs run, can be grouped so that now, only certain users or groups can submit to certain resource pools.” He says that this means users can now “carve up a cluster and allocate only a certain amount to a particular userbase.”
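The carve-up he describes amounts to per-pool access control. A simplified model—not the actual Fair Scheduler configuration format, and with invented pool and user names—might look like this:

```python
# Simplified model of pool-level submission ACLs as described in the article.
# Pool names, shares, users, and groups are all hypothetical.

POOLS = {
    "analytics": {"share": 0.30, "allowed": {"alice", "analysts"}},
    "etl":       {"share": 0.50, "allowed": {"bob", "ops"}},
    "adhoc":     {"share": 0.20, "allowed": {"*"}},   # open to everyone
}

def can_submit(principal, pool):
    """Only listed users/groups may submit jobs to a given resource pool."""
    allowed = POOLS[pool]["allowed"]
    return "*" in allowed or principal in allowed

print(can_submit("alice", "analytics"))  # True
print(can_submit("alice", "etl"))        # False
print(can_submit("carol", "adhoc"))      # True
```

Each pool also carries a share of the cluster, so restricting who can submit to a pool effectively caps how much capacity a given userbase can consume.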

In addition to being at the core of use cases that run to thousands of nodes, Cloudera claims that CDH is unique for large-scale users because, unlike others (of course, they wouldn’t name names here), they didn’t just copy the Apache code and tweak it a little, or take other versions to build a new distro. Cloudera is making the claim that this is the one package that users can straight-up install and run without figuring out the pieces and component gaps.

They point to their always-on support, which can see what the cluster is doing at the moment of need, and claim that these staff are highly trained and, given the company’s geographic dispersal over the last couple of years, usually in the customer’s time zone.


It’s hard for a company like Cloudera, which offers its basic software (CDH) for free, to track how many users are tapping its own brand of Hadoop. But for what it’s worth, Trajman says that his company is becoming the standard for supported Hadoop distros, surpassing companies like Hortonworks and MapR, whose offerings he argues differ by margins too close to call outside of messaging.

Here, by the way, is the breakdown between Cloudera Manager Free Edition and Cloudera Manager Enterprise Edition.

Trajman notes that when it comes to differentiating among the Hadoop distro vendors with supported models, his company “staked a claim on the enterprise from the beginning.” He says that “Hortonworks has a certain spin, MapR has a certain spin, but we’ve always been strictly focused on enterprise, and I think this was a good move in retrospect.”

Like many others we’ve talked with over the course of the last several months, he also noted the need for an independent third-party study that tries to delve into the real market share for each distro (not to mention Apache’s core). Until then, each vendor in the distro business claims superiority—a claim that’s hard to substantiate with no numbers, an open source framework, and, well, just plain marketing b.s.

“If you run Cloudera Enterprise, you could have one or two admins for a few-hundred node cluster. You’d need about five times as many people in order to manage that cluster without the enterprise version—and then twice as many if you’re going to do your own custom builds.”
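Taking Trajman's figures at face value (they are his claims, not independently verified numbers), the headcount arithmetic works out as follows:

```python
# Back-of-the-envelope arithmetic from Trajman's quote: 1-2 admins with
# Cloudera Enterprise for a few-hundred-node cluster, about five times as
# many without it, and twice that again for custom builds. His figures.

with_enterprise = 2                          # "one or two admins"
without_enterprise = with_enterprise * 5     # "about five times as many people"
with_custom_builds = without_enterprise * 2  # "twice as many" on top of that

print(with_enterprise, without_enterprise, with_custom_builds)  # 2 10 20
```

So by his own math, the same cluster would need on the order of ten admins without the enterprise tooling, and twenty if the shop also rolls its own builds.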

Related Stories

Six Super-Scale Hadoop Deployments

Open Source Testbed Targets Big Data Development

Inside LinkedIn’s Expanding Data Universe
