April 24, 2015

Hadoop’s Next Big Battle: Apache Versus ODP

Alex Woodie

 

When the Open Data Platform launched in February, it effectively split the Hadoop community down the middle, with Hortonworks, Pivotal, and IBM throwing in with the ODP, and MapR and Cloudera keeping their chips on the Apache Software Foundation. The future of Hadoop is still very much an open question, and how the ODP-ASF split plays out could be a big factor in Hadoop’s future.

Everybody seems to agree that the Apache Software Foundation has done an admirable job of getting Apache Hadoop to this point. The open source model, which relies on a community of developers to determine how a product should be developed, is one of Hadoop’s biggest strengths. In fact, open source is viewed as the primary engine driving much of the innovation occurring in information technology today, not just in Hadoop but practically everywhere else in the emerging hyper-scale world, including SQL and NoSQL databases, Spark, and OpenStack.

But there is a definite disagreement in the Hadoop community about how the platform should evolve from this point forward. Those who’ve joined the ODP say there’s a need for a second governing body to set a common standard for Hadoop that enterprises and third-party software vendors can stake their claims on. Those who have rejected the ODP say there is no need for a second body, that the ASF already handles that job.

QA for Hadoop

WANdisco provides an interesting case study in the need for standards. The software company, which develops high availability and data replication solutions for Hadoop, recently joined the Open Data Platform as a founding member. The company’s CEO, David Richards, shared his views on the ODP-ASF divide with Datanami during an interview last week.

According to Richards, the fast pace of development in core Hadoop was creating an unnecessary burden on his development staff, particularly for NonStop Hadoop, the company’s flagship product that altered some of the low-level functioning of Hadoop’s NameNode to eliminate it as a single point of failure.

“The Hadoop ecosystem was developing so fast, it felt like dog years,” Richards says. “There was a new release coming out of the distribution vendors every few weeks. That required us to look at how invasive we could be inside Hadoop itself, inside the NameNode.”

That invasiveness with Hadoop’s NameNode took a toll on WANdisco and forced it to engage in lengthy certification efforts with the Hadoop distributors, which were necessary to assure WANdisco’s enterprise customers that NonStop Hadoop functioned properly. Due to the time and expense, the company focused its certification efforts on Cloudera and Hortonworks.

WANdisco eventually abandoned that invasive approach and moved the replication off Hadoop to a proxy server with a new flagship offering, called WANdisco Fusion, which it launched earlier this week. The deep certification sessions are a thing of the past with the new offering.

While WANdisco solved its problem by going off the reservation, so to speak, Richards doesn’t see that as a suitable option for other third-party software vendors looking to ride the Hadoop train to riches and glory. “What’s really important for the market and for vendors is consistency. And by that I mean a consistent set of APIs,” he says. “You can’t have everything moving and changing every five minutes.”

That’s the role that Richards envisions the ODP playing for Hadoop going forward: establishing a standard set of programming interfaces that enterprise Hadoop users and Hadoop software vendors can rely on. In Richards’ view, the technical innovation continues to occur at the ASF level, but the ODP serves as a quality assurance (QA) filter to standardize how Hadoop exposes itself to the outside world.

“The ODP takes the downstream build from the ASF,” Richards says. “The ODP isn’t a development platform. There’s no engineering happening. The engineering platform is the ASF.”

Redundancy in the Stack

Needless to say, that view doesn’t jibe with the two Hadoop distributors who are conspicuously absent from the ODP board: MapR Technologies and Cloudera. In a blog post yesterday, MapR CEO John Schroeder laid out his objections to the ODP.

“MapR was invited to participate in the Open Data Platform initiative and declined after carefully considering the value to the market place,” Schroeder writes in the blog post. “The announced Open Data Platform benefits Hortonworks marketing and provides a graceful market exit for Greenplum Pivotal.” (Pivotal, of course, has effectively ceased to be a Hadoop distributor; its platform will essentially be Hortonworks going forward.)

MapR co-founder and CEO John Schroeder

Schroeder went on to list several concerns about the ODP. Chief among those are that the ODP “is redundant” with the ASF and that it “solves problems that don’t need solving.” “Companies implementing Hadoop applications do not need to be concerned about vendor lock-in or interoperability issues,” Schroeder says, citing a Gartner survey that found fewer than 1 percent of companies were concerned about lock-in or interoperability.

“Project and sub-project interoperability are very good and guaranteed by both free and paid-for distributions,” Schroeder writes. “Applications built on one distribution can be migrated with virtually zero switching costs to the other distributions.”

The cost of joining the ODP is a sticking point for Schroeder, who raised the question of whether it’s “pay to play.” “The Open Data Platform has not disclosed how governance is done, but it is a different model than the preferred and fair meritocracy used by the Apache Software Foundation,” Schroeder writes.

Those views echo what Cloudera CTO Mike Olson has said about the ODP. “As a vendor-driven consortium, membership is only for enterprises with serious money–it ought to be called the ‘Only Dollars Play’ alliance,” Olson wrote in a blog post in February.

MapR’s Schroeder also objects to the ODP’s definition of Hadoop. The focus on MapReduce, YARN, Ambari, and HDFS is biased towards vendors, he says. “HDFS was built to serve as secondary storage for batch Hadoop processing,” he writes. “Many production use cases requiring POSIX-compliant storage replace HDFS with MapR, IBM GPFS, EMC Isilon, or NetApp.”

In the end, the ODP can’t hope to have much of an impact without MapR and Cloudera, which account for about 75 percent of the Hadoop implementations up to this point, Schroeder says. “The Open Data Platform without MapR and Cloudera is a bit like one of the Big Three automakers pushing for a standards initiative without the involvement of the other two,” he writes.

ODP Counterpoint

Richards doesn’t agree. “I think what Cloudera are trying to do today is to say it’s Apache versus ODP, which is absolute total horse****,” he says. “It’s Apache and the ODP. We’re full-blown members of the Apache Software Foundation, and a number of other companies who are members of ODP are too. But unlike other projects like Subversion or the Apache Web server, Hadoop is really complicated–really, really complex. And it needs some platform in which it can become consistent.”

Richards argues that keeping a standard stable against a background of constantly changing complexity is not an easy thing to do. While the ASF has proven itself capable of being the engineering driver for Hadoop, in his view it has failed to adequately govern itself. “The pace of innovation needs to continue. But what can’t continue is a completely dynamic, ever-changing thing without definition. Companies won’t accept it. They just won’t use it.”

With so many moving parts, Hadoop is not a single product so much as a bill of materials, Richards says. “The rest of the community needs to understand what the de facto bill of materials is,” he says. “That, I think, is what the ODP is trying to do. We absolutely don’t want to be part of anything that even pretends to compete with Apache. But what the market has to have is consistency, and that’s what the ODP is bringing. I wish that all the Hadoop companies were members of it. But obviously it’s not something that Cloudera [or MapR] particularly sees. But I think they should.”

Where Does Hadoop Go from Here?

How this plays out is anybody’s guess. While MapR and Cloudera are panning the idea that Hadoop is an overly complex thing that needs to be reined in and made palatable to the wider world, that is definitely a concern that Datanami is hearing from Hadoop customers and software developers.

The ASF did a great job of shepherding Hadoop version 2 to fruition and making YARN the center of an extensible Hadoop world. But now that Hadoop is growing up and is poised to become a more general purpose platform with a bigger audience, it may be time to rethink the semi-chaotic nature of Hadoop development up to this point.

Clearly MapR and Cloudera are worried that Hortonworks is trying to use the ODP to solidify its position as “the” standard for open source Hadoop and to marginalize their offerings, which is a legitimate concern, especially with Pivotal’s exit coinciding with the ODP’s founding. This obviously complicates things for everybody involved.

The battle of Hadoop is taking place at many levels, including at the ASF and its many sub-projects, and among the Hadoop distributors. With ODP now set to create its own “super-standard” for Hadoop, the Hadoop community faces the real potential of a fork in the spec, and it’s unclear if that benefits anybody.

Related Items:

Does Hadoop Need a Reality Check?

Making Sense of the ODP—Where Does Hadoop Go From Here?
