ODPi Defines Hadoop Runtime Spec; Operations Up Next
Today the ODPi issued the first set of documents that describes a standard distribution of basic runtime components for Hadoop, including YARN, HDFS, and MapReduce. Going forward, the organization is preparing a management specification for Hadoop as it considers which Hadoop problem area it will tackle next.
The ODPi was founded a year ago on the eve of the Spring Strata + Hadoop World conference as the Open Data Platform initiative to help rein in some of the complexity that’s impacting Hadoop distributors, software vendors, and users. The problems stem primarily from the rapid proliferation of Hadoop ecosystem components and the ongoing development of existing ones.
With upwards of two dozen separate components making up a distribution, developers and quality assurance (QA) testers were struggling to ensure compatibility from one Hadoop distribution to another, among the various Hadoop components, and with third-party products. One major Hadoop distributor reputedly employs 40 people just to ensure compatibility among products. By standardizing the Hadoop stack, the ODPi hopes to boost compatibility, cut down on complexity, and reduce the need for testing; these are becoming big problems that threaten to slow adoption of the platform.
The ODPi Runtime Specification issued today comprises three components: a document describing the standard, a reference implementation of Hadoop based on version 2.7 from the Apache Software Foundation, and a validation and test suite that customers and vendors can use to ensure their software is compatible with the new spec.
The big goal with this release was to cover some of the basic stuff around Hadoop, says John Mertic, senior program manager for the ODPi. “We said, ‘Let’s get some of the simple stuff out of the way,’” he tells Datanami. “As you look at it on the surface, it doesn’t look like a ton of meat. But if you look at the issues it addresses, it’s actually fairly useful.”
The spec defines a standard way that Hadoop should be set up and configured, covering details such as the naming of JAR files, the locations of files, and the presence of standard APIs.
“It seems like it’s fairly obvious, but these are actually really big pain points that ISVs have been running into, and it helps [save] many, many development and QA hours,” Mertic says. “Let’s ensure here that vendors aren’t changing public APIs in wild fashion. Let’s ensure that an ISV can look at a Hadoop cluster and be able to tell [which] vendor provided it, and a number of other simple things.”
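To make the idea concrete, here is a minimal sketch of the kind of check a validation suite might automate; the specific naming convention shown (versioned JARs paired with unversioned aliases) is an illustrative assumption, not quoted from the ODPi spec:

```python
import re

# Hypothetical convention for this sketch: every versioned artifact such as
# hadoop-common-2.7.1.jar should ship alongside an unversioned alias
# (hadoop-common.jar) that ISV code can link against across distributions.
VERSIONED_JAR = re.compile(r"^(hadoop-[a-z-]+)-\d+\.\d+\.\d+.*\.jar$")

def check_jar_names(jar_files):
    """Return versioned JARs that lack an unversioned counterpart."""
    names = set(jar_files)
    missing = []
    for jar in jar_files:
        match = VERSIONED_JAR.match(jar)
        if match and f"{match.group(1)}.jar" not in names:
            missing.append(jar)
    return missing

# hadoop-hdfs has no unversioned alias here, so it gets flagged.
jars = ["hadoop-common-2.7.1.jar", "hadoop-common.jar", "hadoop-hdfs-2.7.1.jar"]
print(check_jar_names(jars))  # -> ['hadoop-hdfs-2.7.1.jar']
```

A real compliance suite would scan an installed cluster’s directories rather than a hard-coded list, but the principle is the same: mechanical checks replace the manual QA hours Mertic describes.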
Later this year, the ODPi plans to issue its Operations Specification, which digs a little deeper into other parts of the Hadoop stack, in particular Apache Ambari, the management interface used by administrators to provision, manage, and monitor Hadoop clusters. The Operations Spec will define a standard way that Hadoop should be configured for security, for high availability, and for cloud or on-premises deployments, says Roman Shaposhnik, director of open source at Pivotal, one of the founding members of ODPi.
ODPi had initially planned to include Ambari in its initial release, but decided to wait. “We’re taking a little bit more time to get it right,” Shaposhnik says. “So far everybody has been focused on operating Hadoop in a data center, but there is a tremendous amount of need to standardize how Hadoop gets managed in the cloud. And that actually again is what we’ll be trying to address in the management spec, and that’s why it’s taking more time for us and pushing it into the second release.”
Hortonworks (NASDAQ: HDP), one of the founding members of ODPi, relies heavily on the open source Ambari tool, while Cloudera, which is not an ODPi member, relies on its own Cloudera Manager software.
The ODPi expects to adopt additional Hadoop-related projects into its specification program with every release or every other release, Mertic says. It’s up to the ODPi members to vote on which open source projects make it into the release, he says. “We’re starting to see rumblings of areas to focus on,” he says. “Clearly Spark, HBase, and Hive come up. Those are things that I’ve heard and seen starting to get thrown around.”
Slightly different project names came up at an ODPi meeting held last year, where attendees were encouraged to cast informal votes for the projects they’d most like to see in the next release.
“The component that bubbled up to the top was Kafka,” Shaposhnik says. “Then of course everybody was talking about Spark. That’s obvious, I guess. But people were also saying it would sure be nice if we had a really nice compliant SQL on Hadoop solution.”
The ODPi currently has 25 members. Software companies must be paying members of ODPi to display the ODPi seal of approval on their products, but anybody can download and test their software or cluster for compatibility.