ODPi Tackles Hive with Latest Hadoop Runtime Spec
ODPi today unveiled the second major release of its Runtime Specification, which is aimed at setting a standard for Hadoop components to ensure greater interoperability among distributions and third-party products. New additions to the spec include Apache Hive and the Hadoop Compatible File System (HCFS). ODPi also announced that more ISVs have committed to interoperability testing.
Hadoop isn’t a relational database. But the familiarity that many business analysts have with SQL is helping to drive the popularity of SQL-on-Hadoop solutions, such as Hive. While every major vendor has its own flavor of SQL for Hadoop, Apache Hive, as the oldest relational data store for Hadoop, is arguably the most widely deployed, and (for better or for worse) continues to be the standard by which other SQL-on-Hadoop solutions are measured.
So it was no great surprise to see ODPi tackle Apache Hive consistency with its Runtime Specification 2.0, which it announced today in advance of the Strata + Hadoop World show taking place this week in New York City.
The Hive spec released by ODPi is based on Apache Hive version 1.2, which is the latest release of the distributed relational data store for HDFS. The organization says the standard will “reduce SQL query inconsistencies across Hadoop Platforms” and ensure that core Hive functionality will continue to behave in a predictable way as future versions of Hive are released.
Meanwhile, the addition of HCFS to the ODPi Runtime Specification is seen as boosting interoperability for Hadoop distributors, other software vendors, and cloud service providers that want to use file systems other than HDFS in their Hadoop clusters.
HCFS was established by the Apache Hadoop project to define how other file systems can work with Hadoop components, such as MapReduce and Hive. According to the Apache Hadoop wiki, active development in the HCFS project currently includes GlusterFS, OrangeFS, SwiftFS, and GridGain. Other file systems that are involved in HCFS include Windows Azure BLOB Storage, the CassandraFS, CephFS, CleverSafe Object Store, Google Cloud Storage Connector, Lustre, the MapR FileSystem, Quantcast File System, and the Veritas Cluster File System.
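The interoperability HCFS enables comes from a single idea: Hadoop components code against one abstract file-system interface, and each backend supplies its own implementation, resolved by URI scheme. The sketch below illustrates that pluggable pattern in miniature; the class and function names are hypothetical illustrations, not Hadoop's actual Java API.

```python
from abc import ABC, abstractmethod

# Illustrative sketch of the pluggable-filesystem pattern HCFS relies on.
# Components call the abstract interface; backends (HDFS, S3, GlusterFS, ...)
# each provide a concrete implementation. Names here are hypothetical.

class FileSystem(ABC):
    @abstractmethod
    def open(self, path: str) -> bytes:
        """Read the contents of a file."""

    @abstractmethod
    def scheme(self) -> str:
        """URI scheme this backend handles (e.g. 'hdfs', 's3')."""

class InMemoryFS(FileSystem):
    """Toy backend standing in for a real storage system."""
    def __init__(self):
        self._files = {}

    def put(self, path: str, data: bytes):
        self._files[path] = data

    def open(self, path: str) -> bytes:
        return self._files[path]

    def scheme(self) -> str:
        return "mem"

# A registry maps URI schemes to backends, analogous to how Hadoop
# resolves "hdfs://" or "s3a://" URIs to concrete FileSystem classes.
REGISTRY = {}

def register(fs: FileSystem):
    REGISTRY[fs.scheme()] = fs

def get_filesystem(uri: str) -> FileSystem:
    scheme = uri.split("://", 1)[0]
    return REGISTRY[scheme]

fs = InMemoryFS()
fs.put("/data/part-0", b"hello")
register(fs)
print(get_filesystem("mem:///data/part-0").open("/data/part-0"))  # b'hello'
```

A MapReduce job written against the abstract interface never needs to know which backend serves its input, which is why a spec-level contract for the interface matters to vendors.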
While HDFS is the primary file system used in Hadoop clusters, it’s by no means the only one. MapR has extended HDFS to be compatible with NFS via its proprietary MapR File System, Amazon (NASDAQ: AMZN) uses its S3 object store as the backend for its Elastic MapReduce service, and IBM (NYSE: IBM) supports GPFS with its BigInsights distribution of Hadoop.
“The trend we are seeing amongst those who provide Hadoop platforms is that a key piece of differentiation is the underlying filesystem,” says John Mertic, the Director of Program Management for the ODPi. “This is especially true for cloud vendors. It makes little sense for them to optimize for HDFS when they have a block/object store available that is much better to leverage for their infrastructure.”
Setting a standard implementation for HCFS will help storage and cloud vendors leverage their native storage solutions as part of an ODPi Runtime Compliant Hadoop Platform, and thereby reduce the incompatibilities that end-users face, ODPi says.
Meanwhile, ODPi announced that more big data software vendors have committed to running their products through the ODPi Interoperable Compliance Program. The new vendors committing to submit their products for compliance testing include DataTorrent, Pivotal, SAS, Syncsort, WanDisco, Xavient, and Zettaset. The Apache Hadoop platforms from Altiscale, ArenaData, Hortonworks, IBM, and Infosys are currently ODPi Runtime Compliant, the organization says.
Hortonworks, Pivotal, and IBM were among the founding members of the Open Data Platform (ODP) when it launched just before the Strata + Hadoop World show in February 2015. The organization’s goal was to fight the increasing complexity in the Hadoop stack by providing a set of standards for core Hadoop components. Vendors would benefit by getting a “test once, use everywhere” standard.
While ODPi has issued two releases of the Runtime Specification, the group is still planning to release its first Operations Specification this year, Mertic says. Apache Ambari, which distributors like Hortonworks (NASDAQ: HDP) use as the main operations console for Hadoop, will be part of that spec, but not as the cornerstone piece, Mertic says.
“We spent much of the summer moving our focus from building a spec around Ambari to helping lay out the best practices of installing, configuring, and managing applications on a Hadoop platform,” he tells Datanami via email. “Early feedback has been quite positive with this shift, but it has resulted in us going back to the drawing board to a large degree. Coupled with the delay in the Ambari 2.4 release, we are definitely behind our initial planned release schedule, but the results should have a greater impact on the Hadoop/Big Data ecosystem.”