An Open Source Tour de Force at Apache: Big Data 2016
There’s no question about it: open source drives big data. Some may forget how the big data software we use or write about every day actually gets made. But here at the Apache: Big Data North America conference in Vancouver, British Columbia, the well of innovation that is the open source community is on full display.
Open source software is fundamental to big data, says Roman Shaposhnik, who runs the Apache Incubator project for the Apache Software Foundation (ASF), the main sponsor of this event. “In a way, open source has won in the enterprise,” says Shaposhnik, whose day job is director of open source at Pivotal. “It’s next to impossible to have a proprietary sale in the enterprise these days, unless it’s a value-add component on something that’s essentially an open source platform.”
Apache Hadoop is the big daddy of big data open source projects. Created at Yahoo 10 years ago by Doug Cutting based on ideas from a Google white paper, Hadoop sets the standard for success in open source big data projects. Hadoop’s success triggered a wave of innovation in the open source community, bringing us popular projects like Apache Spark, Apache Storm, and Apache Kafka.
But for every project that explodes onto the big data stage like Spark or Kafka, there are dozens of others at the ASF still waiting to be discovered, each trying to become the next big thing. So far, about three dozen of them have been granted full top-level status by the ASF, according to the foundation’s list of Apache projects, while about twice that number of so-called “podlings” are still in the incubation stage under Shaposhnik’s care.
Many of these projects–both top-level and incubating–are on display here at the Apache: Big Data conference (as well as some non-ASF big data projects, such as ScyllaDB). The backers of four projects incubating in the ASF took to the stage during last night’s “Big Data Shark Tank” event to prove their worth to three judges: Canonical’s lead product designer Mark Shuttleworth, the ASF’s VP of brand management Shane Curcuru, and Ampool’s founder Milind Bhandarkar. The sharks considered:
- Apache Geode (incubating), which is based on GemFire, the distributed in-memory database that Pivotal decided to make open source. Geode is unique among incubating projects in that it has a long track record of success in the real world. That goes to show that Apache projects don’t have to be “new” to be successful–they just have to do something useful (and of course they have to be open source).
- Apache MADlib (incubating), a collection of SQL-based machine learning algorithms for use in Hadoop, Apache HAWQ (also incubating), the Greenplum database, and PostgreSQL. MADlib, which comes out of academia, moved into the ASF incubator last fall and is moving toward general release.
- Apache Streams (incubating) also went before the sharks. This piece of software is based on a commercial product called Activity Streams that essentially creates schemas for fast-moving social data, making it easier for big data practitioners to input and share social data originating in Twitter, Facebook, Instagram, etc.
- Apache S2Graph (incubating), a high-performance graph database residing atop Apache HBase, was also presented. However, instead of discussing S2Graph, the project representative, “Jo,” used his allotted five minutes to discuss the plight of South Korean programmers and the poor state of Google’s Korean-to-English translation algorithms, which pale in comparison to the Japanese-to-English algorithms. (Disguised behind sunglasses and a hoodie, Jo hypothesized that the superior dialog of Japanese pornography was responsible for the advances in Google algorithms, which just goes to show you that open source developers can be unpredictable and funny, even if a little twisted.)
These represent just the tip of the iceberg in terms of interesting big data projects happening at the ASF. From one-man shows hoping to expand to successful projects looking for an open source boost, the ASF projects come in all shapes and sizes.
Other big data projects that are being presented here this week include:
- Apache SAMOA (incubating), a distributed machine learning framework;
- Apache Marmotta, a library for extending SPARQL constructs to geospatial data;
- Mnemonic, an Apache proposal that presents an in-place structured data processing and computing library for Java apps;
- Apache Kerby, a Kerberos Java binding;
- Apache Eagle (incubating), a security monitoring framework for Hadoop developed by eBay;
- Apache Kylin, a distributed SQL-based OLAP engine for Hadoop;
- Apache Trafodion (incubating), a transactional SQL-based relational database for Hadoop originally developed at Hewlett-Packard;
- Apache Unomi (incubating), a framework developed by Jahia that aims to standardize the personalization of online experience while promoting ethical Web experience management;
- Apache Beam (incubating), an open source implementation of Google’s Cloud Dataflow API for integrating batch and streaming development;
- Apache Impala (incubating), a SQL-based data warehouse that sits atop HDFS, contributed by Cloudera.
There is clearly a wealth of developer talent out there building new software with the hope that it helps somebody do something better or faster–or even becomes the next Hadoop. Many, if not most, of these projects are going through the ASF, which shows you how valuable the ASF has become not only to the big data developer community, but to all of us who benefit from the applications and services riding atop this technology. And for that, we all owe a debt of thanks.