The big data community has a secret weapon when it comes to innovation: open source. The granddaddy of big data, Apache Hadoop, was born in open source, and its growth will come from continued innovation done by the community in the open. Here are eight open source projects generating buzz in the community right now.
1. Apache Zeppelin
No other big data project at the moment is as popular as Apache Spark, the in-memory analytics framework developed at UC Berkeley’s AMPLab. But it’s not always easy to work with Spark apps as an end user. That’s where Apache Zeppelin comes in.
Zeppelin essentially provides a Web front-end for Spark. The mighty Zep takes a notebook-based approach, letting users discover, explore, and visualize data from Spark apps interactively. The software, which is modeled on the IPython notebook, supports Spark and other frameworks, such as Flink, Tajo, and Ignite.
Zeppelin was developed by NFLabs, a South Korean big data software company (not a football researcher), and is currently incubating as a project at the Apache Software Foundation. Hortonworks is including a technical preview of Zeppelin in its upcoming HDP 2.3 release, and demonstrated how the software could be used in a trucking app during a keynote at this week’s Hadoop Summit in San Jose, California.
During an interview with Datanami, Hortonworks co-founder and architect Arun Murthy identified Zeppelin as one of the most promising Hadoop-related projects he’s keeping an eye on, along with Apache Flink and Project Apex (see below).
2. Apache Flink
Momentum is also building behind this distributed in-memory data processing framework, which can replace MapReduce in a Hadoop cluster and fuses batch and streaming analytics.
The strength of Apache Flink lies in the speed of iteration. The faster data scientists can finish a job, the quicker they can move on to the next problem. The software, which features Java and Scala APIs and runs atop YARN, could be just the ticket for fusing streaming analytics with historical analytics.
Flink was developed by the German company data Artisans and became a top-level project earlier this year. No Hadoop distributors are currently shipping Flink as a fully supported part of their distributions, but that will likely change as more people begin using it.
3. Project Apex
Last week, DataTorrent released the core of its real-time streaming product, dubbed RTS, into the open source realm as Project Apex. The YARN-compatible software is designed to replace Apache Storm and Apache Spark Streaming in the Hadoop stack.
Apex runs in a fault-tolerant manner and comes with more than 70 pre-built operators that Java developers can assemble to build their real-time workflows. The software is often deployed alongside Apache Kafka, which provides the real-time messaging bus to serve data. DataTorrent is working with Hortonworks on an effort known as Project Koya, which aims to get Kafka running directly on Hadoop via Slider.
DataTorrent’s John Fanelli says Apex holds an 18-month lead over Storm and Spark Streaming. Making the software open will help ensure wider adoption and continued innovation, he tells Datanami.
4. Heron
Twitter last week unveiled Heron as the successor to Apache Storm for its own internal streaming analytic system. Storm helped Twitter analyze huge amounts of data for years, and the company open sourced the software to the world in 2011, but it’s evident that at this point Storm is petering out.
Twitter’s main goals with Heron were to increase performance predictability, improve developer productivity, and ease manageability, Twitter Engineering Manager Karthik Ramasamy wrote in a blog piece.
While Heron is not available as an open source project yet, it’s widely expected that Twitter will take that step. The bad news for Storm users is that the company that originally developed it has moved on because it was difficult to scale and use (something many Storm users have complained about). The good news is that the Storm API will be carried forward in Heron, making it a plug-and-play replacement for existing Storm apps.
5. Pinot
LinkedIn this week announced that it’s open sourcing a pair of technologies that revolve around Kafka, the messaging system it created before giving it to the open source community. These include Pinot, a real-time analytics engine that sits atop Kafka.
LinkedIn has been using Pinot as the backend to store hundreds of billions of records and to power more than 25 analytic products, wrote LinkedIn Technical Lead Kishore Gopalakrishna in a blog post this week. If you’ve used LinkedIn features like “Who Viewed My Profile” or “Who Viewed My Posts,” then you are a Pinot user.
6. Burrow
LinkedIn also recently developed and released Burrow because it can be difficult to monitor Kafka data flows, in particular whether the consumer of a Kafka-based data flow is keeping up with the flow of messages, according to LinkedIn Engineer Todd Palino. Burrow helps by digging “through the maze of message offsets from both the brokers and consumers to present a concise, but complete, view of the state of each subscriber,” Palino writes in a blog post.
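The idea of comparing broker offsets with consumer offsets can be illustrated with a minimal Python sketch. This is not Burrow's code or API; the `PartitionOffsets` type, the `summarize` function, and the fixed `max_lag` threshold are all hypothetical names and simplifications chosen for illustration. The sketch only shows the basic arithmetic: lag on a partition is the latest offset on the broker minus the last offset the consumer has committed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionOffsets:
    """Offsets for one partition (hypothetical type, for illustration only)."""
    partition: int
    broker_end_offset: int  # latest offset written to the broker
    consumer_offset: int    # last offset committed by the consumer


def partition_lag(p: PartitionOffsets) -> int:
    """Messages the consumer still has to read on this partition."""
    return max(p.broker_end_offset - p.consumer_offset, 0)


def summarize(offsets, max_lag=1000):
    """Condense per-partition lag into one view of a subscriber's state.

    max_lag is an assumed, arbitrary threshold used only in this sketch.
    """
    lags = {p.partition: partition_lag(p) for p in offsets}
    behind = sorted(part for part, lag in lags.items() if lag > max_lag)
    return {
        "total_lag": sum(lags.values()),
        "lagging_partitions": behind,
        "status": "LAGGING" if behind else "OK",
    }


if __name__ == "__main__":
    sample = [
        PartitionOffsets(partition=0, broker_end_offset=1500, consumer_offset=1490),
        PartitionOffsets(partition=1, broker_end_offset=9000, consumer_offset=4000),
    ]
    print(summarize(sample))
```

A real monitor would also have to track how lag changes over time, since a consumer that is slightly behind but catching up is in a different state than one that has stalled.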
7. Aerosolve
Airbnb has disrupted the hospitality industry by creating a way to allow people to rent their houses and apartments to travelers. It’s not shy about the role that big data technology plays, and actively participates in open source.
In the past two weeks, Airbnb has released two new products developed by its team of “nerds,” including a machine learning package called Aerosolve. Aerosolve is the internal system that Airbnb uses for its “dynamic pricing” feature. If you’ve ever tried to book a place to stay during a popular event, such as Austin’s SXSW, then you’ve used Aerosolve.
8. Airflow
The second open source project released by Airbnb is a pipelining project called Airflow. During a session at Hadoop Summit this week, Airbnb engineer Maxime Beauchemin talked about how everybody who’s worked at Facebook loves its pipelining system. So Beauchemin built something similar at Airbnb. The software treats jobs as directed acyclic graphs (DAGs) and helps manage how they’re running across various systems.
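To make the DAG idea concrete, here is a small Python sketch that uses only the standard library's `graphlib` module (Python 3.9 or later), not Airflow's own API. The task names and the `run_order` helper are hypothetical. The point it illustrates: if each job lists the jobs it depends on, a topological sort yields an order in which every job runs only after its dependencies, and a cycle is rejected because no such order exists.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract_bookings": set(),
    "extract_listings": set(),
    "join_tables": {"extract_bookings", "extract_listings"},
    "train_model": {"join_tables"},
    "publish_report": {"join_tables"},
}


def run_order(dag):
    """Return one valid execution order for the tasks in dag.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    which would mean the graph is not a DAG.
    """
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    for task in run_order(pipeline):
        print("running", task)

    try:
        run_order({"a": {"b"}, "b": {"a"}})
    except CycleError:
        print("cycle detected: not a valid pipeline")
```

A full pipelining system layers scheduling, retries, and monitoring on top of this ordering; the sketch covers only the dependency resolution step.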
Open source is the heart of innovation in the big data space, and new projects are popping up all the time. What open source projects have caught your eye?