June 12, 2015

8 New Big Data Projects To Watch

Alex Woodie

The big data community has a secret weapon when it comes to innovation: open source. The granddaddy of big data, Apache Hadoop, was born in open source, and its growth will come from continued innovation done by the community in the open. Here are eight open source projects generating buzz in the community right now.

1. Apache Zeppelin

No other big data project at the moment is as popular as Apache Spark, the in-memory analytics framework developed at AMPLab. But it’s not always easy to work with Spark apps as an end-user. That’s where Apache Zeppelin comes in.

Zeppelin essentially provides a Web front-end for Spark. The mighty Zep takes a notebook-based approach, letting users discover, explore, and visualize data from Spark apps interactively. The software, which is modeled on the IPython notebook, supports Spark and other frameworks, such as Flink, Tajo, and Ignite.

Zeppelin was developed by NFLabs, which is a South Korean big data software company (not a football researcher). Zeppelin is currently incubating as a project at the Apache Software Foundation. Hortonworks is including a technical preview of Zeppelin in its upcoming HDP 2.3 release, and demonstrated how the software could be used in a trucking app during a keynote at this week’s Hadoop Summit in San Jose, California.

During an interview with Datanami, Hortonworks co-founder and architect Arun Murthy identified Zeppelin as one of the most promising Hadoop-related projects he’s keeping an eye on, along with Apache Flink and Project Apex (see below).

2. Apache Flink

Momentum is also building behind this distributed in-memory data processing framework, which can replace MapReduce in a Hadoop cluster and fuses batch and streaming analytics.

The strength of Apache Flink lies in the speed of iteration. The faster data scientists can finish a job, the quicker they can move on to the next problem. The software, which features Java and Scala APIs and runs atop YARN, could be just the ticket for fusing streaming analytics with historical analytics.

Flink was developed by the German company data Artisans and became a top-level project earlier this year. No Hadoop distributors are currently shipping Flink as a fully supported part of their distributions, but that will likely change as more people begin using it.

3. Project Apex

Last week, DataTorrent released the core of its real-time streaming product, dubbed RTS, into the open source realm as Project Apex. The YARN-compatible software is designed to replace Apache Storm and Apache Spark Streaming in the Hadoop stack.

Apex runs in a fault-tolerant manner and comes with more than 70 pre-built operators that Java developers can assemble to build their real-time workflows. The software is often deployed alongside Apache Kafka, which provides the real-time messaging bus to serve data. DataTorrent is working with Hortonworks on Project Koya, an effort to get Kafka running directly on Hadoop via Slider.

DataTorrent’s John Fanelli says Apex holds an 18-month lead over Storm and Spark Streaming. Making the software open will help to ensure wider adoption and continued innovation, he tells Datanami.

4. Heron

Twitter last week unveiled Heron as the successor to Apache Storm for its own internal streaming analytic system. While Storm helped Twitter analyze huge amounts of data for years, and Twitter open sourced the software to the world in 2011, it’s evident that at this point Storm is petering out.

Twitter’s main goals with Heron were to increase performance predictability, improve developer productivity, and ease manageability, Twitter Engineering Manager Karthik Ramasamy wrote in a blog piece.

While Heron is not available as an open source project yet, it’s widely expected that Twitter will take that step. The bad news for Storm users is that the company that originally developed Storm has moved on, having found it difficult to scale and use (something many Storm users have complained about). The good news is that the Storm API will be carried forward in Heron, making it a plug-and-play replacement for existing Storm apps.

5. Pinot

LinkedIn this week announced that it’s open sourcing a pair of technologies that revolve around Kafka, the messaging system it created before giving it to the open source community. These include Pinot, a real-time analytics engine that sits atop Kafka.

LinkedIn has been using Pinot as the backend to store hundreds of billions of records and to power more than 25 analytic products, wrote LinkedIn Technical Lead Kishore Gopalakrishna in a blog post this week. If you’ve used LinkedIn features like “Who Viewed My Profile” or “Who Viewed My Posts,” then you are a Pinot user.

6. Burrow

LinkedIn also recently developed and released Burrow, a tool for monitoring Kafka data flows. It can be difficult to tell whether the receiver of a Kafka-based data flow is keeping up with the flow of messages, according to LinkedIn Engineer Todd Palino. Burrow helps by digging “through the maze of message offsets from both the brokers and consumers to present a concise, but complete, view of the state of each subscriber,” Palino writes in a blog post.
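To make the idea of comparing broker and consumer offsets concrete, here is a minimal sketch in Python. It is not Burrow's code or API: the `PartitionOffsets` type, its field names, and the `summarize_lag` helper are hypothetical, and the sketch assumes that lag is simply the gap between the newest offset on the broker and the last offset the consumer has committed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionOffsets:
    """Hypothetical snapshot of one partition's offsets (not a Burrow type)."""
    topic: str
    partition: int
    broker_end_offset: int          # newest offset written on the broker
    consumer_committed_offset: int  # last offset the consumer committed


def partition_lag(p: PartitionOffsets) -> int:
    """Messages the consumer still has to read; never negative."""
    return max(p.broker_end_offset - p.consumer_committed_offset, 0)


def summarize_lag(partitions):
    """Return total lag plus a per-partition breakdown for one consumer."""
    per_partition = {
        (p.topic, p.partition): partition_lag(p) for p in partitions
    }
    return {
        "total_lag": sum(per_partition.values()),
        "per_partition": per_partition,
    }


if __name__ == "__main__":
    snapshot = [
        PartitionOffsets("page-views", 0, 1_000, 990),
        PartitionOffsets("page-views", 1, 2_500, 2_500),
    ]
    print(summarize_lag(snapshot))
```

A single snapshot like this only shows how far behind a consumer is at one moment. Judging whether it is keeping up would require comparing snapshots over time, which this sketch leaves out.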

7. Aerosolve

Airbnb has disrupted the hospitality industry by creating a way to allow people to rent their houses and apartments to travelers. It’s not shy about the role that big data technology plays, and actively participates in open source.

In the past two weeks, Airbnb has released two new products developed by its team of “nerds,” including a machine learning package called Aerosolve. Aerosolve is the internal system that Airbnb uses for its “dynamic pricing” feature. If you’ve ever tried to book a place to stay during a popular event, such as Austin’s SXSW, then you’ve used Aerosolve.

8. Airflow

The second open source project released by Airbnb is a pipelining project called Airflow. During a session at Hadoop Summit this week, Airbnb engineer Maxime Beauchemin talked about how everybody who’s worked at Facebook loves its pipelining system, so he built something similar at Airbnb. Airflow treats jobs as directed acyclic graphs (DAGs) and helps manage how they run across various systems.
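For readers unfamiliar with the DAG idea, the sketch below shows in plain Python how treating a job as a directed acyclic graph yields a valid run order. It deliberately does not use Airflow itself: the task names and the `run_order` helper are hypothetical, and the ordering comes from the standard library's `graphlib` module (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter


def run_order(dependencies):
    """Return task names in an order that respects every dependency.

    `dependencies` maps each task to the set of tasks it must wait for.
    Raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical pipeline: each task lists the tasks it depends on.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

if __name__ == "__main__":
    print(run_order(pipeline))

    # A cycle means there is no valid order, so the graph is rejected.
    try:
        run_order({"a": {"b"}, "b": {"a"}})
    except CycleError:
        print("cycle detected: not a DAG")
```

The acyclic requirement is what makes scheduling possible: as long as no task depends on itself, directly or indirectly, there is always at least one order in which every task runs after the tasks it needs.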

Open source is the heart of innovation in the big data space, and new projects are popping up all the time. What open source projects have caught your eye? Drop us a line at [email protected].

Applications: Predictive Analytics

Technologies: Frameworks, Middleware

Sectors: Financial Services, Healthcare, Retail

Vendors: Airbnb, data Artisans, DataTorrent, LinkedIn, NFLabs, Twitter

Tags: big data, Hadoop, open source
