Real-time stream processing isn’t a new concept, but it’s experiencing renewed interest from organizations tasked with finding ways to quickly process large volumes of streaming data. Luckily for you, there are a handful of open source frameworks that could give your developers a big head start in building your own custom stream-processing application.
When coupled with an underlying real-time message bus such as Apache Kafka, a stream processing framework can dramatically simplify the development of streaming applications, or what some are calling “continuous applications.” You can pick and choose from numerous pre-built functions to build a stream processing application that’s fit for purpose.
But not all frameworks are equated equal, and some are best used for certain use cases. Since the vast majority of stream processing applications are custom-built affairs, it’s important to select a framework that matches your specific needs.
Here we introduce five of the most popular open source stream processing frameworks, plus NiFi.
Apache Storm is a distributed stream processing framework that was created by Nathan Marz about a decade ago to provide a more elegant way to process large amounts of incoming data. Storm does “for real-time processing what Hadoop did for batch processing,” according to the Apache Storm webpage.
Storm development is based on the concept of a directed acyclic graph (DAG), and the application flow is designed as a topology. Developers are given a series of “sprouts” (to connect to data sources and inject the data into a stream) and “bolts” (which process incoming data and emit new data) that can be used to process data in certain ways.
The framework can be used to develop many different types of applications, including real-time analytics, online machine learning, continuous computation, and extract, transform, and load (ETL) workloads. Storm also supports use for exactly once semantics, which is important for certain applications.
Advantage of Storm include speed and flexibility. It’s been clocked processing more than 1 million tuples per second per node, according to the Storm webpage, which also states: “A Storm topology consumes streams of data and processes those streams in arbitrarily complex ways.”
Marz created Storm in Clojure and Java while working at BackType, which was acquired by Twitter. Twitter released Storm as an open source project back in 2011 and was developed Storm became a Top-Level project at the Apache Software Foundation in 2014 and is included in all major Hadoop distributions.
While it was one of the first of a new-generation of distributed stream processing frameworks, Apache Storm is still actively developed. In fact, the community today announced the release of Storm 2.0.0.
Apache Heron is a real-time, distributed, fault-tolerant stream processing engine that was also created at BackType and Twitter. The software, which was released as open source in 2016, is the successor to Apache Storm, and is API compatible with Storm.
Like Storm, Heron applications are based on a DAG, where sprouts and bolts are assembled in a topology for processing incoming data. However, Heron has several advantages over Storm, including a new scheduler that allows the framework to run on multi-tenant clusters (currently only Mesos). There is also a backpressure mechanism that dynamically adjusts the rate of data flow.
The core advantage of Heron holds over Storm is scalability. According to a Twitter blog post by Karthik Ramasamy (now the CTO of Streamlio), Twitter’s production Heron system delivered throughput that’s 10–14x higher than what its production Storm system could handle.
While Heron offers advantages over Storm, it hasn’t completely displaced Storm. Heron, which is incubating at the ASF, is currently being updated to support Apache YARN and to support Mesosphere DC/OS and Kubernetes. The software was developed in Java and Scala.
Apache Samza is a distributed stream processing framework that emerged from LinkedIn in 2013 to run atop YARN and process data fed via the Apache Kafka message bus (Kafka was also developed at LinkedIn, as we covered in the first story in this series).
LinkedIn developed Samza (in Java and Scala) to address a gap in its processing capabilities – namely, it splits the difference between the nearly instantaneous responses that users get via Remote Procedure Call (RPC) methods and the very long waits that are inherent with getting answers from Hadoop.
Streams are the input and the output for Samza jobs. But according to the Apache Samza project website, streams are more than just a simple message exchange mechanism. “A stream in Samza is a partitioned, ordered-per-partition, replayable, multi-subscriber, lossless sequence of messages,” the group says.
Samza is architecturally similar in some ways to Apache Storm. Storm’s sprouts are similar to stream consumers in Samza, bolts are similar to tasks in Samza, and Storm’s tuples are like messages. However, the topology is not necessarily based on a DAG in Samza. Exactly once semantics are planned for a future release.
Samza became a Top-Level Apache project in 2014, and continues to be actively developed. Last year, LinkedIn announced the release of Samza 1.0, which introduces a new high-level API with pre-built operators for mapping, filtering, joining, and windowing functions. LinkedIn relies on Samza to power 3,000 applications, it stated.
Apache Spark is a popular data processing framework that replaced MapReduce as the core engine inside of Apache Hadoop. The open source project includes libraries for a variety of big data use cases, including building ETL pipelines, machine learning, SQL processing, graph analytics, and (yes) stream processing.
Like Spark itself, Spark Streaming implements distributed and fault-tolerant method for processing large amounts of data – in this case, upon live streams of data (often via Kafka or other message buses). Spark supports exactly once semantics and can be used for stateful applications.
Spark developers can create streaming applications in utilizing the framework’s DataFrames or Datasets APIs, which are available for R, Python, Scala, and Java. With the launch of Spark 2.0 in 2016, Spark was bolstered with the Structured Streaming concept, which allowed developed to create continuous applications using SQL. Then, with the launch of Spark 2.3 in 2018, the project brought support for true real-time processing in Spark Streaming, as opposed to the “micro-batch” approach that it previously used.
Surveys show Spark Streaming is one of the most heavily used libraries in Apache Spark. Databricks, the commercial venture behind Apache Spark, says the concept of “unified analytics” within the Spark umbrella gives Spark users a productivity boost compared to users who continually change frameworks.
Apache Flink is one of the newest and most promising distributed stream processing frameworks to emerge on the big data scene in recent years. Flink was written in Java and Scala, and is designed to execute arbitrary dataflow programs in a data-parallel manner.
In Flink, all processing actions – even batch-oriented ones – are expressed as real-time applications. A Flink dataflow starts with a data source and ends with a sink, and support an arbitrary number of transformations on the data.
Flink exposes several APIs, including the DataStream API for streaming data and DataSet API for data sets. It also offers the Table API, which exposes SQL-like functionality. Hadoop creator Doug Cutting once told Datanami that “Flink is architected probably a little better than Spark.” Several large companies, including Netflix, have adopted Flink over other stream processing frameworks in recent years.
Flink emerged from a German university project and became an Apache Incubator project in 2014. The commercial vendor behind Flink, data Artisans, was recently acquired by Chinese Internet giant Alibaba.
Apache NiFi is an open source, Java-based software project that’s designed to automate the flow of data between different and disparate systems. The software is based on the NiagaraFiles software developed by the National Security Agency, and was released as an open source project in 2014.
While not a stream data processing framework in the classic sense, NiFi can be used to build real-time data processing applications. Instead of coding with a high level API, as with other frameworks, the data flows are configured from a GUI with NiFi, and then executed in parallel via a JVM component that’s deployed to a Web server.
NiFi is based on a flow-based programming model, and utilizes the concept of scalable, directed graphs of data routing, transformation, and system mediation logic. As an event processing platform, NiFi can help users collect, curate, analyze, and act on data in real-time.
NiFi is backed by Cloudera, which acquired Hortonworks, which in turn acquired Onyara, the company that employed the original developers of NiFi. The software is developed today through hthe Apache NiFi community, which also manages subprojects, such as Minifi, which is utilized for edge and Internet of Things (IoT) deployments.
NiFi features prominently today in Cloudera DataFlow (formerly Hortonworks DataFlow), a full platform for managing and analyzing data in montuno. In addition to NiFi and Minifi, Cloudera Dataflow utilizes Kafka, Storm, and Spark components.
Assessing Your Options for Real-Time Message Buses
Real-Time Streaming for ML Usage Jumps 5X, Study Says
Fueled by Kafka, Stream Processing Poised for Growth