January 15, 2020

Apache Flink Powers Cloudera’s New Streaming Analytics Product

Cloudera today launched Cloudera Streaming Analytics (CSA), a new subset of the Cloudera DataFlow (CDF) platform that is based on the Apache Flink stream processing engine. The new software runs on existing Hadoop clusters and will be positioned to process large amounts of IoT and event data.

Since the merger with Hortonworks one year ago, Cloudera has taken more of an interest in managing streaming data, in addition to data at rest. As part of that effort, it has positioned the former Hortonworks DataFlow product (now CDF) as one of its primary product families, next to the recently released Cloudera Data Platform (CDP), which combines the best of the Hortonworks and Cloudera Hadoop distributions.

Cloudera has supported Apache Spark and its popular Spark Streaming engine on its Hadoop distribution for some time. It also supports Apache Storm and Kafka Streams. However, some in the community, including Hadoop co-creator Doug Cutting, consider Apache Flink’s dataflow architecture superior in some ways to other approaches, notably Spark Streaming, which originally used micro-batch techniques instead of continuous processing.

Flink certainly impressed Cloudera enough to include it in CSA. The company is positioning CSA running atop Hadoop as an end-to-end platform for a range of streaming use cases, from telco network monitoring and fraud detection to clickstream analysis and content recommendations, and it’s counting on Flink to deliver the goods.

Cloudera says it chose Flink primarily because of its scalability, including its ability to process millions of data points and complex events per day in real time. Flink’s fault tolerance, data distribution, and communications capabilities, along with its ability to process data at rest, also worked in its favor, according to a blog post written by Cloudera’s head of product marketing, Dinesh Chandrasekhar.

“While Cloudera offers our customers several options for stream processing engines – Storm, Spark Structured Streaming, and Kafka Streams, the addition of Flink to CDF is very significant,” Chandrasekhar wrote. “Storm has been slowly losing favor in the market and in the open-source community and users are looking for a better option to move into. Apache Flink is that option.”

The other frameworks have their own relevant use cases around stream processing and analytics, Chandrasekhar says. “However, Apache Flink has a streaming-first (over batch) approach to processing high-volume streams of data at high-scale, while supporting key features such as stateful streaming, exactly-once delivery, built-in fault tolerance/resilience, and advanced windowing techniques,” he says. “This makes it the default choice for a wider range of stream processing use cases.”
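To make "stateful streaming" and "windowing" concrete, here is a minimal sketch in plain Java of a keyed tumbling-window counter. It is not the Flink API and does not model fault tolerance or exactly-once delivery; the class name, the Event type, and the sensor keys are hypothetical and chosen only for illustration. The point is that each event is handled as it arrives and updates a small piece of per-key, per-window state, rather than waiting for a batch to fill.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; this is NOT the Flink API.
class TumblingWindowSketch {

    // Hypothetical event type: a key (such as a device ID) plus an event timestamp.
    static final class Event {
        final String key;
        final long timestampMs;

        Event(String key, long timestampMs) {
            this.key = key;
            this.timestampMs = timestampMs;
        }
    }

    // Per-key, per-window counters: the "state" a stateful operator keeps between events.
    private final Map<String, Map<Long, Integer>> state = new TreeMap<>();
    private final long windowSizeMs;

    TumblingWindowSketch(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    // Called once per event as it arrives; no batching step is involved.
    void onEvent(Event e) {
        // Align the timestamp to the start of its fixed-size (tumbling) window.
        long windowStart = e.timestampMs - Math.floorMod(e.timestampMs, windowSizeMs);
        state.computeIfAbsent(e.key, k -> new TreeMap<>())
             .merge(windowStart, 1, Integer::sum);
    }

    // Returns the count for one key and window start, or 0 if nothing was seen.
    int count(String key, long windowStart) {
        Map<Long, Integer> perWindow = state.get(key);
        return perWindow == null ? 0 : perWindow.getOrDefault(windowStart, 0);
    }

    public static void main(String[] args) {
        TumblingWindowSketch op = new TumblingWindowSketch(1000);
        op.onEvent(new Event("sensor-a", 100));
        op.onEvent(new Event("sensor-a", 900));
        op.onEvent(new Event("sensor-a", 1500));
        op.onEvent(new Event("sensor-b", 1200));
        System.out.println("sensor-a, window 0: " + op.count("sensor-a", 0));
        System.out.println("sensor-a, window 1000: " + op.count("sensor-a", 1000));
        System.out.println("sensor-b, window 1000: " + op.count("sensor-b", 1000));
    }
}
```

A production engine would add what this sketch leaves out: checkpointing the state so it survives failures, and deciding when a window is complete and its result can be emitted.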

CSA is based on Flink version 1.9.1 and will install on CDP Data Center clusters running YARN. It will be able to read data from HDFS and Apache Kafka topics, and write to HDFS, Kafka, and HBase. CSA pipelines will be defined using the Java DataStream and ProcessFunction APIs, and will use Cloudera Schema Registry to manage the serialization and deserialization of events, according to Cloudera. It will support TLS, Kerberos, and exactly-once processing semantics.

“Cloudera Streaming Analytics unlocks net new business value for businesses requiring real-time insights from fast-paced customer experiences,” writes Chandrasekhar. “Apache Flink provides stateful analytics at low latency and high scale to address such needs of today’s businesses.”

Apache Flink emerged from the Stratosphere research project at the Technical University of Berlin in 2009, and became a top-level project at the Apache Software Foundation in 2015. The in-memory framework was supported atop YARN from the beginning, but wasn’t restricted to running on Hadoop, which gave it certain advantages.

Development of Flink was spearheaded by the German company data Artisans, which launched a commercial version of Flink called the dA Platform in 2016. Before being acquired by Chinese Web giant Alibaba one year ago, the company added support for ACID transactions on streaming data with its Streaming Ledger offering. It has since changed its name to Ververica.

Over the years, Flink has attracted a number of high-profile users, including Disney, Netflix, and Uber, all of which have deployed Flink in production.

Related Items:

Alibaba Acquires Apache Flink Backer data Artisans

Flink Delivers ACID Transactions on Streaming Data

Apache Flink Takes Its Own Route to Distributed Data Processing