March 9, 2022

Code for Pulsar, NiFi Tie-Up Now Open Source

(Jurik Peter/Shutterstock)

The code to integrate Apache NiFi with Apache Pulsar is now open source, Cloudera and StreamNative announced today. The integration could be a boon for companies looking to simplify the development of real-time applications atop streaming data flows, and could provide another competitor to Apache Kafka and Confluent.

Apache NiFi is a software framework for creating real-time data flows between different systems using visual development techniques. The software was originally developed by the NSA, and many of the primary engineers for NiFi have worked at Cloudera since 2018, when it acquired Hortonworks (Hortonworks, in turn, bought Onyara, the primary developer of NiFi, back in 2015).

Apache Pulsar, meanwhile, is a distributed messaging and data streaming platform that competes with Apache Kafka and is backed by the commercial outfit StreamNative. The pub/sub system was originally developed at Yahoo, which released it as open source in 2016. Since then, it has been adopted by a number of large companies, including Tencent, Verizon Media, Comcast, and Overstock. Splunk also opted for Pulsar over Kafka to be the core of its Splunk Data Stream Processor (DSP), which it debuted in 2020.

Ostensibly, NiFi and Pulsar are both real-time streaming data systems, but they occupy different levels of the emerging stack. NiFi is more concerned with the practical aspects of automating the movement of large amounts of data (it was originally called NiagaraFiles, as a play on Niagara Falls). Pulsar provides the long-term storage of event data and exposes interfaces to other frameworks, like Apache Spark and Apache Flink, for the development of analytics and data applications atop streaming data.

By combining the two systems, customers can get a single place to manage real-time data for short-term and long-term use cases, Cloudera says.

“Apache NiFi and Pulsar’s capabilities complement one another inside modern streaming data architectures,” the company says in its announcement. “NiFi provides a dataflow solution that automates the flow of data between software systems. As such, it serves as a short-term buffer between data sources rather than a long-term repository of data.

Integrating NiFi and Pulsar will bring benefits to customers developing real-time applications (Image source: Cloudera)

“Conversely, Pulsar was designed to act as a long-term repository of event data and provides strong integration with popular stream processing frameworks such as Flink and Spark,” the company continues. “By combining these two technologies, you can create a powerful real-time data processing and analytics platform.”

The benefits stack up from both sides of the aisle. From the Pulsar point of view, the integration with NiFi brings more dataflow automation capabilities, including a large array of connectors as well as features like prioritization, back pressure, and edge intelligence, the company says.

NiFi users, meanwhile, gain the long-term retention of Pulsar, which can store petabytes of data in a reliable manner, as well as the Spark and Flink interfaces for more sophisticated application development.
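The split between a short-term buffer with back pressure and a long-term event store can be illustrated with a toy model. The Python sketch below is an assumption-laden illustration only: the `BoundedBuffer` and `RetainedLog` classes and the `run_demo` function are hypothetical names invented here, and nothing in it uses or mirrors the actual NiFi or Pulsar APIs.

```python
from collections import deque


class BoundedBuffer:
    """Toy stand-in for a short-term dataflow buffer (hypothetical, not NiFi)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = deque()

    def offer(self, item):
        # Back pressure: refuse the item when full so the producer must slow down.
        if len(self._items) >= self.capacity:
            return False
        self._items.append(item)
        return True

    def poll(self):
        # An item leaves the buffer as soon as it is forwarded downstream.
        return self._items.popleft() if self._items else None


class RetainedLog:
    """Toy stand-in for a long-term event store (hypothetical, not Pulsar)."""

    def __init__(self):
        self._events = []

    def append(self, event):
        # Events are kept after they are read; the return value is the offset.
        self._events.append(event)
        return len(self._events) - 1

    def read_from(self, offset):
        # Any reader can replay from any retained offset.
        return list(self._events[offset:])


def run_demo():
    buffer = BoundedBuffer(capacity=2)
    log = RetainedLog()

    # Three offers against a capacity of two: the third is pushed back.
    accepted = [buffer.offer(f"reading-{i}") for i in range(3)]

    # Drain the buffer into the retained log.
    item = buffer.poll()
    while item is not None:
        log.append(item)
        item = buffer.poll()

    # Two independent readers replay from different offsets.
    return accepted, log.read_from(0), log.read_from(1)


if __name__ == "__main__":
    print(run_demo())
```

In this model the buffer holds nothing once its contents are forwarded, while the log keeps every event so that later consumers (analytics jobs, for instance) can reread them from any point.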

“In short, NiFi’s extensive suite of connectors makes it easy to ‘get data in’ to your streaming platform, and Pulsar’s integration with Flink and Spark makes it easy to get real-time insights out,” Cloudera says. “Combining these technologies together creates a complete edge-to-cloud data streaming platform that can be used to provide real-time insights across multiple application domains.”

Various use cases stand to benefit from this integration, including ingesting and parsing log data for cybersecurity; analyzing large amounts of IoT and sensor data in the manufacturing or oil and gas industries; and real-time processing of ticker data to power algorithmic trading in financial services.

The code that integrates the two frameworks is being distributed by Cloudera in its Cloudera DataFlow Platform (CDF) offering, which is open source. Cloudera says the processors will be available starting with version 7.2.14 of CDF on the public cloud. Customers can also download the processors from the Maven Central repository if they want to use them on other NiFi clusters, the company says.

Datanami