May 14, 2024

StreamNative Bolsters Pulsar Data Streaming Platform with Ursa


StreamNative, the commercial venture behind Apache Pulsar, today rolled out Ursa, a new data streaming engine for its hosted Pulsar environment that is API-compatible with its competitor, Apache Kafka. Ursa also supports data lakehouses via multiple open table formats, giving customers more options for analyzing big streaming data.

Apache Pulsar is a distributed publish-subscribe (pub-sub) platform originally developed by Yahoo to overcome the challenges of handling large-scale messaging and streaming data. It was first released under an Apache license in 2016, five years after LinkedIn donated Apache Kafka to the Apache Software Foundation.

Pulsar was designed to handle the biggest data streaming workloads and to avoid making the architectural decisions that have held Kafka back, according to Sijie Guo, the co-founder and CEO of StreamNative and PMC member of Pulsar at the Apache Software Foundation.

For instance, Pulsar separated compute and storage from the beginning, enabling the two components to scale independently atop Kubernetes in cloud-native fashion. It was also multi-tenant, giving it another advantage over Kafka, Guo says. “We have a fundamentally different approach,” Guo tells Datanami.

The two systems also had some similarities. For instance, both Kafka and Pulsar were originally developed to run in a distributed manner using ZooKeeper, a dependency that many distributed systems have struggled to shed.

While Pulsar holds technical advantages over Kafka, those advantages have not translated into increased market share. There are large open source Pulsar and StreamNative cloud implementations, but Kafka has essentially become the de facto industry standard for data streaming, Guo concedes.

As the old saying goes, “if you can’t beat them, join them.” Just as the entire object storage industry has adopted a competing standard in Amazon S3, StreamNative is embracing a competing protocol as a means to gain market share and simplify adoption for its customers.

That’s the idea behind Ursa, a new implementation of the Apache Pulsar streaming data engine that is only available from StreamNative. Ursa, which has been in development for more than a year, builds on the work StreamNative has done in running Kafka and Pulsar environments together by implementing full support for the Kafka protocol.

“[Pulsar] is a very powerful protocol that’s increasingly adopted by enterprises and unicorns, but it requires an application rewrite” for existing streaming applications based on Kafka to use it, Guo says. “So instead of forcing people to use Pulsar protocol, we want to expand the support of Kafka protocol and transition our platform from single protocol to multi-protocol, which provides a greater flexibility and be able to help Kafka users to address their challenges.”
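To make the “no application rewrite” point concrete, here is a minimal sketch of what protocol compatibility usually means for a client: the application keeps its Kafka client properties and changes only the endpoint it connects to. The host names and the `producer_config` helper are hypothetical and are not taken from StreamNative's documentation; `bootstrap.servers`, `client.id`, and `acks` are standard Kafka client properties.

```python
def producer_config(bootstrap_servers: str) -> dict[str, str]:
    """Build Kafka client properties; only the endpoint varies."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "client.id": "orders-service",  # hypothetical application name
        "acks": "all",
    }


# Hypothetical endpoints: an existing Kafka cluster and a Kafka-compatible service.
kafka_cfg = producer_config("kafka.example.com:9092")
compatible_cfg = producer_config("streaming.example.com:9092")

# The set of properties that differ between the two deployments.
changed = {key for key in kafka_cfg if kafka_cfg[key] != compatible_cfg[key]}
print(changed)  # {'bootstrap.servers'}
```

A real migration would likely also involve authentication and TLS settings, which this sketch leaves out.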

The second new capability that Ursa brings is the ability to store streaming data directly in a lakehouse using an open table format, with the user’s choice of Apache Iceberg, Databricks’ Delta Lake, or Apache Hudi. This feature will give customers more options for analyzing their streaming data with the SQL engine of their choice, while reducing the complexity normally involved in streaming data into the lakehouse.


“We basically simplify this by storing data directly as lakehouse table,” Guo says. “So those businesses generating data in business application can be directly consumed by the analytical process, like Databricks, Snowflake, and Redshift. They can read the data directly there, and it actually enable the data sharing across different teams within a large application.”

Ursa is all about giving customers more options and more flexibility. They can choose either the Pulsar or Kafka protocol for ingesting data into the Pulsar platform. Once the data is ingested, they can process it on the fly if the application requires the very lowest latencies, as many online gaming, banking, and telecommunication use cases do. Or, if latency is not as big an issue or they need historical data for perspective, they can wait for the data to be routed into the lakehouse, where they can bring existing SQL engines to bear on it.

StreamNative is giving customers the option to use existing stream data processing techniques or use the new lakehouse capability on a per-topic basis, says StreamNative’s Head of Marketing, Amy Krishnamohan.

“In streaming, there are people who care about latency a lot. For those people, they cannot afford for the data to go all the way to the data lake. It takes longer,” she says. “But there are people who’re using it for analytic purposes….and 100 milliseconds is totally fine. We’re giving them an option per topic.”

Apache Kafka systems also give customers options, Krishnamohan points out. Confluent, the commercial venture behind the open source Apache Kafka project and the leading Kafka hosting company, has been embracing Apache Flink for stream processing. It’s also embracing lakehouse storage via open table formats, which it unveiled with its recent Tableflow announcement.

However, there’s one big difference, she says: Whereas StreamNative lets users implement different querying methods on a topic-by-topic basis, Kafka requires uniformity in the querying method for all topics within the Kafka cluster.

“Kafka is per-cluster basis, so all the topics in that single cluster, you have to compute everything the single cluster way, whereas with Pulsar you can divide up the topic and namespace,” she says.
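The difference Krishnamohan describes can be pictured with a small sketch. It is an illustrative model only, not StreamNative's or Kafka's actual configuration API: the class names, the two modes, and the topic names are all hypothetical. It contrasts a policy fixed for the whole cluster with one that can be overridden topic by topic.

```python
from dataclasses import dataclass, field
from enum import Enum


class StorageMode(Enum):
    LOW_LATENCY = "low_latency"  # consumed straight from the streaming path
    LAKEHOUSE = "lakehouse"      # written out as an open-format table


@dataclass
class PerClusterPolicy:
    """One mode applies to every topic in the cluster."""
    mode: StorageMode

    def mode_for(self, topic: str) -> StorageMode:
        return self.mode


@dataclass
class PerTopicPolicy:
    """Each topic may override the default mode."""
    default: StorageMode
    overrides: dict[str, StorageMode] = field(default_factory=dict)

    def mode_for(self, topic: str) -> StorageMode:
        return self.overrides.get(topic, self.default)


cluster_wide = PerClusterPolicy(StorageMode.LOW_LATENCY)
per_topic = PerTopicPolicy(
    default=StorageMode.LOW_LATENCY,
    overrides={"clickstream-analytics": StorageMode.LAKEHOUSE},
)

for topic in ("payments", "clickstream-analytics"):
    print(topic, cluster_wide.mode_for(topic).value, per_topic.mode_for(topic).value)
```

Under the per-topic policy, the latency-sensitive `payments` topic stays on the streaming path while the analytics topic lands in the lakehouse; under the cluster-wide policy, both topics get the same treatment.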

The third pillar of Ursa is the elimination of ZooKeeper and BookKeeper. The company has moved away from the longstanding plumbing for distributed systems and adopted a new approach based on Kubernetes. The Kafka approach to eliminating ZooKeeper, based on the KRaft consensus protocol, is just a re-implementation of ZooKeeper, Guo says.

The folks at StreamNative don’t hold a grudge against Confluent or Kafka. But they are convinced that, besides the Kafka protocol, Kafka clusters just have too much baggage for today’s demanding streaming data businesses.

“Kafka is definitely a standard from a protocol perspective. Kafka API has no problem there because it’s well suited for the use cases like logging, ingestion, building data pipelines,” Guo says. “It’s just the implementation of Kafka itself. It’s not designed for the current environment and that’s why we enhanced Pulsar and developed Ursa.”

Guo is confident that the industry is converging on a set of technologies for particular streaming data use cases, and that those technologies involve the Kafka protocol and Pulsar. He says he wouldn’t be surprised if Confluent even adopts Pulsar at some point, giving it the capability to support multiple protocols and to store data streams as lakehouse tables.

“Every data streaming platform eventually will become Kafka API-compatible. But at the end of day, every data streaming vendor is not truly Kafka,” he continues. “Even Confluent is not Kafka anymore. They are kind of Kafka-compatible. And I believe the direction for data streaming engine going down is having the ability to support multiple protocol, and that is the approach that we are taking here.”


Datanami