Yahoo’s New Pulsar: A Kafka Competitor?
Yahoo today announced that it’s open sourcing Pulsar, a new distributed “publish and subscribe” messaging system designed to be highly scalable while maintaining low latency. The bus already backs some of Yahoo’s key apps, and now the Web giant is seeking the help of the open source community to take Pulsar to the next level.
In a post to the Yahoo Engineering blog, Yahoo developers Joe Francis and Matteo Merli explained the application requirements that spurred the creation of the new “pub-sub” messaging system that would become Pulsar.
“These applications provide real-time services, and need publish-latencies of 5ms on average and no more than 15ms at the 99th percentile,” they write. “At Internet scale, these applications require a messaging system with ordering, strong durability, and delivery guarantees.” The messages must also be committed to multiple disks or nodes in order to get to the 99.999% guaranteed durability level, they add.
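To make those latency targets concrete, here is a minimal Python sketch of how a set of measured publish latencies could be checked against a 5ms average and a 15ms 99th percentile. The function names and the nearest-rank percentile method are assumptions made for illustration; the Yahoo post does not say how its percentiles are computed.

```python
def latency_summary(samples_ms):
    """Return (mean, p99) for a list of publish latencies in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    n = len(ordered)
    mean = sum(ordered) / n
    # Nearest-rank 99th percentile: rank = ceil(0.99 * n), done in integer math.
    rank = (99 * n + 99) // 100
    p99 = ordered[rank - 1]
    return mean, p99


def meets_target(samples_ms, mean_target_ms=5.0, p99_target_ms=15.0):
    """True when both the average and the 99th percentile are within target."""
    mean, p99 = latency_summary(samples_ms)
    return mean <= mean_target_ms and p99 <= p99_target_ms
```

A single slow publish in a hundred still passes this check, but two at 20ms would push the 99th percentile over the limit even though the average barely moves.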
“At the time we started, we could not find any existing open-source messaging solution that could provide the scale, performance, and features Yahoo required to provide messaging as a hosted service, supporting a million topics,” Francis and Merli write. “So we set out to build Pulsar as a general messaging solution, that also addresses these specific requirements.”
Yahoo designed Pulsar to scale horizontally on commodity hardware, and to provide messaging as a service to multiple applications. The system can scale to handle millions of independent topics and millions of messages published per second, according to Pulsar’s GitHub page.
Developers and administrators interact with Pulsar through a collection of APIs. The software also includes a client library that encapsulates the messaging protocol and handles “complex” functions like service discovery and establishing and recovering connections.
A Pulsar cluster is composed of a set of brokers, BookKeepers (or bookies), and ZooKeeper for coordination and configuration management. A Pulsar instance typically consists of multiple physical clusters that are geographically separated from one another, Yahoo says.
Pulsar uses Apache BookKeeper (contributed to open source by Yahoo in 2011) as its durable storage mechanism. “With Bookkeeper, applications can create many independent logs, called ledgers,” Pulsar’s project page on GitHub says. “A ledger is an append-only data structure with a single writer that is assigned to multiple storage nodes (or bookies) and whose entries are replicated to multiple of these nodes.”
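The ledger idea can be illustrated with a toy in-memory model in Python. This is a sketch only and not the BookKeeper API: the `Bookie` and `Ledger` classes, the `replicas` parameter, and the round-robin placement of copies are all hypothetical choices made for the example.

```python
class Bookie:
    """Hypothetical storage node; a real bookie persists entries to disk."""

    def __init__(self, name):
        self.name = name
        self.entries = {}  # (ledger_id, entry_id) -> payload

    def store(self, ledger_id, entry_id, payload):
        self.entries[(ledger_id, entry_id)] = payload


class Ledger:
    """Append-only log with one writer, each entry copied to several bookies."""

    def __init__(self, ledger_id, bookies, replicas):
        if not 1 <= replicas <= len(bookies):
            raise ValueError("replicas must be between 1 and the bookie count")
        self.ledger_id = ledger_id
        self.bookies = list(bookies)
        self.replicas = replicas
        self.next_entry_id = 0

    def append(self, payload):
        """Write the next entry to `replicas` bookies and return its id."""
        entry_id = self.next_entry_id
        count = len(self.bookies)
        # Assumed placement: consecutive bookies, rotating with the entry id.
        for offset in range(self.replicas):
            bookie = self.bookies[(entry_id + offset) % count]
            bookie.store(self.ledger_id, entry_id, payload)
        self.next_entry_id += 1
        return entry_id

    def read(self, entry_id):
        """Return the entry from the first bookie that holds a copy."""
        for bookie in self.bookies:
            key = (self.ledger_id, entry_id)
            if key in bookie.entries:
                return bookie.entries[key]
        raise KeyError(entry_id)
```

Because entries are only ever appended and each one lands on more than one node, losing a single bookie in this model does not lose any entry as long as `replicas` is at least two.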
Pulsar uses brokers to serve topics. Each topic is assigned to a broker, and an individual broker can serve thousands of topics, Yahoo says. “The broker accepts messages from writers, commits them to a durable store, and dispatches them to readers,” Yahoo says.
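The accept, commit, dispatch sequence Yahoo describes can be sketched in a few lines of Python. This is an illustrative toy rather than Pulsar's broker: the `Broker` class, its method names, and the use of an in-memory list as a stand-in for the durable store are assumptions.

```python
from collections import defaultdict


class Broker:
    """Toy broker: commit each message before dispatching it to readers."""

    def __init__(self):
        # topic -> committed messages (stand-in for a durable store)
        self.store = defaultdict(list)
        # topic -> reader callbacks taking (position, message)
        self.readers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.readers[topic].append(callback)

    def publish(self, topic, message):
        log = self.store[topic]
        log.append(message)  # 1. commit to the store
        position = len(log) - 1
        for callback in self.readers[topic]:
            callback(position, message)  # 2. dispatch to each reader
        return position


received = []
broker = Broker()
broker.subscribe("sports", lambda pos, msg: received.append((pos, msg)))
broker.publish("sports", "kickoff")
broker.publish("sports", "goal")
```

Committing before dispatching means a reader never sees a message that the store does not already hold, which is the ordering the quoted description implies.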
An instance of Apache ZooKeeper keeps all the other pieces of Pulsar working together. Yahoo contributed ZooKeeper to the Apache Software Foundation in 2008, and since then the software has become a key component of Apache Hadoop and other big data frameworks.
It appears the use of BookKeeper is key to Pulsar’s high level of durability, and the capability to scale elements of the messaging bus independently. It also offers clues as to why Yahoo developed Pulsar in the first place, and didn’t rely on other open source messaging systems, such as Apache Kafka.
“By using separate physical disks (one for journal and another for general storage), bookies are able to isolate the effects of read operations from impacting the latency of ongoing write operations, and vice-versa,” the Yahoo developers write on their blog. “Since read and write paths are decoupled, spikes in reads – which commonly occur when readers drain backlog to catch up – do not impact publish latencies in Pulsar. This sets Pulsar apart from other commonly-used messaging systems.”
While Kafka was available when Yahoo started developing Pulsar, the technology didn’t offer some of the features that Yahoo’s engineering team required, Yahoo tells Datanami.
Specifically, features like offset (cursor) management, geo-replication, multi-tenancy, and performance under message backlog conditions were not available in Kafka then, and some still aren’t available now, a Yahoo spokesperson says.
Yahoo’s engineering team deployed its first Pulsar instance in the spring of 2015, and use of it has grown quickly since then. Today Pulsar backs Yahoo applications like Mail, Finance, Sports, Gemini Ads, and Sherpa, which is Yahoo’s distributed key-value service. All told, Pulsar publishes more than 100 billion messages per day across 1.4 million topics with an average latency of less than 5 ms.
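For a sense of scale, those daily figures can be converted into per-second and per-topic rates. The short Python calculation below uses only the numbers quoted above; since Yahoo says "more than" 100 billion, the results are lower bounds, and the even spread across topics is an assumption that real traffic will not follow.

```python
MESSAGES_PER_DAY = 100_000_000_000  # "more than 100 billion messages per day"
TOPICS = 1_400_000                  # "1.4 million topics"
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400

# Average publish rate across the whole deployment.
messages_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY

# Average daily volume per topic, assuming an even spread.
messages_per_topic_per_day = MESSAGES_PER_DAY / TOPICS

print(f"{messages_per_second:,.0f} messages/second")
print(f"{messages_per_topic_per_day:,.0f} messages/topic/day")
```

That works out to roughly 1.16 million messages per second on average, or about 71,000 messages per topic per day.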
By making Pulsar available under an Apache 2.0 license, Yahoo hopes to spur development of the messaging bus. Specific areas the company is looking to improve include decreasing the time it takes to migrate topics among brokers from 10 seconds to less than one second, improving the 99.9th-percentile publish latency to 5ms, and providing additional language bindings for Pulsar.
Yahoo’s Pulsar project is not to be confused with the real-time analytics platform named Pulsar that came out of eBay. You can read more about the eBay Software Foundation’s product at gopulsar.io.