Updated January 2021
Apache Pulsar is an open-source publish-subscribe messaging system. It is unique in that it is a two-layer system in which the serving and storage layers are separated.
Pulsar runs with two supporting technologies, Apache BookKeeper and Apache ZooKeeper. Together, the three technologies provide a high-throughput, low-latency distributed messaging system.
Pulsar Broker - Serving Layer
Pulsar brokers act as the serving layer. Each topic partition is owned by exactly one Pulsar broker at a time, and all writes to and reads from that partition must go through that specific broker.
This design is beneficial because the Pulsar broker can keep the most recent message or messages in memory for faster log tail reads. Additionally, because a single Pulsar broker owns a topic partition, that broker always knows the id of the last confirmed entry, which helps with recovery after a failure.
BookKeeper - Storage Layer
BookKeeper bookies act as the persistent, immutable storage layer. Typically, at least three bookies store the data for each topic, for redundancy and performance.
BookKeeper stores log entries in ledgers. Ledgers are append-only and immutable data structures. In other words, data cannot be changed, which is an important feature for real-time message streaming.
Each log entry stored in a ledger is an indivisible unit of data. This data comes with metadata fields, including the entryId. The entryId for a log entry must be unique within a ledger. There is also an authentication value used to identify corrupt entries.
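As a rough illustration of these properties, the sketch below models a ledger in plain Python. It is a toy model, not the BookKeeper client API: the names ToyLedger and Entry are hypothetical, and a CRC32 checksum stands in for the authentication value as a simplifying assumption.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One indivisible log entry (hypothetical toy type, not the BookKeeper API)."""
    entry_id: int
    payload: bytes
    checksum: int  # assumption: CRC32 stands in for the authentication value


class ToyLedger:
    """Append-only list of entries whose ids are unique within the ledger."""

    def __init__(self):
        self._entries = []

    def append(self, payload):
        # The next unused index becomes the entry id, so ids never repeat.
        entry_id = len(self._entries)
        self._entries.append(Entry(entry_id, payload, zlib.crc32(payload)))
        return entry_id

    def read(self, entry_id):
        entry = self._entries[entry_id]
        # Recompute the checksum to detect a corrupt entry.
        if zlib.crc32(entry.payload) != entry.checksum:
            raise ValueError(f"entry {entry_id} is corrupt")
        return entry.payload
```

The class deliberately offers no update or delete method, which mirrors the append-only, immutable nature of a ledger.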
In addition to message data, the subscription positions for individual consumers are also persistently stored in BookKeeper. These subscription positions are referred to as cursors. Because cursors are stored in ledgers, position tracking scales along with the storage layer.
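The idea of a cursor can be sketched in a few lines of Python. This is a simplified toy model under stated assumptions: ToyCursor is a hypothetical name, an in-memory list stands in for the cursor ledger, and only cumulative acknowledgement is modeled.

```python
class ToyCursor:
    """Tracks one subscription's position in a topic (hypothetical toy model)."""

    def __init__(self, subscription_name):
        self.subscription_name = subscription_name
        # Append-only history of acknowledged entry ids, standing in for a cursor ledger.
        self._position_log = []

    def acknowledge(self, entry_id):
        """Record that every entry up to and including entry_id has been consumed."""
        self._position_log.append(entry_id)

    def next_entry_id(self):
        """Return the entry id at which the consumer should resume reading."""
        if not self._position_log:
            return 0  # nothing acknowledged yet, so start at the first entry
        return self._position_log[-1] + 1
```

Because the position is recorded durably rather than held only by the consumer, a consumer that reconnects can resume from next_entry_id() instead of re-reading the whole topic.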
Benefit of Pulsar’s Two-Layer System
Because the serving and storage layers are separate, they can be scaled independently. For instance, if you need to support more producers and consumers, then the number of brokers can be increased. When new brokers are added, the topic partitions are redistributed across all of the brokers by transferring ownership of some topic partitions to the new brokers.
Alternatively, if the amount of storage needs to be increased, then increase the number of BookKeeper bookies. The new bookies automatically begin receiving newly written messages. No rebalancing of existing data is required.
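The redistribution of ownership when a broker joins can be illustrated with a small sketch. This is not Pulsar's actual load-balancing algorithm; the function add_broker and the even-share policy are assumptions made purely for illustration.

```python
from collections import Counter


def add_broker(ownership, new_broker):
    """Return a new partition -> broker map after handing some partitions to new_broker.

    Hypothetical policy: the new broker receives an even share, taken from
    brokers that currently own more than that share.
    """
    result = dict(ownership)
    brokers = set(result.values()) | {new_broker}
    target = len(result) // len(brokers)  # even share, rounded down
    load = Counter(result.values())
    moved = 0
    for partition in sorted(result):
        if moved >= target:
            break
        owner = result[partition]
        if load[owner] > target:
            # Only ownership changes hands; no stored message data is copied.
            result[partition] = new_broker
            load[owner] -= 1
            moved += 1
    return result
```

For example, with six partitions split evenly between two brokers, adding a third broker moves one partition from each existing broker, leaving every broker with two. Since brokers hold no persistent data, such a move involves only the ownership record.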
ZooKeeper for Pulsar
Apache ZooKeeper handles metadata storage, configuration, and coordination tasks.
Locally, Pulsar relies on ZooKeeper at the cluster level for cluster-specific configuration and coordination. This runs on a one-to-one basis: each Pulsar cluster must have its own dedicated ZooKeeper cluster.
System-wide, Pulsar relies on ZooKeeper for configuration management of the entire system, across clusters. This is referred to as the Configuration Store.
Geo-replication & Pulsar Functions
In addition to its excellent performance, Pulsar is gaining popularity because of its built-in ability for geo-replication and a feature called Pulsar functions.
Geo-replication is defined on the Pulsar website as the replication of persistently stored message data across multiple clusters of a Pulsar instance. Geo-replication is a built-in feature in Pulsar. Some other distributed messaging systems, such as Apache Kafka, support geo-replication only with the assistance of an additional tool such as MirrorMaker.
Pulsar functions process simple tasks. These computations can be broken into three steps. First, a message is consumed from one or more Pulsar topics. Then, user-supplied processing logic or computation is applied to the message. Finally, the results of the processing are published to another topic to be read by subscribed consumers.
Pulsar functions do not completely remove the need for separate technologies such as Apache Heron, Apache Storm, or Apache Flink for complicated processing. However, the processing logic or computation is often simple enough to be handled natively using Pulsar functions.
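As a minimal sketch of the middle step, a Pulsar function written in Python can be as small as a single function named process. The uppercase transformation here is only an illustrative choice; the input and output topics are not named in the code because they are supplied as configuration when the function is deployed.

```python
def process(input):
    """User-supplied logic applied to each consumed message.

    The return value is what gets published to the output topic.
    The uppercase transformation is an arbitrary example.
    """
    return input.upper()
```

Calling process("hello") returns "HELLO". Consuming from the input topics and publishing the result are handled by the Pulsar functions runtime rather than by this code.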
Handling the computations natively within Pulsar preserves operational simplicity by not requiring additional technology for processing.
Adoption of Apache Pulsar
Pulsar is still a relatively new technology that hasn’t been as widely implemented as Apache Kafka or RabbitMQ. Additionally, the community is still relatively small, documentation is limited (but growing rapidly; see, for instance, the Apache Pulsar documentation), and in-house expertise is largely unavailable. However, for companies willing to make the switch, the benefits can vastly outweigh these current limitations.
For more information about Pulsar, including support, consulting, and fully managed Pulsar services, see our Apache Pulsar support and managed services page.