Kafka vs. Pulsar

Apache Pulsar™ and Apache Kafka® are two open source, distributed messaging systems that ensure real-time data streaming with exactly once or at least once message delivery.

Both Kafka and Pulsar guarantee messages reach their intended destination(s). However, there are important differences between the two technologies. These differences can make one a better fit than the other, depending on your use case.

Brief Background on Real-Time Data Processing

Real-time data processing is critical to many companies. Whether streaming financial, healthcare, entertainment, or other data, companies need to ensure that information is consumed by the intended applications.

And, importantly, the data architecture must be fault-tolerant. This means that data is received even in the case of an application failure.

Data engineers have traditionally used either message queues or publish-subscribe tools to ensure messages reach their intended destination(s). Message queues are helpful but limited because each message is removed from the queue after a single consumer reads it. This approach isn’t compatible with building highly scalable systems.

Distributed messaging systems, like Pulsar and Kafka, offer a better way to ensure data is received.

Distributed messaging systems build on the traditional message queue model, adding in components of publish-subscribe tools – specifically consumer groups and broker retention. Combining these two approaches provides the kind of fault-tolerance and reliability needed for many use cases.
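To make consumer groups and broker retention concrete, here is a minimal in-memory sketch. It is not how Kafka or Pulsar are implemented; the `TinyLog` class and the group names are made up for illustration. The point is that a message stays in the log after it is read, and each consumer group keeps its own position.

```python
from collections import defaultdict


class TinyLog:
    """Toy append-only log (hypothetical): messages are retained after
    being read, and each consumer group tracks its own offset."""

    def __init__(self):
        self.messages = []               # retained messages, in arrival order
        self.offsets = defaultdict(int)  # group name -> next offset to read

    def publish(self, message):
        self.messages.append(message)

    def poll(self, group):
        """Return the next unread message for this group, or None."""
        offset = self.offsets[group]
        if offset >= len(self.messages):
            return None
        self.offsets[group] = offset + 1
        return self.messages[offset]


log = TinyLog()
log.publish("order-1")
log.publish("order-2")

# The "billing" group reads both messages...
billing = [log.poll("billing"), log.poll("billing")]
# ...and the "analytics" group still starts from the first one.
analytics_first = log.poll("analytics")
```

With a plain message queue, the second group would find nothing left to read. Here both groups see every message, which is what lets multiple applications consume the same stream independently.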

Comparing Pulsar and Kafka

Simplicity vs. Modularity

The most significant difference between Pulsar and Kafka is the tension between modularity and simplicity.

Pulsar has a modular architecture. This allows for independent scaling of the serving and storage layers, Pulsar and BookKeeper respectively. While this independent scaling is beneficial, it does come with increased complexity.

In addition to installing Pulsar (broker), we must also install BookKeeper and ZooKeeper. These are two additional components to optimize and maintain.

With Kafka, you lose the ability to scale the serving and storage layers independently but get a simpler architecture.

Kafka continues to value simplicity. For instance, Kafka is actively making installation and management simpler (and metadata reads faster) by replacing ZooKeeper with KRaft.

Geo-replication

Companies often choose to replicate data across multiple data centers. Geo-replication helps in two ways. (1) Geo-replication improves performance when users are spread over a range of geo-locations. (2) Replication across data centers offers important protection against data loss in the event of a cloud failure.

Pulsar has built-in data center replication, whereas Kafka requires a separate tool, MirrorMaker 2.

Pulsar supports synchronous geo-replication. This approach forces clients to wait for both local and remote cluster(s) to have the data before considering the data to be received by Pulsar. Synchronous geo-replication ensures that any data Pulsar acknowledges receipt of is stored both locally and remotely. This is a key feature for exactly once and at least once message semantics.
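The difference between waiting for remote clusters and not waiting can be sketched in a few lines. This is a simplified model, not Pulsar code: the `Cluster` class and both `publish_*` functions are hypothetical names used only to show when the client receives its acknowledgement.

```python
class Cluster:
    """Hypothetical stand-in for one data center's storage."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.stored = []

    def store(self, message):
        if not self.available:
            return False
        self.stored.append(message)
        return True


def publish_sync(message, local, remotes):
    """Acknowledge only after the local AND every remote cluster stored it."""
    results = [cluster.store(message) for cluster in [local, *remotes]]
    return all(results)


def publish_async(message, local, replication_backlog):
    """Acknowledge after the local write; remote copies happen later."""
    acked = local.store(message)
    if acked:
        replication_backlog.append(message)  # shipped to remotes in the background
    return acked


east, west = Cluster("east"), Cluster("west")
ack_ok = publish_sync("m1", east, [west])             # both clusters hold m1

west.available = False
ack_during_outage = publish_sync("m2", east, [west])  # no ack: west is missing m2
```

In the synchronous case, an acknowledgement means every cluster has the message. The cost is that a remote outage or a slow cross-region link is felt directly by the producer, which has to retry.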

Additionally, Pulsar supports two-way geo-replication with topics that have the same exact name in both clusters. This feature ensures any failovers that may happen do not require code changes to update to the new topic name.

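Kafka does not support synchronous geo-replication. In addition, with MirrorMaker 2’s default replication policy, Kafka topics in the remote cluster are renamed with the source cluster’s alias as a prefix, so they do not share the original topic’s name.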
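To illustrate why the topic name matters during a failover, here is a small sketch. It assumes MirrorMaker 2's default naming, which prefixes the mirrored topic with the source cluster alias and a `.` separator. The helper functions and the `us-east` alias are hypothetical, not part of any Kafka API.

```python
def remote_topic_name(source_alias, topic, separator="."):
    """Name of a mirrored topic under the assumed default naming scheme."""
    return f"{source_alias}{separator}{topic}"


def topics_to_subscribe(topic, source_aliases):
    """Names a consumer would need in order to read a topic plus its mirrored copies."""
    return [topic] + [remote_topic_name(alias, topic) for alias in source_aliases]


# A consumer in the remote cluster has to know about the prefixed name.
names = topics_to_subscribe("orders", ["us-east"])
```

Here `names` contains both `orders` and `us-east.orders`. Applications that fail over to the remote cluster need configuration or code that accounts for the second name, which is the extra step that same-name replication avoids.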

Extra Functionality

Pulsar and Kafka both offer extensive functionality, via Pulsar Functions and Kafka Streams.

The problem with extra functionality is that it adds complexity. And that complexity can contribute to failure. As an example, if you’re running functions on the Pulsar/Kafka brokers, then CPU usage becomes a greater liability.

It follows then that it doesn’t matter how the extra functionality differs between Pulsar and Kafka. To ensure the reliability of Kafka/Pulsar, we recommend processing messages before or after they pass through the message queue.

Kafka is more reliable

Kafka is more reliable than Pulsar. Because Pulsar is more complex, there is more to test with new versions. And with Pulsar’s smaller community, there are fewer people doing the testing.

This combination of Pulsar’s increased complexity and a smaller community leads to more bugs in new versions.

Pulsar might provide lower latency

In lower-throughput use cases, Pulsar tends to have lower latency than Kafka. However, competing benchmark tests show different results (example 1 and example 2).

The biggest latency edge is observed with the combination of low throughput and functions. Of course, I don’t recommend using extra functions, so this benefit is a moot point.

It’s best to run your own testing using your real data and in your environment.
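As a starting point for your own testing, a small harness like the one below can time each send and report latency percentiles. It is only a sketch: `fake_send` is a placeholder that sleeps for a millisecond, and you would swap in a real Kafka or Pulsar producer call that waits for the broker acknowledgement.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def measure_latencies_ms(send, payloads):
    """Time each call to `send` and return the latencies in milliseconds."""
    latencies = []
    for payload in payloads:
        start = time.perf_counter()
        send(payload)
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies


def fake_send(payload):
    """Placeholder: replace with a real producer call that waits for the ack."""
    time.sleep(0.001)


latencies = measure_latencies_ms(fake_send, [b"x" * 100] * 20)
report = {"p50": percentile(latencies, 50), "p99": percentile(latencies, 99)}
```

Look at tail percentiles such as p99 as well as the median, and use message sizes and rates that match your production traffic. Averages alone tend to hide the slow requests that matter most.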

Message processing guarantees

There are three categories of message semantics. At least once guarantees that messages are received, but there could be duplicate messages. At most once guarantees that messages are only sent once, but doesn’t guarantee that messages are received. And exactly once guarantees that messages are both received and not duplicated.
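On the consumer side, the difference between at most once and at least once often comes down to when the read position is committed. The simulation below is a simplified sketch, not Kafka or Pulsar client code. The `consume` function is hypothetical and models a single consumer that crashes once and then restarts from its last committed offset.

```python
def consume(messages, commit_before_processing, crash_on=None):
    """Simulate one consumer that crashes once while handling `crash_on`,
    then restarts from the last committed offset."""
    committed = 0      # next offset to read after a restart
    processed = []     # side effects that actually happened
    crashed = False
    while committed < len(messages):
        offset = committed
        msg = messages[offset]
        if commit_before_processing:
            committed = offset + 1      # at most once: commit first
        if msg == crash_on and not crashed:
            crashed = True
            if not commit_before_processing:
                processed.append(msg)   # work was done, but the commit is lost
            continue                    # restart from the committed offset
        processed.append(msg)
        if not commit_before_processing:
            committed = offset + 1      # at least once: commit last
    return processed


stream = ["a", "b", "c"]
at_most_once = consume(stream, commit_before_processing=True, crash_on="b")
at_least_once = consume(stream, commit_before_processing=False, crash_on="b")
```

Committing first loses `b` (`at_most_once` is `["a", "c"]`), while committing last processes `b` twice (`at_least_once` is `["a", "b", "b", "c"]`). Exactly once needs the at-least-once behavior plus some way to discard or neutralize the duplicate.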

Typically companies require at least once and strive for exactly once semantics.

Both Kafka and Pulsar broadly support at least once messaging.

Exactly once is also supported by both Kafka and Pulsar but might only be available in specific situations. For instance, exactly once processing is possible with Pulsar when using an idempotent producer.
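The idea behind an idempotent producer can be sketched as follows: every message carries a producer ID and a sequence number, and the broker drops anything it has already stored. This is a simplified model with a hypothetical `DedupBroker` class. The real mechanisms in Kafka and Pulsar differ in detail and are not reproduced here.

```python
class DedupBroker:
    """Toy broker that ignores retried messages it has already stored."""

    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer id -> highest sequence number stored

    def append(self, producer_id, seq, payload):
        """Store the message unless this producer already sent this sequence."""
        if seq <= self.last_seq.get(producer_id, -1):
            return "duplicate"
        self.log.append(payload)
        self.last_seq[producer_id] = seq
        return "stored"


broker = DedupBroker()
first = broker.append("producer-1", 0, "payment-42")
# The ack was lost in transit, so the producer retries the same sequence number.
retry = broker.append("producer-1", 0, "payment-42")
```

The retry is acknowledged as a duplicate and the log still holds a single copy of `payment-42`. Note that this only covers the producer-to-broker hop. Duplicates introduced on the consumer side need their own handling.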

The choice of optimizing for at least once, at most once, or exactly once processing will depend on your project’s individual needs.

Clients

Kafka is a more mature technology with many full-featured client producers and consumers. Pulsar is less mature and less widely adopted, and it has fewer clients.

At this time, Pulsar’s Java client is the only full-featured client. All other language clients are supported by small numbers of people and are missing features.

Ease of hiring and outside support

Kafka is more widely adopted and has more learning resources. For instance, there is a thorough and low cost training series for Kafka available on Udemy. Additionally, because Kafka is more widely adopted, it’s easier to hire Kafka staff.

If you’re new to Pulsar, a great place to start is the free resource my company has available on our website: Pulsar training course.

Fully managed Kafka and Pulsar support is available from many companies (including my company, Dattell – managed Kafka and managed Pulsar).

Final thoughts on Pulsar vs Kafka

Both Pulsar and Kafka do what is expected of a distributed messaging system: they guarantee the delivery of messages to intended recipients. Yet, they differ in complexity, modularity, geo-replication approaches, and reliability.

When choosing between Pulsar and Kafka, prioritize the features that are most important for your use case, such as geo-replication or full-featured clients. And consider running tests using your real data in your own environment.

24x7 Kafka & Pulsar Support & Consulting

Visit our Apache Kafka® & Apache Pulsar™ pages for more details on our support services.
