Kafka vs Pulsar

Published August 2022

Pulsar and Kafka achieve the same result. They both guarantee messages reach their intended destination(s). Yet, there are important differences between the two message queues. These differences can make one of the technologies a better fit, depending on your use case.

In this post we cover 8 ways in which Apache Kafka and Apache Pulsar compare, some similar, some divergent. There is no clear “better” technology. Rather, these are two strong technologies that each excel in their respective ways.

8 Ways to Compare Kafka & Pulsar

1- Philosophical Direction. 

The biggest difference between Pulsar and Kafka is their underlying philosophies.

Kafka strives for simplicity. The project is making installation and management simpler by working to remove its dependency on ZooKeeper.

Pulsar strives for modularity. Its modular architecture allows for independent scaling of the serving (Pulsar) and storage (BookKeeper) layers. The drawback of a modular architecture is increased complexity. Running Pulsar means installing and operating three components: Pulsar brokers, BookKeeper, and ZooKeeper.

2- Geo-replication.

One area where Kafka adds complexity is geo-replication. Pulsar has built-in replication across data centers, whereas Kafka requires a separate tool, MirrorMaker. Both approaches to replication achieve the same result.

Replicating data across data centers is important for two reasons. First, it improves performance for users in different geographic locations. Second, it protects you in the event of a cloud failure. See our post on preparing for a cloud outage for more information.

3- Functionality. 

Both Kafka and Pulsar advertise extensive functionality.  Kafka can perform data processing using Kafka Streams, and Pulsar uses Pulsar Functions.  

The issue here is that message queues should be reliable.  It’s their simplicity of purpose that drives their reliability.  Your message queue should be the most dependable component of your data infrastructure. With each new function you ask it to perform, you are adding complexity.  And that complexity can contribute to failure. 

For instance, if we’re running functions on the Pulsar brokers, those functions compete with message handling for CPU, and CPU usage becomes a greater liability.

So when it comes to extra functionality, we say it doesn’t matter how Pulsar and Kafka compare.  We recommend processing messages before or after your message queue.

4- Kafka has better throughput. 

Kafka tends to perform better at high throughput. For instance, Confluent reported a peak throughput of 605 MB/s for Kafka, compared with 305 MB/s for Pulsar.

Be careful with benchmarking tests though.  This article in DZone discusses how test results can be misleading.  

Both Kafka and Pulsar will likely give you the same ballpark results.  We recommend running your own proof-of-concept in your environment, on your data.  That will be the most reliable comparison.

5- Pulsar has lower latency with lower throughput.

Pulsar tends to have lower latency at lower throughput.

Its latency advantage is most pronounced when functions are combined with low throughput. But remember our caution about running functions in your message queue.
 

6- Batching data is important.

Whether you decide to use Pulsar or Kafka, you can improve throughput using batching. Batching refers to sending many messages at once through your message queue.  
 
When a message/batch is sent, Kafka/Pulsar will acknowledge receipt of the message/batch.
 
Imagine 100 messages sent one at a time. Kafka/Pulsar will go through the process of sending/receiving 100 acks.
 
If data is batched by 5 messages, then those 100 messages only require 20 acks.  This saves processing time and increases throughput. 
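
The arithmetic above can be checked with a short sketch. It assumes one ack per batch and ignores replication and acknowledgement settings; the function name acks_needed is hypothetical and not part of any Kafka or Pulsar API.

```python
import math


def acks_needed(total_messages: int, batch_size: int) -> int:
    """Return how many acks are exchanged if each batch gets exactly one ack."""
    if total_messages < 0 or batch_size < 1:
        raise ValueError("total_messages must be >= 0 and batch_size >= 1")
    # A final, partially filled batch still needs its own ack, hence ceil.
    return math.ceil(total_messages / batch_size)


unbatched = acks_needed(100, 1)  # one message per send
batched = acks_needed(100, 5)    # five messages per send
print(unbatched, batched)  # prints "100 20"
```

Larger batches reduce the ack count further, at the cost of the added latency discussed next.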
 
Batching adds latency because a batch is not sent until it is complete. With batches of 5, the first message waits until the fifth message is created. Whether this added latency is trivial or significant depends on your use case.
 
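When setting up batching, you can set a low linger time (linger.ms in Kafka) to limit latency. For instance, take a situation where the batch size is set to 1000 messages and the linger time is 1 ms. If the application generates 200 messages per millisecond, then the producer will only batch 200 messages. This is because the 1 ms limit triggers before there are 1000 messages. Note that Kafka's batch.size setting is measured in bytes rather than messages, so the message count here is illustrative.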
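
The linger example can be sketched as a small simulation. This is a simplified model, not how either client is implemented: it counts messages rather than bytes, and it checks the linger limit only when the next message arrives, whereas a real producer flushes on a timer. The function name and parameters are hypothetical. Times are in microseconds to keep the arithmetic in integers.

```python
def simulate_batches(arrival_times_us, max_batch_messages, linger_us):
    """Group arrival times into batches.

    A batch is flushed when it reaches max_batch_messages, or when the
    linger time has elapsed since the first message in the batch.
    """
    batches = []
    current = []
    for t in arrival_times_us:
        # Linger expired for the open batch: flush it before adding t.
        if current and t - current[0] >= linger_us:
            batches.append(current)
            current = []
        current.append(t)
        # Batch is full: flush immediately.
        if len(current) == max_batch_messages:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# 200 messages per millisecond for 3 ms: one message every 5 microseconds.
arrivals = [i * 5 for i in range(600)]
sizes = [len(b) for b in simulate_batches(arrivals, 1000, 1000)]
print(sizes)  # prints "[200, 200, 200]"
```

With a batch limit of 1000 and a 1 ms linger, every batch closes at 200 messages because the linger limit is reached first. Lowering the limit to 100 messages would make the size limit trigger first instead.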
 
It’s also important to remember that “real-time streaming” is an ideal. In practice it means latency of about 5 ms at low throughput.
 

7- Exactly once semantics.

Kafka ensures exactly-once message processing. This is a crucial guarantee for many use cases.

Pulsar ensures that no duplicate messages are stored in Pulsar. However, Pulsar does not ensure that consumers read each message only once.
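
When the queue does not guarantee that a consumer sees each message only once, a common workaround is to make the consumer idempotent. The following is a minimal sketch of that idea, assuming every message carries a unique ID. The class name IdempotentHandler is hypothetical and is not part of the Kafka or Pulsar client libraries. The in-memory set grows without bound and is lost on restart, so a real deployment would need a durable, bounded store.

```python
class IdempotentHandler:
    """Wrap a processing function so that redelivered messages are skipped."""

    def __init__(self, process):
        self._process = process
        self._seen = set()  # IDs of messages already processed

    def handle(self, message_id, payload):
        """Process payload once per message_id. Return True if it ran."""
        if message_id in self._seen:
            return False  # duplicate delivery, ignore
        self._process(payload)
        # Record the ID only after processing succeeds, so a failed
        # attempt can be retried on redelivery.
        self._seen.add(message_id)
        return True


processed = []
handler = IdempotentHandler(processed.append)
handler.handle("msg-1", "order created")
handler.handle("msg-1", "order created")  # redelivered duplicate
handler.handle("msg-2", "order paid")
print(processed)  # prints "['order created', 'order paid']"
```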

8- Breadth of support and ease of hiring.

Kafka is more widely adopted and has more learning resources. Because of that wider adoption, it’s also easier to hire staff with Kafka experience.
 
If you’re new to Pulsar, a great starting point is the short course our team made for Udemy.  It’s available for free.
 
We also have many articles discussing basic and advanced concepts for Kafka and Pulsar on our blog.
 
Kafka and Pulsar support is available from many companies (including Dattell). These companies offer a range of services, from hourly consulting to fully managed Kafka and Pulsar in your environment.

Final thoughts on Pulsar vs Kafka

Both Pulsar and Kafka can guarantee the delivery of messages to intended recipients.  Yet, they differ in complexity and modularity, latency and throughput.  
 
Another key distinction: Kafka ensures exactly-once processing at the consumer level. Pulsar only guarantees that each message is stored exactly once in Pulsar.
 
Whether using Kafka or Pulsar, we recommend you focus on reliability. You can do this by creating an implementation that is as simple as possible.
 

Looking for support?

Dattell provides 24×7 support and managed services for Kafka and Pulsar in our clients’ environments.