Get Higher Kafka Throughput in an Environment With Network Latency
Increasing Kafka throughput is a priority for many organizations. In this article we explain how to optimize your Apache Kafka brokers, consumers, and producers in order to get higher throughput in an environment with network latency.
For organizations that can benefit from individualized assistance, we offer a range of support services, including audits, staff augmentation, long-term consulting, and fully managed Kafka in your environment.
How Network Latency Works
Network latency is the time it takes to send data to and receive a response from an Apache Kafka broker. Throughput is the number of messages per second a Kafka cluster processes. As we will illustrate below, network latency limits throughput.
When a producer sends data to Kafka, the broker must receive and then acknowledge receipt of the message to the producer. If there is a one-way network latency of 5 ms, then that’s a 10 ms round trip network latency per batch of messages.
Assuming Kafka processes batches instantly, network latency alone would limit us to 100 batches per second per thread. To determine this ceiling, divide 1,000 ms by the round-trip network latency in milliseconds.
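Here is that arithmetic as a quick sketch, using the values from the example above:

```java
public class LatencyCeiling {
    public static void main(String[] args) {
        double oneWayLatencyMs = 5.0;                  // example latency from above
        double roundTripMs = 2 * oneWayLatencyMs;      // 10 ms per acknowledged batch
        double batchesPerSec = 1000.0 / roundTripMs;   // 1000 / 10 = 100 batches/sec
        System.out.printf("Max batches/sec per thread: %.0f%n", batchesPerSec);
    }
}
```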
The same calculation applies to a network latency of 2 ms (250 batches/sec) or 10 ms (50 batches/sec). The lower the latency, the greater the throughput will be at default settings.
Further down in the article we include a throughput calculator to illustrate how network latency affects how many messages per second can be sent and received through Kafka.
The graph below illustrates the effect of network latency on message throughput. We can see how the throughput of 2 ms network latency is roughly double the throughput of 5 ms network latency.
Data collection notes: For this test we limited the maximum in-flight messages to 1 per broker connection (the default is 5). As a result, throughput dropped to about one-fifth of what the default setting allows.
Chart: How Network Latency Affects Kafka Throughput (single-thread messages per second at different latencies).
Challenges When Optimizing For Throughput
The main challenge when optimizing an Apache Kafka implementation is that every change involves trade-offs, so both the positive and negative effects must be weighed.
For example, setting max in-flight messages per broker connection to a value above 1 will increase throughput without much effect on end-to-end latency; in fact, many clients default this setting to 5. The downside of a value above 1 is that messages can be processed out of order, even when using keys.
Imagine a client producer sends five batches in the order of 1-2-3-4-5, and a failure occurs with batch 3. The client producer will then retry batch 3, and Kafka will receive the five batches in order of 1-2-4-5-3. Messages will then be processed out of order by the client consumer because Kafka received the messages out of order.
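If you need ordering but still want more than one batch in flight, the usual remedy is the idempotent producer: with enable.idempotence=true, the broker detects and rejects out-of-sequence retries, preserving per-partition order with up to 5 in-flight requests. A minimal sketch, where the broker address and serializers are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class OrderedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        // Keep the throughput of multiple in-flight batches without reordering on retry.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "5");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send records here
        }
    }
}
```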
Optimizing Kafka Throughput
There are two adjustments that make the biggest impact on maximizing Kafka throughput: (A) increasing batch size and (B) increasing partition, producer, and consumer counts. There are additional changes that will increase throughput but have a smaller effect. We will cover all of these optimizations below.
Batch Size
Positive:
Sending 1 message per batch yields a maximum of 100 messages per second in a 5 ms latency environment. If instead there are 1,000 messages per batch, the maximum throughput increases to 100,000 messages per second.
Keep in mind that there are diminishing returns as the batch size gets larger.
Negative:
If sending 1,000 messages per batch, the producer waits until it accumulates 1,000 messages before sending the data. Make sure the time it takes to pool 1,000 messages is less than the round-trip network latency; if pooling takes longer, batching is no longer increasing throughput, and you should reduce the batch size.
Check the producer’s linger.ms setting, which controls how long the producer will wait before sending a batch out.
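To make that concrete, batching on the producer is governed by batch.size (bytes per batch) and linger.ms (how long to wait for a batch to fill); a batch is sent as soon as either limit is reached. The values below are examples only, not recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class BatchingConfig {
    // Sketch of the two settings that govern producer-side batching.
    static Properties batchingProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536"); // 64 KB per batch (default 16 KB)
        props.put(ProducerConfig.LINGER_MS_CONFIG, "5");      // wait up to 5 ms to fill a batch
        return props;
    }
}
```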
Kafka Throughput Calculator
We created a calculator to allow Kafka users to easily determine the messages per second as a factor of network latency and batch size.
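The math behind the calculator is simple. Here is a sketch of the same calculation, with example inputs:

```java
public class ThroughputCalculator {
    // Theoretical ceiling: batches per second allowed by round-trip latency,
    // times the number of messages riding in each batch.
    static double messagesPerSecond(double roundTripMs, int batchSize) {
        return (1000.0 / roundTripMs) * batchSize;
    }

    public static void main(String[] args) {
        System.out.println(messagesPerSecond(10, 1));    // 100.0
        System.out.println(messagesPerSecond(10, 1000)); // 100000.0
    }
}
```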
Kafka Throughput Calculator: Effect of Batching
Chart: Effect of batch size on Kafka throughput
This chart illustrates how batch size affects throughput. The orange line represents solely the effect of batching on throughput. The blue line shows how additional optimizations yield even more dramatic increases in throughput as batch size increases. Below we outline the additional optimizations.
Chart: How Batch Size Affects Kafka Throughput
Producer / Consumer / Partition Count
Increase the number of messages in flight by increasing the number of producers, partitions, and consumers. Messages are considered in flight once they have been sent but not yet acknowledged.
Positive:
Here are two examples to show how increasing the number of producers, partitions, and consumers increases messages in-flight.
- 25 producers with a maximum in-flight messages of 1 has the same maximum throughput as 5 producers with a maximum in-flight messages of 5.
- 10 partitions allow for a maximum of 10 active consumers in the same consumer group.
Negative:
Not all implementations support greater parallelism, and running additional producers and consumers increases the general load on Apache Kafka. That extra load may prompt updates to settings like num.network.threads, socket.send.buffer.bytes, socket.receive.buffer.bytes, and JVM heap size.
Let’s update our messages-per-second equation to account for batching and increased producer and consumer counts.
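In general terms, the theoretical ceiling multiplies the per-connection batch rate by batch size, in-flight requests, and producer count. A sketch of that formula follows; the exact inputs behind the figures below are not shown, so treat every value as an example:

```java
public class ParallelThroughput {
    // Sketch: each producer connection can keep maxInFlight batches on the wire at once.
    static double messagesPerSecond(double roundTripMs, int batchSize,
                                    int maxInFlight, int producerCount) {
        return (1000.0 / roundTripMs) * batchSize * maxInFlight * producerCount;
    }

    public static void main(String[] args) {
        // Example inputs only; real clusters fall short of this ceiling (see below).
        System.out.println(messagesPerSecond(10, 1000, 1, 2)); // 200000.0
    }
}
```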
There is additional overhead with batching and increasing producer, partition, and consumer count. When Kafka and consumer processing times are included, the maximum throughput is closer to 190,000 messages per second rather than 205,000.
Additional Settings for Increasing Throughput
While increasing concurrency/parallelism should be the main focus, here are some ancillary settings to consider.
Producer Settings
acks:
Setting acks to 0 switches the producer to a “fire and forget” approach. This yields greater throughput, but data can be lost because the producer never learns whether a write succeeded.
compression.type:
Depending on your data, batches may compress well, which means less data sent over the network. Keep in mind that compressing takes CPU time; snappy tends to perform best.
buffer.memory:
In high-latency environments, slow acknowledgments can cause the producer’s send buffer to fill. If the producer is blocking or failing because the buffer is full, consider increasing buffer.memory or the internal queue size.
enable.idempotence:
This setting is off by default in clients before Kafka 3.0 (and on by default since). If “exactly once” semantics are required, the producer must coordinate more closely with the Kafka broker, and throughput will be reduced.
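Pulled together, those four producer settings might look like the sketch below; the values are illustrative, not recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ThroughputProducerConfig {
    // Throughput-leaning producer settings (example values only).
    static Properties throughputProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.ACKS_CONFIG, "1");                   // "0" = fire and forget
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");  // shrink bytes on the wire
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, "67108864");   // 64 MB (default 32 MB)
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "false"); // trade guarantees for speed
        return props;
    }
}
```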
Consumer Settings
fetch.min.bytes:
This controls the minimum amount of data the broker will return for a consumer fetch (consumers pull from the broker, not the producer). Increasing this value may help reduce round trips and improve throughput.
fetch.max.bytes:
This is the maximum amount of data a consumer can pull in a single fetch. Increasing this value may help reduce round trips and improve throughput.
fetch.max.wait.ms:
This is the maximum time the broker will wait for fetch.min.bytes of data to accumulate before responding, similar to the producer’s linger.ms setting.
max.poll.records:
This setting controls the maximum number of records a single call to poll() returns. It does not change what is fetched over the network, but larger values let the application hand more records to processing per poll loop.
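The consumer-side settings as one sketch, again with illustrative values:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class ThroughputConsumerConfig {
    // Throughput-leaning consumer settings (example values only).
    static Properties throughputProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "65536");    // wait for at least 64 KB
        props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, "52428800"); // 50 MB cap (the default)
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "500");    // broker waits up to 500 ms
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "1000");    // default is 500
        return props;
    }
}
```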
Broker Settings
num.network.threads:
This setting controls how many threads the broker uses to handle network requests. In a high-latency environment, the number of network threads may become a bottleneck if there are many slow network connections. Increasing this setting can help maintain throughput under network delays.
socket.send.buffer.bytes:
This is the size of the buffer for outgoing data. In high-latency environments, a larger buffer may be beneficial to accommodate delayed acknowledgments or larger batches.
socket.receive.buffer.bytes:
Similar to socket.send.buffer.bytes, this setting controls the buffer size for incoming data. In high-latency networks, larger buffers help handle delayed messages and ensure smooth data processing.
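These broker settings live in server.properties. A hedged example with illustrative values, with the defaults noted in comments:

```properties
# server.properties excerpt: sizing for a high-latency environment (example values)

# Default is 3.
num.network.threads=8

# 1 MB socket buffers (default is 100 KB; -1 uses the OS default).
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
```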
24x7 Kafka Support & Consulting
Visit our Apache Kafka® page for more details on our support services.