How to Tune Kafka for High Throughput in Cloud Environments

Apache Kafka is known for its ability to handle high-throughput event streaming, but achieving optimal performance in cloud environments requires more than just out-of-the-box configuration. Whether you’re running Kafka on Kubernetes, virtual machines, or using a managed service, tuning for high throughput can make or break your system’s efficiency.

In this post, we’ll walk through key Kafka tuning areas specifically tailored for cloud environments, including broker settings, producer and consumer configurations, and infrastructure considerations.

Understand Your Workload

Before jumping into configuration changes, it’s critical to understand:

  • Message size: Small vs. large messages behave very differently.
  • Message rate: Are you dealing with consistent loads or traffic bursts?
  • Latency tolerance: Can you afford some delay for batching, or is near-real-time critical?

Knowing these factors will guide every other tuning decision. If you don’t have monitoring set up to track these metrics, that is the first step. Check out our guides on how to use open source tools for free Kafka monitoring.
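As a rough illustration of why message size matters as much as message rate, the sketch below converts the two into a raw byte rate. The workload numbers are hypothetical, and the result ignores replication and compression.

```python
def throughput_mb_per_s(messages_per_s: int, avg_message_bytes: int) -> float:
    """Raw ingress in MB/s (1 MB = 1,000,000 bytes), before replication or compression."""
    return messages_per_s * avg_message_bytes / 1_000_000


# Hypothetical workloads: the same message rate, very different byte rates.
small = throughput_mb_per_s(50_000, 200)      # 50k msgs/s at 200 bytes each
large = throughput_mb_per_s(50_000, 20_000)   # 50k msgs/s at 20 KB each
print(f"small messages: {small:.1f} MB/s, large messages: {large:.1f} MB/s")
```

Plugging in measured values from your own monitoring gives a baseline to compare against network and disk limits later on.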

Optimize Broker Settings

num.network.threads and num.io.threads

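num.network.threads sets how many threads receive requests and send responses over the network; num.io.threads sets how many threads process those requests, including disk I/O. The defaults are 3 and 8. Increase them based on your instance type and expected concurrency. For cloud VMs with 8+ vCPUs, starting with 8-16 threads each is a good rule of thumb.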

socket.send.buffer.bytes and socket.receive.buffer.bytes

Set these large enough to avoid bottlenecks (e.g., 1MB), especially in high-latency cloud networks.
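One way to sanity-check a buffer size is the bandwidth-delay product: the number of bytes that can be in flight on a connection at once. The sketch below is a back-of-the-envelope calculation only; the link speed and round-trip times are assumed values, not measurements.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s: int, rtt_us: int) -> int:
    """Bytes in flight on one connection: bandwidth multiplied by round-trip time."""
    # bits/s * microseconds = bits * 1e6, so divide by 8 (bits per byte) * 1e6.
    return bandwidth_bits_per_s * rtt_us // 8_000_000


# Assumed figures for illustration: a 10 Gbit/s link at two round-trip times.
same_az = bandwidth_delay_product_bytes(10_000_000_000, 500)     # 0.5 ms RTT
cross_az = bandwidth_delay_product_bytes(10_000_000_000, 2_000)  # 2 ms RTT
print(f"same AZ: {same_az:,} bytes, cross AZ: {cross_az:,} bytes")
```

If the result is well above your configured socket buffers, a single connection may not be able to fill the link, which is one reason higher-latency paths tend to need larger buffers.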

message.max.bytes and replica.fetch.max.bytes

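Size these for your largest expected messages, not just the average. A common convention is to keep replica.fetch.max.bytes at least as large as message.max.bytes so that followers can fetch the largest batches without falling back to one-batch-at-a-time progress.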

log.segment.bytes and log.retention.hours

Adjust based on your throughput and retention requirements. Smaller segment sizes can reduce recovery time after failure.
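To make the broker settings above concrete, here is a minimal sketch that collects them in one place and renders them as server.properties lines. The property names are the ones discussed in this section; every value is an illustrative starting point that we assume for the example, not a recommendation for your cluster.

```python
# Illustrative values only; validate each one against your own benchmarks.
broker_overrides = {
    "num.network.threads": 8,
    "num.io.threads": 16,
    "socket.send.buffer.bytes": 1_048_576,      # 1 MB
    "socket.receive.buffer.bytes": 1_048_576,   # 1 MB
    "message.max.bytes": 2_097_152,             # 2 MB
    "replica.fetch.max.bytes": 2_097_152,       # kept >= message.max.bytes
    "log.segment.bytes": 536_870_912,           # 512 MB
    "log.retention.hours": 72,
}


def render_properties(overrides: dict) -> str:
    """Render a dict as key=value lines in server.properties style."""
    return "\n".join(f"{key}={value}" for key, value in overrides.items())


print(render_properties(broker_overrides))
```

Keeping overrides in a single structure like this makes it easier to review what differs from the defaults and to apply the same set across brokers.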

For more information, check out our deep dive on optimizing Kafka brokers.

Tune Producers for Throughput

Batching

Set linger.ms to a few milliseconds (e.g., 5-20ms) to allow batching, and batch.size to 32KB-128KB depending on payload size.
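The producer sends a batch when it reaches batch.size or when linger.ms elapses, whichever comes first. The sketch below estimates which limit a partition hits under a given load. It ignores per-record overhead and compression, and the per-partition rate is a hypothetical figure.

```python
# Illustrative producer settings taken from the ranges mentioned above.
producer_batching = {"linger.ms": 10, "batch.size": 65_536}


def messages_per_batch(batch_size_bytes: int, avg_message_bytes: int) -> int:
    """Approximate messages per full batch, ignoring record overhead."""
    return max(1, batch_size_bytes // avg_message_bytes)


def batch_fill_time_ms(batch_size_bytes: int, msgs_per_s: int, avg_message_bytes: int) -> float:
    """Milliseconds for one partition's traffic to fill a batch."""
    return batch_size_bytes * 1000 / (msgs_per_s * avg_message_bytes)


# Hypothetical load: 1,000 msgs/s of 1 KB each landing on a single partition.
per_batch = messages_per_batch(producer_batching["batch.size"], 1024)
fill_ms = batch_fill_time_ms(producer_batching["batch.size"], 1_000, 1024)
limit = "linger.ms" if fill_ms > producer_batching["linger.ms"] else "batch.size"
print(f"{per_batch} msgs per full batch, {fill_ms:.0f} ms to fill, sent on {limit}")
```

In this example the batch would take longer to fill than linger.ms allows, so batches go out partly full. Raising batch.size further would not help here; a longer linger.ms or more traffic per partition would.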

Compression

Use compression.type=snappy or lz4 for better throughput without significant CPU cost.
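Kafka producers compress whole batches rather than individual messages, which is why compression and batching reinforce each other. The sketch below illustrates the effect using zlib from the Python standard library as a stand-in, since snappy and lz4 are not in the standard library. The message shape is made up, and the absolute sizes will differ from what snappy or lz4 produce.

```python
import json
import zlib

# Made-up, repetitive JSON events similar to typical streaming payloads.
messages = [
    json.dumps({"user_id": i, "event": "page_view", "region": "us-east-1"}).encode()
    for i in range(100)
]

raw_bytes = sum(len(m) for m in messages)
# Compressing each message separately cannot exploit repetition across messages.
individually_compressed = sum(len(zlib.compress(m)) for m in messages)
# Compressing the concatenated batch can.
batch_compressed = len(zlib.compress(b"".join(messages)))

print(f"raw: {raw_bytes} B")
print(f"compressed one by one: {individually_compressed} B")
print(f"compressed as a batch: {batch_compressed} B")
```

The same reasoning suggests measuring compression ratio together with your batching settings rather than in isolation.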

Acknowledgments

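If you can tolerate a small risk of data loss, use acks=1 instead of acks=all. The producer then waits only for the leader to write the message rather than for all in-sync replicas, which reduces replication wait time. The trade-off is durability: an acknowledged message can be lost if the leader fails before followers have copied it.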

In-Flight Requests

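max.in.flight.requests.per.connection controls how many unacknowledged requests the producer pipelines on each broker connection. Recent clients default to 5. Values above 1 improve utilization, but be cautious of message reordering if retries are enabled and idempotence is off. With enable.idempotence=true (which requires acks=all), ordering is preserved for up to 5 in-flight requests.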
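The interaction between acks, retries, idempotence, and in-flight requests is easy to get wrong, so here is a small sketch of a check you might run over a producer config before deploying it. The function name is hypothetical, and the check deliberately takes a conservative view: it treats idempotence as off unless the config sets it explicitly.

```python
def ordering_preserved(config: dict) -> bool:
    """Return True if retries cannot reorder messages within a partition.

    Conservative assumption: idempotence counts only when set explicitly
    together with acks=all.
    """
    in_flight = config.get("max.in.flight.requests.per.connection", 5)
    retries = config.get("retries", 2_147_483_647)
    idempotent = (
        config.get("enable.idempotence", False)
        and str(config.get("acks", "all")) in ("all", "-1")
    )
    if retries == 0 or in_flight == 1:
        return True  # nothing to retry, or only one request outstanding
    return bool(idempotent) and in_flight <= 5


throughput_first = {"acks": "1", "retries": 3, "max.in.flight.requests.per.connection": 5}
ordered = {"acks": "all", "enable.idempotence": True, "max.in.flight.requests.per.connection": 5}

print("throughput_first keeps order:", ordering_preserved(throughput_first))
print("ordered keeps order:", ordering_preserved(ordered))
```

If ordering matters for a topic, the second profile keeps pipelining while avoiding reordering, at the cost of waiting for all in-sync replicas.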

Tune Consumers for Speed

Fetch Size

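Increase fetch.min.bytes and fetch.max.bytes to improve throughput. fetch.max.wait.ms caps how long the broker waits for fetch.min.bytes to accumulate before responding. Its default is 500ms, which already allows for fuller batches; lower it (e.g., to 50ms) only if that added latency is a problem on low-traffic topics.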
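A simple model helps when choosing these values: the broker answers a fetch once fetch.min.bytes are available or fetch.max.wait.ms has passed, whichever happens first. The sketch below applies that model to two assumed traffic levels. The config values are illustrative and the function name is ours.

```python
# Illustrative consumer settings; tune against your own measurements.
consumer_fetch = {
    "fetch.min.bytes": 65_536,
    "fetch.max.bytes": 104_857_600,
    "fetch.max.wait.ms": 500,
}


def fetch_wait_ms(incoming_bytes_per_s: int, fetch_min_bytes: int, fetch_max_wait_ms: int) -> float:
    """Approximate time the broker holds a fetch before responding."""
    if incoming_bytes_per_s <= 0:
        return float(fetch_max_wait_ms)  # no data arriving: the timeout decides
    time_to_fill_ms = fetch_min_bytes * 1000 / incoming_bytes_per_s
    return min(float(fetch_max_wait_ms), time_to_fill_ms)


# Assumed traffic for the partitions one consumer fetches from a broker.
busy = fetch_wait_ms(1_000_000, consumer_fetch["fetch.min.bytes"], consumer_fetch["fetch.max.wait.ms"])
quiet = fetch_wait_ms(10_000, consumer_fetch["fetch.min.bytes"], consumer_fetch["fetch.max.wait.ms"])
print(f"busy topic: ~{busy:.1f} ms per fetch, quiet topic: ~{quiet:.1f} ms per fetch")
```

On the busy topic the byte threshold is reached well before the timeout, so the wait setting rarely matters; on the quiet topic the timeout sets the latency.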

Parallelism

Run multiple consumer instances and use consumer groups to parallelize consumption.
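Within a consumer group, each partition is read by at most one consumer, so the partition count caps useful parallelism. The sketch below uses a simplified round-robin assignment to show this. It is an illustration only and not how Kafka's built-in assignors are implemented.

```python
def assign_round_robin(partitions: int, consumers: int) -> dict:
    """Spread partition ids across consumer ids in round-robin order."""
    assignment = {consumer: [] for consumer in range(consumers)}
    for partition in range(partitions):
        assignment[partition % consumers].append(partition)
    return assignment


def idle_consumers(assignment: dict) -> int:
    """Count consumers that received no partitions."""
    return sum(1 for owned in assignment.values() if not owned)


balanced = assign_round_robin(partitions=6, consumers=4)
oversized = assign_round_robin(partitions=3, consumers=5)
print("6 partitions, 4 consumers:", balanced)
print("3 partitions, 5 consumers:", oversized, "idle:", idle_consumers(oversized))
```

Adding consumers beyond the number of partitions only adds idle instances, so plan partition counts with your target consumer parallelism in mind.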

Commit Strategy

Batch commit offsets instead of committing every message.
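A minimal sketch of batched commits follows. It tracks the latest processed offset per partition and calls a commit function once every N messages. The class and the commit callback are hypothetical; in a real consumer the callback would wrap your client library's commit call.

```python
class BatchCommitter:
    """Accumulate processed offsets and commit them in batches."""

    def __init__(self, commit_fn, every_n: int = 500):
        self.commit_fn = commit_fn
        self.every_n = every_n
        self._pending = {}   # partition -> next offset to read
        self._count = 0

    def record(self, partition: int, offset: int) -> None:
        # The committed offset is the next message to read, hence offset + 1.
        self._pending[partition] = offset + 1
        self._count += 1
        if self._count >= self.every_n:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            self.commit_fn(dict(self._pending))
            self._pending.clear()
        self._count = 0


commits = []  # stand-in for a real commit call; just records what would be sent
committer = BatchCommitter(commits.append, every_n=500)
for offset in range(1200):
    committer.record(partition=0, offset=offset)
committer.flush()  # commit the remainder, e.g. on shutdown or rebalance
print(f"{len(commits)} commits for 1200 messages; last: {commits[-1]}")
```

Larger batches mean fewer commit requests but more messages to reprocess after a crash, so pick N with your reprocessing tolerance in mind.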

Cloud-Specific Considerations

Network Performance

Use enhanced networking options (e.g., AWS ENA, Azure Accelerated Networking). Avoid cross-AZ traffic when possible.

Storage IOPS

Kafka depends heavily on disk I/O. Use SSD-backed storage and monitor IOPS usage closely.
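To estimate how much disk write bandwidth each broker needs, multiply producer ingress by the replication factor and divide by the broker count. The sketch below does that arithmetic under the assumption that partitions and leaders are spread evenly; the cluster numbers are hypothetical.

```python
def per_broker_write_mb_s(ingress_mb_s: float, replication_factor: int, brokers: int) -> float:
    """Average disk write rate per broker, assuming an even partition spread."""
    return ingress_mb_s * replication_factor / brokers


# Hypothetical cluster: 100 MB/s of producer traffic, replication factor 3, 6 brokers.
write_rate = per_broker_write_mb_s(100, 3, 6)
print(f"each broker writes roughly {write_rate:.0f} MB/s")
```

Compare the result with the sustained throughput of your volume type, and leave headroom for consumers that read older data from disk rather than from page cache.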

Auto-scaling Caution

If using Kubernetes or cloud auto-scaling, be aware of how pod/container restarts impact broker availability and partition leadership.  If using Kubernetes and having issues, check out our post on issues with deploying Kafka in Kubernetes.

Benchmark and Iterate

There’s no one-size-fits-all tuning. Use tools like:

  • kafka-producer-perf-test.sh
  • kafka-consumer-perf-test.sh
  • Custom workload simulations

Benchmark regularly and monitor throughput, latency, consumer lag, and broker metrics.
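As a starting point, the sketch below assembles a kafka-producer-perf-test.sh command line so the same benchmark can be repeated with different producer settings. The bootstrap address and topic name are placeholders, and flag names can vary between Kafka versions, so check the script's --help output before relying on them.

```python
import shlex


def producer_perf_command(bootstrap: str, topic: str, num_records: int,
                          record_size: int, producer_props: dict) -> list:
    """Build the argument list for kafka-producer-perf-test.sh."""
    cmd = [
        "kafka-producer-perf-test.sh",
        "--topic", topic,
        "--num-records", str(num_records),
        "--record-size", str(record_size),
        "--throughput", "-1",  # -1 disables throttling
        "--producer-props", f"bootstrap.servers={bootstrap}",
    ]
    cmd += [f"{key}={value}" for key, value in producer_props.items()]
    return cmd


# Placeholder address and topic; the settings mirror the producer section above.
cmd = producer_perf_command(
    "localhost:9092", "perf-test", 1_000_000, 1024,
    {"acks": "1", "linger.ms": 10, "batch.size": 65_536, "compression.type": "lz4"},
)
print(shlex.join(cmd))
```

Changing one setting at a time and rerunning the same command makes it much easier to attribute a throughput change to a specific setting.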

Summing it up

Cloud environments add layers of variability to Kafka performance, but with thoughtful tuning, you can consistently hit high-throughput goals. Focus on batching, compression, buffer sizes, and cloud-specific optimizations. And always benchmark based on your workload.

We also have a number of additional articles on our blog that cover many aspects of best practices for Kafka.  Check out Get Higher Kafka Throughput in an Environment With Network Latency.

If you’d like help tuning or architecting Kafka for your cloud environment, contact us — our team has helped organizations stream billions of events per day with low latency and high reliability.

24x7 Kafka Support & Consulting

Visit our Apache Kafka® page for more details on our support services.
