Apache Kafka is a distributed system, running in a cluster with each of the nodes referred to as brokers. Kafka topics are partitioned and replicated across the brokers throughout the entirety of the implementation. These partitions allow users to parallelize topics, meaning data for any topic can be divided over multiple brokers. A critical component of Kafka optimization is optimizing the number of partitions in the implementation.
Since a topic can be split into partitions over multiple machines, multiple consumers can read a topic in parallel. This organization sets Kafka up for high message throughput.
In other words, the greater the parallelization the greater the throughput.
However, you don’t necessarily want to use more partitions than needed because increasing partition count simultaneously increases the number of open server files and leads to increased replication latency.
For most implementations you want to follow the rule of thumb of 10 partitions per topic, and 10,000 partitions per Kafka cluster. Going beyond that amount can require additional monitoring and optimization.
Calculating Kafka Partition Requirements
Here is the calculation we use to optimize the number of partitions for a Kafka implementation.
# Partitions = Desired Throughput / Partition Speed
Conservatively, you can estimate that a single partition for a single Kafka topic runs at 10 MB/s.
As an example, if your desired throughput is 5 TB per day. That figure comes out to about 58 MB/s. Using the estimate of 10 MB/s per partition, this example implementation would require 6 partitions.
For the example above, the number of partitions is set using the following code:
bin/kafka-topics.sh –zookeeper ip_addr_of_zookeeper:2181 –create –topic my-topic –partitions 6 –replication-factor 3 –config max.message.bytes=64000 –config flush.messages=1
The replication factor is set to 3 as a default. While partitions reflect horizontal scaling of unique information, replication factors refer to backups. For a replication factor of 3 in the example above, there are 18 partitions in total with 6 partitions being the originals and then 2 copies of each of those unique partitions.
As with all collected data, you want to ensure information is not lost if there is a failure. Creating replicated partitions is an important component to preventing data loss.
Starting with your partition estimate, it is best then to test partition throughput. Setting up Kafka monitoring will enable you to easily run these tests. Our post on Kafka monitoring with Elasticsearch and Kibana is a good place to start.
Kafka Optimization Resources
We have several other articles on similar Kafka optimization topics to help you with your Kafka implementation.
- Creating a Kafka Topic — Kafka is structured by its four primary components: topics, producers, consumers, and brokers. In this post, we discuss topics.
- Kafka Optimization for General Use Cases — Issues with Apache Kafka performance are directly tied to system optimization and utilization. Here, we compiled the best practices for a high volume, clustered, general use case.
- Kafka Monitoring With Elasticsearch and Kibana — Monitoring Kafka cluster performance is crucial for diagnosing system issues and preventing future problems.
- Kafka vs. RabbitMQ — If you’re looking for a message broker to handle high throughput and provide access to stream history, Kafka is likely the better choice. If you have complex routing needs and want a built-in GUI to monitor the broker, then RabbitMQ might be best for your application.
Data consulting and implementation services from Dattell provide STRATEGY, ENGINEERING, and PERSPECTIVE to support your organization’s data projects. Our services include custom Data Architecture, Business Analytics, Operational Intelligence, Centralized Reporting, Automation, and Machine Learning. Dattell specializes in Apache Kafka and the Elastic Stack for reliable data collection, storage, and real-time display.