Apache Kafka is a distributed system, running in a cluster with each of the nodes referred to as brokers. Kafka topics are partitioned and replicated across the brokers throughout the entirety of the implementation. These partitions allow users to parallelize topics, meaning data for any topic can be divided over multiple brokers. A critical component of Kafka optimization is optimizing the number of partitions in the implementation.
Since a topic can be split into partitions over multiple machines, multiple consumers can read a topic in parallel. This organization sets Kafka up for high message throughput.
In other words, the greater the parallelization the greater the throughput.
However, you don’t necessarily want to use more partitions than needed because increasing partition count simultaneously increases the number of open server files and leads to increased replication latency.
For most implementations you want to follow the rule of thumb of 10 partitions per topic, and 10,000 partitions per Kafka cluster. Going beyond that amount can require additional monitoring and optimization.
Calculating Kafka Partition Requirements
Here is the calculation we use to optimize the number of partitions for a Kafka implementation.
# Partitions = Desired Throughput / Partition Speed
Conservatively, you can estimate that a single partition for a single Kafka topic runs at 10 MB/s.
As an example, if your desired throughput is 5 TB per day. That figure comes out to about 58 MB/s. Using the estimate of 10 MB/s per partition, this example implementation would require 6 partitions.
For the example above, the number of partitions is set using the following code:
bin/kafka-topics.sh –zookeeper ip_addr_of_zookeeper:2181 –create –topic my-topic –partitions 6 –replication-factor 3 –config max.message.bytes=64000 –config flush.messages=1
The replication factor is set to 3 as a default. While partitions reflect horizontal scaling of unique information, replication factors refer to backups. For a replication factor of 3 in the example above, there are 18 partitions in total with 6 partitions being the originals and then 2 copies of each of those unique partitions.
As with all collected data, you want to ensure information is not lost if there is a failure. Creating replicated partitions is an important component to preventing data loss.
Starting with your partition estimate, it is best then to test partition throughput. Setting up Kafka monitoring will enable you to easily run these tests. Our post on Kafka monitoring with Elasticsearch and Kibana is a good place to start.
Have Kafka Questions?
We offer fully managed Kafka with top performance and exceptional support.
We offer a number of support services including RSA, workshops, and training.