Issues with Apache Kafka performance are directly tied to system optimization and utilization. Here we have compiled best practices for a high-volume, clustered, general use case. Keep in mind that these recommendations are generalized; thorough Kafka monitoring should inform the implementation that best fits your own use case.
Kafka Broker
JAVA SETTINGS
Use the latest release of Java 1.8 with G1GC. JVM settings:
-Xmx6g -Xms6g -XX:MetaspaceSize=96m -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80
Enable JMX for monitoring.
bin/kafka-run-class.sh:
KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname=your.kafka.broker.hostname -Djava.net.preferIPv4Stack=true"
bin/kafka-server-start.sh:
export JMX_PORT=PORT
OS SETTINGS
Once the JVM heap size is determined, leave the rest of the RAM to the OS for page caching. Kafka needs the page cache for writes and reads.
Kafka runs on any Unix system and has been tested on Linux and Solaris. We run the latest version of CentOS for reasons unrelated to Kafka.
File descriptor limits: Kafka uses file descriptors for log segments and open connections. We recommend at least 100,000 allowed file descriptors for the broker processes as a starting point.
Max socket buffer size: The maximum socket buffer size can be increased to enable high-performance data transfer between data centers.
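The file descriptor recommendation above is easy to verify before starting a broker. The following is a minimal sketch, not part of Kafka itself: it assumes a Unix host, and the names fd_limit_ok and check_current_process are hypothetical helpers introduced for illustration. It reports the limit of the process that runs the script, so it should be run as the same user and in the same service environment as the broker.

```python
import resource

# Starting point recommended in the text above; adjust for your cluster.
RECOMMENDED_NOFILE = 100_000


def fd_limit_ok(soft_limit, recommended=RECOMMENDED_NOFILE):
    """Return True if the soft open-file limit meets the recommendation."""
    # An unlimited soft limit trivially satisfies any recommendation.
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= recommended


def check_current_process():
    """Read the open-file limits of the current process (hypothetical helper)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard, fd_limit_ok(soft)


if __name__ == "__main__":
    soft, hard, ok = check_current_process()
    print(f"soft={soft} hard={hard} meets_recommendation={ok}")
```

If the check fails, the limit is usually raised in the service manager or in the operating system's limits configuration rather than in Kafka.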
Disks And File System
Disk and file system usage is where we see people make the most mistakes. Use only one drive or RAID array per partition! If you have multiple partitions per hard drive, use an SSD instead of an HDD. Do not share the drive(s) dedicated to a partition with other applications or the operating system, because other partitions or programs will disrupt sequential reads and writes.
Multiple drives can be configured using log.dirs in server.properties. Kafka assigns partitions in round-robin fashion to the log.dirs directories. Create alerts based on disk usage for each of your Kafka-dedicated drives.
We use RAID mainly because of the automatic recovery feature. Keep in mind that while a RAID array is rebuilding, the Kafka node will act as though it is down, because the disks are dedicated to the rebuild.
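A disk usage alert for the log.dirs directories can be sketched as follows. This is an illustration under stated assumptions, not a monitoring product: the 80% threshold, the /data/kafka1 and /data/kafka2 paths, and the function names are all hypothetical, and a real deployment would feed the result into whatever alerting system is already in place.

```python
import shutil


def parse_log_dirs(value):
    """Split a comma-separated log.dirs value into a list of paths."""
    return [path.strip() for path in value.split(",") if path.strip()]


def dirs_over_threshold(log_dirs, threshold=0.80, disk_usage=shutil.disk_usage):
    """Return (path, used_fraction) for each directory at or above threshold.

    disk_usage is injectable so the logic can be exercised without real disks;
    it must return an object with .total and .used attributes.
    """
    alerts = []
    for path in log_dirs:
        usage = disk_usage(path)
        fraction = usage.used / usage.total
        if fraction >= threshold:
            alerts.append((path, fraction))
    return alerts


if __name__ == "__main__":
    # Hypothetical log.dirs value; replace with the one in server.properties.
    dirs = parse_log_dirs("/data/kafka1, /data/kafka2")
    for path, fraction in dirs_over_threshold(dirs):
        print(f"ALERT: {path} is {fraction:.0%} full")
```

Checking each directory separately matters here because every drive holds different partitions, so one drive can fill up while the others still have room.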
Log Flush Management
Use the default flush settings, which disable application fsync entirely. Done.
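One way to confirm a broker still relies on the default flush behavior is to check that server.properties does not override the flush settings. The sketch below looks for the broker settings log.flush.interval.messages and log.flush.interval.ms; the function name find_flush_overrides is a hypothetical helper, and the parser is deliberately simplified to plain key=value lines.

```python
# Broker settings that, when set, enable application-level fsync.
FLUSH_KEYS = ("log.flush.interval.messages", "log.flush.interval.ms")


def find_flush_overrides(properties_text):
    """Return a dict of flush settings explicitly set in the given text.

    Simplified parser: handles only `key=value` lines and skips comments;
    it does not cover every form the Java properties format allows.
    """
    overrides = {}
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or line.startswith("!"):
            continue  # blank line or comment
        key, separator, value = line.partition("=")
        if separator and key.strip() in FLUSH_KEYS:
            overrides[key.strip()] = value.strip()
    return overrides


if __name__ == "__main__":
    sample = "broker.id=0\n# log.flush.interval.ms=1000\nlog.dirs=/data/kafka1\n"
    print(find_flush_overrides(sample) or "no flush overrides; defaults apply")
```

An empty result means neither setting is overridden in the file, so the defaults described above are in effect.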