
How to Prevent a Kafka Outage

Published August 2023

Apache Kafka is a highly reliable tool when configured correctly for your use case.  It should be the piece of your data architecture that you can be sure will remain online.  

Below are eight best practices to help shore up your Kafka implementation.

8 Tips to Prevent Kafka Downtime

1 – Monitoring.  

Monitoring cluster performance is integral to diagnosing system issues.  It helps both to resolve outages in progress and to catch warning signs before they turn into outages.  We have an article describing how to monitor Kafka using either Elasticsearch or OpenSearch.
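One warning sign that is commonly watched is a partition whose in-sync replica (ISR) set has shrunk below its assigned replica set. The sketch below assumes you have already collected partition metadata into a simple structure. `PartitionState` and `under_replicated` are hypothetical names for illustration, not part of any Kafka client.

```python
from typing import List, NamedTuple


class PartitionState(NamedTuple):
    """Hypothetical snapshot of one partition's replication state."""
    topic: str
    partition: int
    replicas: List[int]  # broker ids assigned to hold this partition
    isr: List[int]       # broker ids currently in sync with the leader


def under_replicated(states: List[PartitionState]) -> List[PartitionState]:
    """Return partitions whose ISR is smaller than the assigned replica set."""
    return [s for s in states if len(s.isr) < len(s.replicas)]


if __name__ == "__main__":
    # Illustrative data only; in practice this would come from your monitoring stack.
    snapshot = [
        PartitionState("orders", 0, [1, 2, 3], [1, 2, 3]),
        PartitionState("orders", 1, [2, 3, 1], [2, 3]),
    ]
    for s in under_replicated(snapshot):
        print(f"{s.topic}-{s.partition}: ISR {s.isr} of replicas {s.replicas}")
```

A non-empty result is usually worth an alert, since it means at least one partition currently has less redundancy than intended.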

2 – Version control. 

Track every Kafka configuration in version control so that changes are reviewable and a known-good state can be restored.  Learn how to check which Kafka version is running.
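Tracked configuration is most useful when it can be compared against what is actually running. As a minimal sketch, assuming the tracked file uses the simple `key=value` properties format and that the running settings have been gathered into a dictionary by some other means, a drift check could look like this. The function names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def config_drift(
    tracked: Dict[str, str], running: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each differing key to (tracked value, running value); None means absent."""
    keys = sorted(set(tracked) | set(running))
    return {
        k: (tracked.get(k), running.get(k))
        for k in keys
        if tracked.get(k) != running.get(k)
    }
```

An empty result means the running settings match the version-controlled file. This parser handles only the simple cases described above, not line continuations or other properties-file features.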

3 – Distribution.  

Distribute your Kafka brokers across separate hardware and infrastructure so that the failure of any single piece does not take down the cluster.
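One way to sanity-check distribution is to confirm that no partition keeps all of its replicas in a single failure domain. The sketch below assumes you can supply a mapping from broker id to a rack or zone label (Kafka brokers can advertise one through the `broker.rack` setting). The function name and input shapes are hypothetical.

```python
from typing import Dict, List


def single_rack_partitions(
    assignments: Dict[str, List[int]],
    broker_rack: Dict[int, str],
) -> List[str]:
    """Return partitions whose replicas all sit in one rack (failure domain)."""
    at_risk = []
    for partition, brokers in assignments.items():
        racks = {broker_rack[b] for b in brokers}
        if len(racks) < 2:
            at_risk.append(partition)
    return at_risk


if __name__ == "__main__":
    # Illustrative layout: brokers 1 and 2 share a rack, broker 3 is elsewhere.
    racks = {1: "rack-a", 2: "rack-a", 3: "rack-b"}
    layout = {"orders-0": [1, 2], "orders-1": [1, 3]}
    print(single_rack_partitions(layout, racks))
```

Any partition this returns would lose every replica at once if its rack failed.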

4 – Partitions.  

Partitions allow users to parallelize topics, meaning data for any topic can be divided over multiple brokers.  A critical part of Kafka optimization is choosing the right number of partitions.  We have an article detailing how to determine how many partitions are best based on your desired throughput.
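As a rough illustration of sizing by throughput, a widely used rule of thumb takes the target throughput and divides it by the measured per-partition throughput of a producer and of a consumer, then uses the larger result. This is a sketch of that rule of thumb, not necessarily the method in the article mentioned above, and the per-partition figures are assumptions you would need to measure on your own cluster.

```python
import math


def estimate_partitions(
    target_mb_s: float, producer_mb_s: float, consumer_mb_s: float
) -> int:
    """Estimate a partition count from throughput figures (all in MB/s).

    producer_mb_s and consumer_mb_s are the measured throughput of a single
    partition on the produce and consume side respectively.
    """
    if min(target_mb_s, producer_mb_s, consumer_mb_s) <= 0:
        raise ValueError("throughput values must be positive")
    needed = max(target_mb_s / producer_mb_s, target_mb_s / consumer_mb_s)
    return math.ceil(needed)
```

For example, a 100 MB/s target with 10 MB/s per partition on the produce side and 20 MB/s on the consume side gives `estimate_partitions(100, 10, 20) == 10`. Treat the result as a starting point to validate under load rather than a final answer.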

5 – Replication. 

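Each partition should have a replication factor of three: one leader and two follower replicas.  If the broker hosting the leader fails, one of the two remaining in-sync replicas becomes the new leader.  Note that you need at least three replicas to properly support a single broker failure; with only two, losing one broker leaves a single copy of the data and no remaining redundancy.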
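The arithmetic behind this can be sketched as follows, assuming producers use `acks=all` and the topic sets `min.insync.replicas`. Under those assumptions a write succeeds only while at least `min.insync.replicas` replicas are in sync, so the number of replica-holding brokers that can be lost without blocking writes is the replication factor minus that minimum. The function name is hypothetical.

```python
def tolerable_broker_failures(
    replication_factor: int, min_insync_replicas: int
) -> int:
    """Brokers holding a partition that can fail while acks=all writes still succeed."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError(
            "min_insync_replicas must be between 1 and replication_factor"
        )
    return replication_factor - min_insync_replicas
```

With three replicas and `min.insync.replicas=2`, `tolerable_broker_failures(3, 2)` is 1. With two replicas and the same minimum it is 0, meaning any single broker failure would block `acks=all` writes to that partition.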

6 – Redundancy.  

If you’re running Kafka on-prem, ensure that there is redundancy of hardware including networking equipment, storage, etc.  

7 – Upgrades. 

Stay on top of upgrades to clusters and client libraries. Each new version of Kafka addresses bugs that are present in older versions. By upgrading you can prevent an outage due to a bug in an older version. 

We recommend that you stay away from the absolute latest version of Apache Kafka unless there is a specific bug fix you need.  As a general rule, we stay about three releases behind to let others test the new releases and features.
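The "stay a few releases behind" rule can be expressed as a small helper. This is only a sketch: the source does not say whether a "release" means a major, minor, or patch version, so the helper simply counts entries in whatever list you give it, and it assumes purely numeric dotted version strings. The function names are hypothetical.

```python
from typing import List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version such as '1.10.0' into a sortable tuple."""
    return tuple(int(part) for part in version.split("."))


def release_behind(releases: List[str], lag: int = 3) -> str:
    """Return the release that sits `lag` entries behind the newest one."""
    ordered = sorted(set(releases), key=parse_version)
    if lag < 0 or len(ordered) <= lag:
        raise ValueError("not enough releases for the requested lag")
    return ordered[-1 - lag]
```

Sorting on parsed tuples rather than raw strings matters because a string sort would place `1.10.0` before `1.2.0`.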

Read about how to check which Kafka version you are running.

8 – Consumer Optimization.  

Improving the performance of consumers aids the performance and reliability of your Kafka cluster.  Rebalancing, exactly-once processing, good network connections, the number of consumers, and message size all affect whether consumers run properly.  Check out our article detailing consumer optimization for more information.
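One common source of avoidable rebalances is a consumer that takes longer to process a batch than `max.poll.interval.ms` allows, at which point it is considered failed and its partitions are reassigned. The sketch below checks whether a worst-case batch fits within that interval. The per-record processing time and the safety margin are assumptions you would measure or choose yourself, and the function name is hypothetical.

```python
def poll_batch_fits(
    max_poll_records: int,
    per_record_ms: float,
    max_poll_interval_ms: int,
    safety_factor: float = 0.5,
) -> bool:
    """Check that a full batch can be processed well inside the poll interval.

    safety_factor is the fraction of max.poll.interval.ms we allow a
    worst-case batch to use (0.5 leaves half the interval as headroom).
    """
    worst_case_ms = max_poll_records * per_record_ms
    return worst_case_ms <= max_poll_interval_ms * safety_factor
```

For instance, 500 records at 100 ms each against a 300000 ms interval passes, while the same batch at 1000 ms per record does not. If the check fails, the usual options are a smaller `max.poll.records`, a longer `max.poll.interval.ms`, or faster per-record processing.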

Applying these recommendations will help to increase Kafka stability.  

If you use Dattell’s managed services for Kafka, our engineers ensure your Kafka implementation is correctly optimized for your use case so you don’t have to worry about identifying issues, running preventative maintenance, and troubleshooting outages.


Published by

Dattell - Kafka & Elasticsearch Support

