Updated August 2023
With many teams already familiar with Kubernetes, it can sometimes be the best choice to spin up Kafka servers on Kubernetes alongside their other applications.
Kafka on Kubernetes presents some challenges though. In this post we will introduce the most common issues we see to help you get an idea of what’s ahead on your Kafka on Kubernetes journey.
Should I Run Kafka on Kubernetes?
Let’s start with discussing the reasons why teams are choosing to deploy Kafka on Kubernetes.
It’s usually because of one of the following:
- There is internal expertise with K8s and little appetite or bandwidth for learning a new configuration management solution,
- The company has already chosen to host both stateless and stateful applications within K8s, or
- The reduced reliability of Kafka running within Kubernetes is outweighed by the convenience of Kubernetes.
These are three important and justifiable reasons.
However, from a peak performance and reliable management perspective, it’s better to deploy Kafka outside of Kubernetes.
This limitation might change over time with projects like Virtlet that will pseudo-containerize the file system cache which normally lives outside of the K8s pod/container.
In the next section we will be covering some of the conflicts that arise from running Kafka on Kubernetes. We are sharing these conflicts for two reasons:
- Firstly, if your team is in a situation where you can just as easily spin up Kafka on a standalone cluster versus within Kubernetes, then you’ll have more information to help you make the decision.
- Secondly, for those teams who have decided that deploying Kafka on K8s is the best path forward, you’ll have an idea of what issues you’ll face and the timeline for fixing them and verifying the fixes.
One last point before we move on. Some teams prefer to run Kafka on K8s because they are running Kafka on-prem and need help with managing it. These teams might be unaware that companies like ours offer fully managed Kafka for on-prem deployments. With that kind of solution there is no concern about automating upgrades and managing the pain points of containerizing a stateful application because an expert team is managing and monitoring the deployment 24/7. Food for thought.
Okay, onto the conflicts.
Conceptual Conflicts With Running Kafka on Kubernetes
First up, let’s define the primary uses for Kafka and Kubernetes.
Kafka is used to reliably deliver messages. Reliably is the key word here. It’s Kafka’s stability, high throughput, and exactly-once delivery semantics that teams rely upon.
Kubernetes is used to orchestrate infrastructure.
With their primary purposes in mind, let’s consider the conceptual conflicts with running Kafka on Kubernetes.
Kafka is stateful, and K8s is designed for stateless applications.
Stateful applications, such as Kafka, contain past knowledge and reference previous interactions. Because a stateful application depends on this stored state, requests have to be routed to the same servers every time. Running stateful applications on Kubernetes requires turning off many of Kubernetes’ features.
Disabling K8s features such as load balancing to keep Kafka from breaking.
Kubernetes improves development time for new applications by providing things like cluster management and load balancing. However, Kafka already includes those features.
The K8s automations are both redundant and a liability.
If Kubernetes performs any of the cluster management or load balancing, it interferes with Kafka’s own mechanisms and can break Kafka.
Fortunately, these default Kubernetes features can be turned off.
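Let’s take a minute to further explain the issue. Load balancing is problematic for Kafka because clients need to reach specific brokers: each partition has a leader, and producers and consumers connect directly to the broker that leads the partition they are working with. A single load balancer in front of all Kafka nodes would send requests to arbitrary brokers, and Kafka breaks. The workaround is to create one load balancer for each node.
Additionally, load balancers and health checks make exactly-once messaging trickier to guarantee.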
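To make the one-load-balancer-per-node workaround concrete, here is a minimal Python sketch that generates one Kubernetes Service manifest per broker. It assumes the brokers run in a StatefulSet, so each pod carries the `statefulset.kubernetes.io/pod-name` label that Kubernetes adds to StatefulSet pods. The cluster name `kafka`, the `-external` suffix, and port 9094 are placeholders we picked for illustration, and each broker would still need to advertise its own external address in its listener configuration.

```python
import json


def per_broker_services(cluster: str, brokers: int, port: int = 9094) -> list:
    """Build one LoadBalancer Service manifest per broker pod.

    Assumes the brokers run in a StatefulSet named `cluster`, so the pods are
    called <cluster>-0, <cluster>-1, ... (hypothetical naming for this sketch).
    """
    services = []
    for i in range(brokers):
        pod = f"{cluster}-{i}"
        services.append({
            "apiVersion": "v1",
            "kind": "Service",
            "metadata": {"name": f"{pod}-external"},
            "spec": {
                "type": "LoadBalancer",
                # Select exactly one pod so traffic is never spread across brokers.
                "selector": {"statefulset.kubernetes.io/pod-name": pod},
                "ports": [{"name": "kafka", "port": port, "targetPort": port}],
            },
        })
    return services


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    for svc in per_broker_services("kafka", 3):
        print(json.dumps(svc, indent=2))
```

The point of the design is the selector: because each Service matches a single pod, a client that is told to talk to broker 1 always lands on broker 1.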
Kafka can’t exist solely within a container.
K8s uses containers. Containers don’t virtualize the operating system kernel. Instead, containers use the operating system of their host/node. This allows containers to use fewer resources than virtual machines (VMs), which each have their own operating system.
Kafka relies on the file system cache (the page cache) to serve message data, and that cache is managed by the host’s operating system kernel rather than by the container. In other words, Kafka doesn’t solely exist inside of a container.
While the characteristics of containers being lightweight and easily movable can be great for some applications, they don’t provide a benefit to Kafka. This is because you wouldn’t want to move Kafka and because Kafka depends on the host operating system’s file system cache.
There is a project called Virtlet that is working on making the file system part of the container, but it’s still in early stages. There are also other projects that are testing out virtualizing parts of the kernel inside containers, such as only the file system cache. Virtualizing only the parts of the kernel that Kafka uses would be ideal.
Rolling restart issues.
Carrying out a rolling restart of a high-traffic, high-volume Kafka cluster on Kubernetes is more difficult because other apps running on the K8s node could also be using the file system cache.
It can then be difficult to know when the restarted Kafka broker has warmed up its local cache, and the restart should proceed onto the next broker.
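One way to picture the warm-up problem is as a gate between restarts: only move on once the restarted broker’s cache hit ratio has stayed high for several consecutive readings. The sketch below is a simplified illustration, not a production procedure. The `restart` and `read_hit_ratio_for` callables are hypothetical hooks you would have to supply, the 0.95 threshold is an arbitrary assumption, and on a shared node the reading reflects every application using the cache, which is exactly why the signal is hard to trust. In practice you would likely also wait for under-replicated partitions to return to zero.

```python
import time
from typing import Callable, Iterable, List


def wait_until_warm(
    read_hit_ratio: Callable[[], float],
    threshold: float = 0.95,
    stable_samples: int = 3,
    interval_s: float = 10.0,
    timeout_s: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the ratio is >= threshold for `stable_samples`
    consecutive readings; return False if `timeout_s` elapses first."""
    streak = 0
    waited = 0.0
    while waited <= timeout_s:
        if read_hit_ratio() >= threshold:
            streak += 1
            if streak >= stable_samples:
                return True
        else:
            streak = 0  # a single dip resets the streak
        sleep(interval_s)
        waited += interval_s
    return False


def rolling_restart(
    brokers: Iterable[str],
    restart: Callable[[str], None],
    read_hit_ratio_for: Callable[[str], float],
    **wait_kwargs,
) -> List[str]:
    """Restart brokers one at a time and stop at the first one that does not
    warm up in time. Returns the brokers that completed successfully."""
    done = []
    for broker in brokers:
        restart(broker)
        if not wait_until_warm(lambda: read_hit_ratio_for(broker), **wait_kwargs):
            break  # leave the remaining brokers untouched
        done.append(broker)
    return done
```

Requiring several consecutive good readings, rather than one, guards against a momentary spike being mistaken for a warm cache.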
Other apps running on K8s are a liability to Kafka’s stability.
Because Kafka doesn’t exist solely in its container, running Kafka within Kubernetes makes the other applications running within K8s a liability to Kafka’s stability.
We aren’t touching on every single problem here, but this gives you an idea of the types of conflicts that exist when running Kafka on Kubernetes. With the exception of the file system cache issue, all of the other issues can be addressed with a good amount of effort and expertise.
From our experience, it takes months to implement fixes and verify that they are working as expected.
Strimzi Operator and Confluent for Kubernetes
Operators, such as Strimzi, can help with turning off some of the functionality in Kubernetes that is incompatible with Kafka. Before we close out the article, we want to walk you through operators in a little more detail.
Operators are application-specific software extensions for deploying and managing an application in Kubernetes.
Users typically provide basic cluster information such as the number of brokers, CPU and RAM limits, storage size, authentication and encryption information, and other high level configurations. The operator then takes on the tasks of deploying, managing, and updating Kubernetes resources as needed to keep Kafka running.
Operators also provide automated solutions to *some of* the problems that users would run into if setting up Kafka in Kubernetes manually. Operators can also assist with creating certificates automatically, which makes updates a little less painful.
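As a rough illustration of how little the user has to specify, the sketch below builds a cluster definition as a Python dictionary and prints it as JSON. The field names are modeled on Strimzi’s `Kafka` custom resource (`kafka.strimzi.io/v1beta2`) for ZooKeeper-based releases, but treat them as assumptions and check them against the Strimzi documentation for the version you run. The cluster name, sizes, listener port, and ZooKeeper settings are made-up example values.

```python
import json


def kafka_cluster_manifest(name: str, brokers: int, cpu: str,
                           memory: str, storage_size: str) -> dict:
    """Build a Strimzi-style Kafka custom resource (assumed v1beta2 layout)."""
    resources = {
        "requests": {"cpu": cpu, "memory": memory},
        "limits": {"cpu": cpu, "memory": memory},
    }
    return {
        "apiVersion": "kafka.strimzi.io/v1beta2",
        "kind": "Kafka",
        "metadata": {"name": name},
        "spec": {
            "kafka": {
                "replicas": brokers,  # number of brokers
                "listeners": [{
                    "name": "tls",
                    "port": 9093,
                    "type": "internal",
                    "tls": True,                      # encryption
                    "authentication": {"type": "tls"},  # client authentication
                }],
                "resources": resources,  # CPU and RAM requests/limits
                "storage": {"type": "persistent-claim", "size": storage_size},
            },
            # Example ZooKeeper settings; values are placeholders.
            "zookeeper": {
                "replicas": 3,
                "storage": {"type": "persistent-claim", "size": "10Gi"},
            },
        },
    }


if __name__ == "__main__":
    manifest = kafka_cluster_manifest("my-cluster", 3, "2", "8Gi", "500Gi")
    print(json.dumps(manifest, indent=2))
```

Everything else, such as the StatefulSets or pods, Services, and certificates, is what the operator derives from a definition like this.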
However, operators don’t help with things like configuring Kafka to meet your use case, capacity planning, and configuring/implementing Kafka producers and consumers.
Additionally, operators lack monitoring of critical metrics like the file system cache hit/miss ratio.
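For readers unfamiliar with the metric, the hit/miss ratio is simply the share of reads served from the cache over a sampling window. A minimal sketch of the calculation follows, assuming you already have cumulative hit and miss counters from some node-level source (for example an eBPF-based tool). The `CacheSample` type and how it gets populated are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CacheSample:
    """Cumulative page cache counters at one point in time (hypothetical source)."""
    hits: int
    misses: int


def hit_ratio(prev: CacheSample, curr: CacheSample) -> Optional[float]:
    """Hit ratio over the window between two samples.

    Returns None when there was no activity or when a counter went backwards
    (for example after a node reboot), since no ratio can be computed then.
    """
    hits = curr.hits - prev.hits
    misses = curr.misses - prev.misses
    total = hits + misses
    if hits < 0 or misses < 0 or total == 0:
        return None
    return hits / total
```

For a Kafka broker whose consumers keep up with producers, we would expect this ratio to stay high. A sustained drop suggests reads are going to disk, which can happen when other workloads on the node evict Kafka’s data from the cache.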
There are a number of operators available: Red Hat’s Strimzi, Cisco’s Koperator (formerly the Banzai Cloud Kafka operator), and Confluent’s “Confluent for Kubernetes”.
Confluent for Kubernetes (CFK) was made available in spring of 2021 and allows users to deploy Confluent’s platform on Kubernetes in on-premises environments. As with regular Confluent Kafka, users are limited by their licensing and applicable fees. Keep that in mind if your team is deciding whether to use Apache Kafka or Confluent Kafka.
We go into more depth about the differences between Apache Kafka and Confluent Kafka in an earlier blog post on Apache Kafka vs Confluent Kafka. If you are already using Confluent Kafka, then using CFK is likely the best option because it’s the only operator that supports Confluent’s enterprise products.
Strimzi is arguably the most popular operator. It’s an open source project under the Apache 2.0 License and, like Kubernetes, it’s part of the Cloud Native Computing Foundation (CNCF). Strimzi supports both Kafka and ZooKeeper, and it additionally supports MirrorMaker, Kafka Connect clusters, Cruise Control, and Strimzi Bridge. Strimzi, like Kafka, is written in Java.
A more in-depth discussion of Kubernetes operators can be found in the operator pattern documentation on the K8s website.
Help With Deploying Kafka on K8s
Calling on help from outside the organization could be the right choice for your project. In addition to making Kafka run reliably and efficiently, we also typically save our clients more money than our fees cost.
The primary ways we help clients save money are preventing the over-purchase of hardware, making more efficient use of their hardware, and reducing time to project completion.
Reach out if you think your project could benefit from an outside team that focuses on deploying, optimizing, and managing Kafka on both standalone clusters and Kubernetes.
Have Kafka Questions?
Managed Kafka on your environment with 24/7 support.
Consulting support to implement, troubleshoot, and optimize Kafka.