One aspect of Kafka that can cause some confusion for new users is the consumer offset. In this post, we define consumer offset and outline the factors that determine the offset.
Defining Kafka Consumer Offset
The consumer offset tracks a consumer's position in each partition of a topic: the sequential ID of the next message to be read. Keeping track of the offset, or position, is important for nearly all Kafka use cases and can be mission critical in certain instances, such as financial services.
The Kafka consumer offset allows processing to continue from where it last left off if the stream application is shut down or fails unexpectedly. In other words, because the offsets persist in a data store (the internal __consumer_offsets topic in modern Kafka, or ZooKeeper in older setups), data continuity is retained even when the stream application shuts down or fails.
As discussed in a previous post, the offset is one of three coordinates that together uniquely identify a message. First there is the topic, then the partitions within that topic, and finally the offset, which gives a message's position within its partition. Side note: if you're curious about how to optimize the number of partitions, check out this easy formula.
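As a minimal sketch, the three coordinates can be modeled as a small value type. The topic name and numbers below are made up for illustration; this is not Kafka client code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MessageCoordinate:
    """Uniquely identifies one message in a Kafka cluster."""
    topic: str       # which topic the message was published to
    partition: int   # which partition within that topic
    offset: int      # sequential position within that partition

# Hypothetical example: the 129th message (offset 128) in partition 3
loc = MessageCoordinate(topic="payments", partition=3, offset=128)
print(loc)
```

Note that an offset alone is meaningless without its topic and partition: offset 128 exists independently in every partition.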
Did you know that Dattell offers Kafka as a Service?
Dattell’s Kafka as a Service is a fully managed, high-throughput messaging system built on your cloud instances or on-premises servers, providing enhanced security, reduced latency, and cost effectiveness.
Determining Kafka Consumer Offset
New Consumer Groups
Initially, when a Kafka consumer starts for a new topic, the offset begins at zero (0). Easy enough.
On the other hand, if a new consumer group is started on an existing topic, there is no stored offset for that group. In this scenario, the consumer will begin either at the beginning of the topic (the smallest available offset) or at the end of the topic (the largest offset), depending on the consumer's auto.offset.reset setting (earliest or latest).
Whether you start at the beginning or end of a topic is determined by your use case. If you start the offset at the beginning of a topic, then you will be replaying data. This approach is good for building out a new server and populating it with data, or for doing load testing on a Kafka cluster. If your needs don’t require any of those functions, then you likely will want to start at the end of the topic.
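The decision described above can be sketched as a toy model in plain Python. To be clear, auto.offset.reset is the real consumer setting this mirrors, but the function itself is illustrative, not the client API.

```python
def starting_offset(committed, earliest, latest, reset_policy):
    """Pick where a consumer begins reading one partition.

    committed:    the group's stored offset, or None if the group is new
    earliest:     smallest offset still present in the partition log
    latest:       offset the next produced message will receive
    reset_policy: 'earliest' or 'latest', mirroring auto.offset.reset
    """
    if committed is not None and earliest <= committed <= latest:
        return committed              # resume where the group left off
    if reset_policy == "earliest":
        return earliest               # replay from the beginning
    return latest                     # only read newly arriving messages

# A brand-new group with no stored offset, on a partition holding
# offsets 0 through 499:
print(starting_offset(None, 0, 500, "earliest"))  # 0   -> replays everything
print(starting_offset(None, 0, 500, "latest"))    # 500 -> waits for new data
```

An existing group with a valid committed offset ignores the reset policy entirely; the policy only matters when no usable offset is stored.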
Existing Consumer Groups
What about existing consumer groups? Let's say a consumer group consumes 12 messages before failing. When the consumer starts up again, it resumes from the committed offset (or position), because that offset is stored by Kafka (or ZooKeeper in older setups).
If you are ever curious about the current offset, you can run the kafka-consumer-groups tool (kafka-consumer-groups.sh with the --describe flag). It reports both the committed offsets and the lag of consumers for the various topics and partitions. Keep in mind that the consumer has to be active when you run this command to see its current offset.
Log Retention’s Impact on Offset
Log retention times can also impact consumer offset. Let’s consider an example where the log retention is set to three (3) days. What would happen if 32 messages were received over a couple of hours, and then four (4) days go by before the next message is received? Where would the offset begin?
The answer depends on the offset retention period, which is separate from log retention. The default retention period for committed offsets in Kafka is one week (7 days) as of Kafka 2.0; it was one day in earlier versions. With the default in place, the stored offset in the example above would still exist, and consumption would resume at offset 32.
If the amount of time passed was two weeks (14 days) instead, the stored offset would have expired after one week (7 days), and the consumer would fall back to its auto.offset.reset policy, typically starting from the latest offset.
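The scenario above can be sketched with the same kind of toy model. The day counts and offsets come from the example; the function itself is illustrative, not Kafka code.

```python
def resume_offset(committed, days_idle, offset_retention_days,
                  reset_policy, earliest, latest):
    """Where a consumer resumes after sitting idle for days_idle days."""
    if days_idle <= offset_retention_days:
        return committed              # stored offset is still valid
    # Offset has expired: fall back to the auto.offset.reset policy.
    return earliest if reset_policy == "earliest" else latest

# 32 messages consumed, then 4 idle days (within the 7-day default):
print(resume_offset(32, 4, 7, "latest", earliest=0, latest=32))   # 32
# After 14 idle days the stored offset has expired; suppose 18 more
# messages arrived in the meantime, so the log now ends at offset 50:
print(resume_offset(32, 14, 7, "latest", earliest=0, latest=50))  # 50
```

In the second case the 18 messages at offsets 32 through 49 are silently skipped, which is exactly why the offset retention period matters for infrequently running consumers.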
The finite offset retention period exists to keep the internal offsets store from growing without bound as consumer groups come and go. However, it isn't set in stone: if you want to extend the retention beyond a week, set offsets.retention.minutes in the broker configuration.
Still have questions about Kafka? Connect with one of our Kafka engineers.