Suppose I have a topic called "batch" with 1 partition, and I publish millions of records to it for processing. I have a consumer group of 3 to process those records. Now I encounter a case where I no longer need to process a certain subset of messages that satisfy some criteria, like age < 50.
How do I remove those messages from the topic programmatically? For example, when I click a "Cancel" button in the UI, it should remove the subset of records whose age < 50 from the topic
so that they won't be processed by the consumers.
I know that I can remove messages by running a command-line tool with offsets: https://github.com/apache/kafka/blob/trunk/bin/kafka-delete-records.sh
There is also the Java API, but again it works by offsets:
Delete records whose offset is smaller than the given offset of the corresponding partition
But in my case I cannot use offsets, because I only need to remove certain records, not all records below a given offset.
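For reference, this is roughly what the offsets JSON file passed to kafka-delete-records.sh looks like (the topic name matches my setup; the offset value is just an example). It illustrates the problem: the tool deletes everything in the partition below the given offset, with no way to target individual records.

```json
{
  "version": 1,
  "partitions": [
    { "topic": "batch", "partition": 0, "offset": 1000 }
  ]
}
```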
The main thing I need to point out is that you shouldn't treat data in Kafka the same way as data in a database. Kafka was not designed to work that way (e.g. when I click the X button, the Y records get deleted).
Instead, you should see a topic as a stream of never-ending data. Every record that is produced to a Kafka topic will be consumed and processed independently by the consumer.
Perceiving the topic as a stream gives you a different solution:
You can use a second topic with the filtered results in it!
Streaming Diagram:

                          +---------+      +-----------------------+
    Produced Messages --> | Topic A | ---> |                       |
                          +---------+      | Filtering Application |
                          +---------+      |                       |
    Consumed Messages <-- | Topic B | <--- |                       |
                          +---------+      +-----------------------+
The explanation is quite simple: you produce the messages to topic A. Then you use a Filtering Application which will:
- consume the messages from topic A,
- filter out every record that matches your criteria (age < 50), and
- produce the remaining records to topic B.
Finally, your consumers will receive the messages from topic B.
Now, when it comes to creating the filtering application you have a couple of options: a Kafka Streams application, or a plain consumer that re-produces the filtered records with a producer.
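Here is a minimal in-memory sketch of what the filtering application does. Plain Python lists stand in for the two topics; a real implementation would read topic A with a Kafka consumer (or a Kafka Streams topology) and write the surviving records to topic B with a producer.

```python
# In-memory sketch: lists stand in for the two Kafka topics, and the
# predicate encodes the cancellation criteria from the question.

def filter_topic(topic_a, should_drop):
    """Forward every record from topic A to topic B unless it matches
    the drop criteria."""
    topic_b = []
    for record in topic_a:
        if not should_drop(record):
            topic_b.append(record)
    return topic_b

topic_a = [
    {"id": 1, "age": 30},
    {"id": 2, "age": 65},
    {"id": 3, "age": 49},
    {"id": 4, "age": 50},
]

# Cancellation criteria from the question: drop records whose age < 50.
topic_b = filter_topic(topic_a, lambda r: r["age"] < 50)
print([r["id"] for r in topic_b])  # -> [2, 4]
```

The consumers then subscribe to topic B only, so the dropped records are never delivered to them.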
You can't. Kafka isn't designed to be used like a database; it's actually an immutable commit log. The delete-records tool is intended mostly for administrative tasks.
There is an exception, and that's if you use log compaction. With a compacted topic you can delete the value for a key by publishing a record with that key and a NULL value. Compacted topics are typically used like database commit logs: you read them into some downstream service where they're materialized like a table, and the NULL value resolves into a record delete.
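To make the tombstone semantics concrete, here is a small simulation of what compaction converges to (Kafka performs this cleanup asynchronously in the background; this sketch just computes the end state):

```python
# Simulation of log-compaction semantics: for each key only the latest
# record survives, and a record with a None (NULL) value acts as a
# tombstone that deletes the key entirely.

def compact(log):
    """Return the materialized view of a compacted log: latest value
    per key, with tombstoned (None-valued) keys removed."""
    table = {}
    for key, value in log:
        if value is None:
            table.pop(key, None)   # tombstone: delete the key
        else:
            table[key] = value     # last write wins
    return table

log = [
    ("alice", {"age": 30}),
    ("bob",   {"age": 65}),
    ("alice", {"age": 31}),   # newer value for the same key
    ("bob",   None),          # tombstone: bob is deleted
]

print(compact(log))  # -> {'alice': {'age': 31}}
```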
So in your use case you would materialize your topic into a system optimized for a query like SELECT key FROM table WHERE age < 50, and then publish a record with a NULL value back to the Kafka topic for each matching key. You could even just start a consumer at the beginning of the topic, note which records have age < 50, and do the same thing, but that won't be as efficient.
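The select-and-tombstone step can be sketched like this. A dict stands in for the materialized table, and the query matches the question's cancellation criteria (age < 50); in a real setup the tombstones would be produced back to the compacted Kafka topic.

```python
# Sketch of the tombstone workflow: query the materialized table for
# keys matching the cancellation criteria and emit one NULL-valued
# (tombstone) record per matching key.

materialized = {
    "alice": {"age": 30},
    "bob":   {"age": 65},
    "carol": {"age": 49},
}

# Equivalent of: SELECT key FROM table WHERE age < 50
keys_to_cancel = [k for k, v in materialized.items() if v["age"] < 50]

# One tombstone record (key, None) per matching key, ready to produce.
tombstones = [(key, None) for key in sorted(keys_to_cancel)]
print(tombstones)  # -> [('alice', None), ('carol', None)]
```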