
Kafka: delete records from a topic by a field of the record rather than by offset

Suppose I have a topic called "batch" with 1 partition, and I publish millions of records to it for processing. I have a consumer group of 3 consumers to process those records. Now I have a case where I no longer need to process a certain subset of the messages, namely those that satisfy a criterion like age < 50.

How do I remove those messages from the topic programmatically? For example, when I click a "Cancel" button in the UI, it should remove from the topic the subset of records whose age < 50 so that they won't be processed by the consumers.

I know that I can remove messages by offset using the command-line tool: https://github.com/apache/kafka/blob/trunk/bin/kafka-delete-records.sh

And also with the Java API, but again by offsets:

https://kafka.apache.org/11/javadoc/org/apache/kafka/clients/admin/AdminClient.html#deleteRecords-java.util.Map-org.apache.kafka.clients.admin.DeleteRecordsOptions-

"Delete records whose offset is smaller than the given offset of the corresponding partition."
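
For reference, here is a minimal sketch of that offset-based API in use (the broker address and the cutoff offset 1000 are assumptions made for the example). It truncates the partition at a given offset rather than selecting individual records:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.RecordsToDelete;
import org.apache.kafka.common.TopicPartition;

import java.util.Collections;
import java.util.Properties;

public class DeleteByOffset {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            TopicPartition partition = new TopicPartition("batch", 0);
            // Deletes every record with offset < 1000 in partition 0 of "batch",
            // regardless of content; there is no per-record selection.
            admin.deleteRecords(Collections.singletonMap(
                    partition, RecordsToDelete.beforeOffset(1000L)))
                 .all().get();
        }
    }
}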

But in my case I cannot use offsets, because I only need to remove certain records, not all records below a given offset.

The main thing I need to point out is that you shouldn't treat data in Kafka the same way as data in a database. Kafka was not designed to work that way (e.g. "when I click the X button, the Y records are deleted").

Instead, you should see a topic as a never-ending stream of data. Every record produced to a Kafka topic is consumed and processed independently by the consumers.

Perceiving the topic as a stream gives you a different solution:

You can use a second topic with the filtered results in it!

Streaming Diagram:

                         ____ Topic A ____
-- Produced Messages --> |               |          _______________________
                         |_______________|   -->   |                       |
                                                   | Filtering Application |
                         ____ Topic B ____         |                       |
<-- Consumed Messages -- |               |   <--   |_______________________|
                         |_______________|

The explanation is quite simple: you produce the messages to topic A. Then you use a filtering application which will:

  1. Consume your messages from topic A
  2. Filter them based on some business logic (e.g. age < 50)
  3. Produce the filtered messages to topic B

Finally, your consumers will receive the messages from topic B.

Now, when it comes to creating the filtering application, you have a couple of options:

  1. Implement a basic solution using a consumer and a producer
  2. Use Kafka Streams (see the sketch after this list)
  3. Use KSQL
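
As an illustration of option 2, here is a minimal Kafka Streams sketch. The broker address, the application id "batch-filter", the output topic "batch-filtered", and the "name,age" string value format are all assumptions made for the example; adapt the serdes and the predicate to your actual record format.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class FilteringApplication {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "batch-filter");      // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("batch")
               // Keep only the records that should still be processed,
               // i.e. drop everything with age < 50.
               .filter((key, value) -> parseAge(value) >= 50)
               .to("batch-filtered");

        new KafkaStreams(builder.build(), props).start();
    }

    // Assumes a "name,age" value format purely for illustration.
    private static int parseAge(String value) {
        return Integer.parseInt(value.split(",")[1].trim());
    }
}

With something like this running, your consumer group simply subscribes to "batch-filtered" instead of "batch".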

You can't. Kafka isn't designed to be used like a database; it's an immutable commit log. The delete-records tool is mostly used for administrative tasks.

There is an exception, and that's if you use log compaction. If you have a compacted topic, you can delete the value for a key by publishing a record with that key and a NULL value (a "tombstone"). Compacted topics are typically used like database commit logs: you read them into some downstream service where they are materialized like a table, and the NULL value resolves into a record delete.
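
A minimal sketch of publishing a tombstone, assuming a compacted topic (cleanup.policy=compact), string keys, a broker at localhost:9092, and an illustrative key "user-42":

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class TombstoneExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A null value is a tombstone: once the log cleaner compacts the
            // segment, earlier records with this key are removed from the topic.
            producer.send(new ProducerRecord<>("batch", "user-42", null));
        }
    }
}

Note that the deletion is not immediate; the key only disappears after the log cleaner has compacted the relevant segment.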

So in your use case you would materialize your topic into a system optimized for a query like SELECT key FROM table WHERE age < 50;, then publish a record with a NULL value back to the Kafka topic for each matching key. You could even just start a consumer at the beginning of the topic, note which records have age < 50, and do the same thing, but that won't be as efficient.
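
A rough sketch of that second, less efficient approach, under the same assumptions as above (compacted topic, string keys, "name,age" string values): scan the topic from the beginning and publish a tombstone for every key whose age < 50.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class TombstoneScan {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        consumerProps.put("group.id", "tombstone-scan");          // assumed group id
        consumerProps.put("auto.offset.reset", "earliest");       // read from the beginning
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(Collections.singletonList("batch"));
            while (true) { // run until the scan is done; stopping logic omitted for brevity
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    if (record.value() == null) continue; // skip tombstones we already wrote
                    int age = Integer.parseInt(record.value().split(",")[1].trim());
                    if (age < 50) {
                        // Tombstone the key so compaction removes the record.
                        producer.send(new ProducerRecord<>("batch", record.key(), null));
                    }
                }
            }
        }
    }
}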
