
Kafka: How to delete records from a topic using Java API?

I'm looking for a way to delete (completely remove) consumed records from a Kafka topic. I know there are several ways of doing this, for example by changing the retention time for the topic or removing the Kafka-logs folder. But what I'm looking for is a way to delete a certain number of records from a topic using the Java API, if that is possible.

I've tried testing the AdminClient API, specifically the adminClient.deleteRecords(recordsToDelete) method. But if I'm not mistaken, that method only changes the offsets in the topic; it doesn't actually delete said records from the hard drive.
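For context, a minimal sketch of how such a recordsToDelete map is built (the topic name "my-topic", partition 0 and offset 50 are placeholder values):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.admin.RecordsToDelete;
import org.apache.kafka.common.TopicPartition;

// Placeholder topic/partition: request deletion of every record before offset 50
Map<TopicPartition, RecordsToDelete> recordsToDelete = new HashMap<>();
recordsToDelete.put(new TopicPartition("my-topic", 0), RecordsToDelete.beforeOffset(50L));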

Is there a Java API that does actually remove the records from the hard drive?

Kafka topics are immutable, meaning you can only add new messages to them. There is no delete per se.

However, to avoid "running out of disk", Kafka provides two concepts for keeping the size of topics down: retention policy and compaction.

Retention: If you have a topic where you don't need the data around forever, you just set a retention policy of however long you need to keep the data, e.g. 72 hours. Kafka will then automatically delete messages older than 72 hours for you.
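For example, a 72-hour retention can be set per topic through the AdminClient. A sketch, assuming an existing adminClient instance and a client library with incrementalAlterConfigs (available since Kafka 2.3); the topic name is a placeholder:

import java.util.Collections;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
// 72 hours in milliseconds; older messages become eligible for deletion
AlterConfigOp setRetention = new AlterConfigOp(
        new ConfigEntry("retention.ms", "259200000"), AlterConfigOp.OpType.SET);
adminClient.incrementalAlterConfigs(
        Collections.singletonMap(topic, Collections.singletonList(setRetention))).all().get();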

Compaction: If you DO need data to stay around forever, or at least for a long time, but you only need the latest value per key, then you can set the topic to be compacted. Kafka will then automatically remove older messages with a given key once a newer message with the same key has been written and the log cleaner compacts the partition (so the removal is not immediate).
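A sketch of creating a compacted topic via the AdminClient (the topic name, partition count and replication factor are placeholders, and adminClient is assumed to exist):

import java.util.Collections;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

// cleanup.policy=compact keeps only the latest record per key
NewTopic compacted = new NewTopic("customer-last-login", 3, (short) 1)
        .configs(Collections.singletonMap(
                TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));
adminClient.createTopics(Collections.singletonList(compacted)).all().get();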

A central part of planning your Kafka architecture is to think through HOW your data is stored in a topic. If, for example, you push updates to a customer record into a Kafka topic, let's say that customer's last login date (very contrived example...), then you're only interested in the LAST entry (since all previous entries are no longer the "last" login). If the partition key for this was the customer ID, and log compaction was enabled, then once the user logs in and the Kafka topic receives this event, any previous message with the same partition key (customer ID) would eventually be removed from the topic when the log cleaner runs.
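To illustrate, a minimal producer sketch for that contrived example, keying each login event by customer ID so that compaction keeps only the latest one (the broker address, topic name and values are assumptions):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    // Key = customer ID; compaction eventually drops older records with the same key
    producer.send(new ProducerRecord<>("customer-last-login", "customer-42", "2024-05-01T10:15:00Z"));
}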

This confused me a bit at first too: the bundled bin/kafka-delete-records.sh was able to delete records, but I couldn't do it using the Java API.

The missing piece is that you need to call KafkaFuture.get(), since deleteRecords() returns a map of futures.

Here's the code; the important part is the call to entry.getValue().get().lowWatermark():

import java.util.Map;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.DeleteRecordsResult;
import org.apache.kafka.clients.admin.DeletedRecords;
import org.apache.kafka.common.KafkaFuture;
import org.apache.kafka.common.TopicPartition;

DeleteRecordsResult result = adminClient.deleteRecords(recordsToDelete);
Map<TopicPartition, KafkaFuture<DeletedRecords>> lowWatermarks = result.lowWatermarks();
try {
    for (Map.Entry<TopicPartition, KafkaFuture<DeletedRecords>> entry : lowWatermarks.entrySet()) {
        // get() blocks until the broker has processed the request and surfaces any error
        System.out.println(entry.getKey().topic() + " " + entry.getKey().partition()
                + " " + entry.getValue().get().lowWatermark());
    }
} catch (InterruptedException | ExecutionException e) {
    e.printStackTrace();
}
adminClient.close();

I am using Kafka 2.1.1 on Red Hat 7.6 and the call to AdminClient.deleteRecords() did effectively remove the files from the corresponding folder in /tmp/kafka-logs. The only file left is leader-epoch-checkpoint and inside it there is information about the last record offset: 96 in my case.

Note that in the call to AdminClient.deleteRecords() you should not pass an offset that is greater than the partition's existing high watermark. If you do, the call fails with "org.apache.kafka.common.errors.OffsetOutOfRangeException: The requested offset is not within the range of offsets maintained by the server.", but you will not know it until you check the result via Future.get(); see the answer from Trix for details.
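One way to guard against that is to query the partition's current end offset first and cap the requested delete offset at it. A sketch, assuming an existing adminClient and a client library with listOffsets (available in the AdminClient since Kafka 2.5); topic, partition and requested offset are placeholders:

import java.util.Collections;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.common.TopicPartition;

TopicPartition tp = new TopicPartition("my-topic", 0);
// OffsetSpec.latest() resolves to the partition's current end offset
ListOffsetsResult offsets = adminClient.listOffsets(Collections.singletonMap(tp, OffsetSpec.latest()));
long endOffset = offsets.partitionResult(tp).get().offset();
long requestedOffset = 1000L; // placeholder
long safeOffset = Math.min(requestedOffset, endOffset);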

Kafka doesn't support removing arbitrary records from topics. It works by building up a log of messages that grows as messages are pushed to it, while each client that reads those messages only holds an offset into that log. Clients are therefore essentially in "read-only" mode and can't alter the log. Think about the case where several different clients (different consumer groups) read the same topic and each saves its own offset: what would happen if someone started deleting messages at the positions those offsets point to?
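For illustration, a sketch of one such consumer group reading the topic; offsets are tracked per group.id, so a second group with a different ID would keep its own independent position (broker address, group ID and topic name are assumptions):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics-group"); // offsets are stored per group
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("my-topic"));
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
    records.forEach(r -> System.out.println(r.offset() + ": " + r.value()));
}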

No. Kafka does not provide the ability to delete records at a specific offset in a topic, and there is no API for that.

I was able to delete the records. If Kafka runs on a Linux machine, it deletes them from the hard drive. When I searched the internet, I found that there is a bug on Windows, but I could not find a solution to it. So this code works if Kafka is running on a Linux machine.

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.admin.RecordsToDelete;
import org.apache.kafka.common.TopicPartition;

public void deleteMessages(String topicName, int partitionIndex, long beforeOffset) {
    TopicPartition topicPartition = new TopicPartition(topicName, partitionIndex);
    Map<TopicPartition, RecordsToDelete> deleteMap = new HashMap<>();
    // Request deletion of every record in this partition before the given offset
    deleteMap.put(topicPartition, RecordsToDelete.beforeOffset(beforeOffset));
    kafkaAdminClient.deleteRecords(deleteMap);
}
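The snippet assumes an already-constructed kafkaAdminClient; a minimal sketch of creating one (the broker address is an assumption):

import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
AdminClient kafkaAdminClient = AdminClient.create(props);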
