
Avoid duplication of Kafka producer messages

I am using KafkaTemplate from Spring Boot, with Java 8.

My main aim is that consumer should not consume the message twice.

1) Query a table to get 100 rows and send them to Kafka

2) Suppose I process 70 rows (I get a success ack) and then Kafka goes down (and does not recover within the retry mechanism's timing)

So when I restart the Spring Boot app, how can I make sure those 70 messages aren't sent again?

One option is to have a flag in the DB table, is_sent = Y or N.

Is there any other efficient way?
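The is_sent flag idea from the question might look roughly like this (a sketch only: `Row`, the column mapping, and the `send` callback are stand-ins for the real table and `KafkaTemplate`):

```java
import java.util.*;
import java.util.function.Consumer;

// Sketch of the is_sent = Y/N approach: only unsent rows are published,
// and a row is marked sent only after the broker acknowledges it, so a
// restart skips the rows that were already acked before the crash.
class FlaggedSender {
    static class Row {
        final long id;
        final String payload;
        boolean isSent;                     // maps to the is_sent column
        Row(long id, String payload) { this.id = id; this.payload = payload; }
    }

    // Sends every unsent row; returns how many were newly published.
    static int sendPending(List<Row> rows, Consumer<String> send) {
        int published = 0;
        for (Row row : rows) {
            if (row.isSent) continue;       // already acked before a restart
            send.accept(row.payload);       // real code: kafkaTemplate.send(...).get()
            row.isSent = true;              // real code: UPDATE ... SET is_sent = 'Y'
            published++;
        }
        return published;
    }
}
```

Note the remaining gap: if the app crashes between the ack and the flag update, that one row can still be resent, so the consumer should still be prepared to deduplicate.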

I would use a JDBC source connector with Kafka Connect (depending on what database you are currently using), which handles this scenario properly.
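A minimal connector config might look like this (a sketch; the connector name, table, and topic prefix are placeholders for your setup):

```properties
name=my-db-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:postgresql://localhost:5432/mydb
# incrementing mode tracks a strictly increasing id column as its offset,
# so already-emitted rows are not re-sent after a connector restart
mode=incrementing
incrementing.column.name=id
table.whitelist=my_table
topic.prefix=db-
```

Because Kafka Connect persists the incrementing offset itself, you don't need an is_sent column at all for this pattern.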


In case you still want to write your own producer, this section of Kafka FAQ should be useful:

How do I get exactly-once messaging from Kafka?

Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.

There are two approaches to getting exactly once semantics during data production:

  1. Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded
  2. Include a primary key (UUID or something) in the message and deduplicate on the consumer.

If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
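The topic/partition/offset deduplication described above can be sketched like this (an in-memory set stands in for the store that would normally hold the offsets alongside the loaded data):

```java
import java.util.*;

// Deduplicates consumed records by their (topic, partition, offset) identity.
// In a real loader this set would live in the target store itself, written in
// the same transaction as the data, so a replayed record is recognised and skipped.
class OffsetDeduplicator {
    private final Set<String> seen = new HashSet<>();

    // Returns true if the record is new and should be processed,
    // false if this exact record was already loaded.
    boolean shouldProcess(String topic, int partition, long offset) {
        return seen.add(topic + "-" + partition + "-" + offset);
    }
}
```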

I think there are two improvements that would make this a lot easier:

  1. Producer idempotence could be done automatically and much more cheaply by optionally integrating support for this on the server.
  2. The existing high-level consumer doesn't expose a lot of the more fine grained control of offsets (eg to reset your position). We will be working on that soon
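That FAQ entry is dated: the first improvement has since shipped. Kafka 0.11+ supports an idempotent producer, which broker-side deduplicates producer retries, and Spring Boot can enable it through configuration (a sketch, using the producer pass-through properties):

```properties
# Require full acknowledgement and let the broker dedupe producer retries
spring.kafka.producer.acks=all
spring.kafka.producer.properties.enable.idempotence=true
```

Note this only removes duplicates caused by producer retries within a session; it does not help if your application re-reads the 100 rows and resends them after a restart, which still needs one of the approaches above.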

For Kafka, I have seen implementations that store a pointer to the id to keep track of where you are in the topic, using some sort of distributed storage to track this at a cluster level. I haven't done much work there, so I will describe a solution we used with SQS for duplicate detection. Kafka likely has a better solution than this one for deduplication; I just want to add it so that you can look at alternate approaches as well.

I had the same problem while working with AWS SQS for point-to-point messaging use cases, as it provides an at-least-once delivery guarantee rather than once-and-only-once.

We ended up using Redis with its distributed locking strategy to solve this problem. I have a write-up here: https://angularthinking.blogspot.com/.

The high-level approach is to take a distributed lock and put an entry in the cache with a TTL appropriate for your use case. We use a Lua script to implement a putIfNotExists() method, as shown in the blog above. Scale was one of our concerns, and with the above implementation we were able to process tens of thousands of messages per second in SQS without any problems, and Redis scaled very well. We had to tune the TTL to an optimum value based on throughput and cache growth. In our case the duplication window was 24 hours or less, so relying on Redis for this decision was OK. If you have longer windows, where duplicates could occur across several days or months, the Redis option might not be suitable.
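The putIfNotExists() pattern can be sketched as follows (a ConcurrentHashMap stands in for Redis here; against a real Redis the same check-and-set would be a single atomic `SET key value NX EX <ttl>`, or the Lua script from the blog post):

```java
import java.util.concurrent.ConcurrentHashMap;

// In-memory sketch of the Redis dedup guard: the first caller to claim a
// message id within the TTL window wins and processes the message; later
// claims for the same id are rejected as duplicates. The stored expiry
// timestamp stands in for the Redis key TTL.
class DedupGuard {
    private final ConcurrentHashMap<String, Long> seenUntil = new ConcurrentHashMap<>();

    // Atomic put-if-not-exists: returns true only for the first claim
    // of this id inside the dedup window.
    boolean tryClaim(String messageId, long ttlMillis) {
        long now = System.currentTimeMillis();
        boolean[] claimed = {false};
        seenUntil.compute(messageId, (id, expiry) -> {
            if (expiry != null && expiry > now) return expiry; // in window: duplicate
            claimed[0] = true;                                 // absent or expired: claim
            return now + ttlMillis;
        });
        return claimed[0];
    }
}
```

A consumer would call tryClaim(messageId, ttl) before processing and simply drop the message when it returns false.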

We also looked at DynamoDB to implement putIfNotExists(), but Redis seemed more performant for this use case, especially with its native putIfNotExists implementation using a Lua script.

Good luck with your search.
