
Avoid duplication of Kafka producer messages

I am using KafkaTemplate from Spring Boot, with Java 8.

My main aim is that the consumer should not consume the same message twice.

1) Query a table to get 100 rows and send them to Kafka.

2) Suppose I process 70 rows (I get a success ack for them) and then Kafka goes down (and does not recover within the RETRY mechanism's timing).

So when I restart the Spring Boot app, how can I make sure those 70 messages aren't sent again?

One option is to keep a flag in the DB table, e.g. is_sent = Y or N.
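
A minimal sketch of that flag-based approach, assuming an OUTBOX table with ID, PAYLOAD and IS_SENT columns and a topic called my-topic (all of these names are illustrative, not taken from the question); the flag is only flipped after Kafka acknowledges the record:

    import java.util.List;
    import java.util.Map;
    import org.springframework.jdbc.core.JdbcTemplate;
    import org.springframework.kafka.core.KafkaTemplate;
    import org.springframework.stereotype.Service;

    @Service
    public class OutboxPublisher {

        private final KafkaTemplate<String, String> kafkaTemplate;
        private final JdbcTemplate jdbcTemplate;

        public OutboxPublisher(KafkaTemplate<String, String> kafkaTemplate, JdbcTemplate jdbcTemplate) {
            this.kafkaTemplate = kafkaTemplate;
            this.jdbcTemplate = jdbcTemplate;
        }

        public void publishPending() {
            // Only pick up rows that have not been acknowledged by Kafka yet,
            // so a restart naturally resumes at row 71 instead of row 1.
            List<Map<String, Object>> rows =
                    jdbcTemplate.queryForList("SELECT ID, PAYLOAD FROM OUTBOX WHERE IS_SENT = 'N'");
            for (Map<String, Object> row : rows) {
                String id = row.get("ID").toString();
                String payload = (String) row.get("PAYLOAD");
                try {
                    // Block until the broker acknowledges the record, then flip the flag.
                    kafkaTemplate.send("my-topic", id, payload).get();
                    jdbcTemplate.update("UPDATE OUTBOX SET IS_SENT = 'Y' WHERE ID = ?", id);
                } catch (Exception e) {
                    // Leave IS_SENT = 'N' so this row is retried on the next run.
                    throw new IllegalStateException("Failed to publish row " + id, e);
                }
            }
        }
    }

Note that this is still only at-least-once: if the app crashes between the acknowledged send and the UPDATE, that single row would be re-sent on restart, which is why the answers below also deduplicate on the consumer side.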

Is there any other, more efficient way?

I would use a JDBC source connector (depending on which database you are currently using) with Kafka Connect, which handles this scenario properly.
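
As a rough illustration, an incrementing-mode configuration for the Confluent JDBC source connector could look something like the following (connection details, table and column names are placeholders; check the connector documentation for the properties supported by your version):

    name=orders-jdbc-source
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    connection.url=jdbc:postgresql://localhost:5432/mydb
    connection.user=user
    connection.password=secret
    table.whitelist=orders
    mode=incrementing
    incrementing.column.name=id
    topic.prefix=db-
    poll.interval.ms=5000

Kafka Connect stores its own source offsets, so after a restart the connector resumes from the last committed position rather than re-reading rows it has already produced.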


In case you still want to write your own producer, this section of the Kafka FAQ should be useful:

How do I get exactly-once messaging from Kafka?

Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.

There are two approaches to getting exactly once semantics during data production:

  1. Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded.
  2. Include a primary key (UUID or something) in the message and deduplicate on the consumer (see the sketch just after this list).
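
(The following is not part of the FAQ.) A minimal Spring Kafka sketch of approach 2, where the producer sets a stable primary key (a UUID or the DB row id) as the record key and the consumer skips keys it has already seen; the in-memory set here is only a stand-in for a durable store such as a DB table with a unique constraint, and the topic name is illustrative:

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.springframework.kafka.annotation.KafkaListener;
    import org.springframework.stereotype.Component;

    @Component
    public class DeduplicatingListener {

        // Illustration only: a real deployment needs this set to be durable and shared
        // (e.g. a DB table with a unique constraint on the key, or Redis).
        private final Set<String> processedKeys = ConcurrentHashMap.newKeySet();

        @KafkaListener(topics = "my-topic")
        public void onMessage(ConsumerRecord<String, String> record) {
            // The producer put a primary key (UUID / row id) into the record key.
            if (!processedKeys.add(record.key())) {
                return; // already processed this key, drop the duplicate
            }
            // ... business logic for record.value() goes here ...
        }
    }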

If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
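
(Again, not part of the FAQ.) A sketch of that last alternative with Spring Kafka and a relational database: the business write and the topic/partition/offset bookkeeping happen in one DB transaction, so a redelivered record is detected and skipped. Table, column and topic names are assumptions:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.springframework.jdbc.core.JdbcTemplate;
    import org.springframework.kafka.annotation.KafkaListener;
    import org.springframework.stereotype.Component;
    import org.springframework.transaction.annotation.Transactional;

    @Component
    public class OffsetTrackingConsumer {

        private final JdbcTemplate jdbcTemplate;

        public OffsetTrackingConsumer(JdbcTemplate jdbcTemplate) {
            this.jdbcTemplate = jdbcTemplate;
        }

        @Transactional
        @KafkaListener(topics = "my-topic")
        public void onMessage(ConsumerRecord<String, String> record) {
            // Has this exact topic/partition/offset been processed before?
            Integer seen = jdbcTemplate.queryForObject(
                    "SELECT COUNT(*) FROM PROCESSED_OFFSETS WHERE TOPIC = ? AND PART = ? AND KAFKA_OFFSET = ?",
                    Integer.class, record.topic(), record.partition(), record.offset());
            if (seen != null && seen > 0) {
                return; // redelivered record, skip it
            }
            // Business write and offset bookkeeping share one transaction:
            // either both are committed or neither is.
            jdbcTemplate.update("INSERT INTO RESULTS (PAYLOAD) VALUES (?)", record.value());
            jdbcTemplate.update(
                    "INSERT INTO PROCESSED_OFFSETS (TOPIC, PART, KAFKA_OFFSET) VALUES (?, ?, ?)",
                    record.topic(), record.partition(), record.offset());
        }
    }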

I think there are two improvements that would make this a lot easier:

  1. Producer idempotence could be done automatically and much more cheaply by optionally integrating support for this on the server (see the note after this list).
  2. The existing high-level consumer doesn't expose a lot of the more fine-grained control of offsets (e.g. to reset your position). We will be working on that soon.
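
Point 1 has since been implemented: recent Kafka producers support idempotence out of the box. In a Spring Boot application it can be enabled with properties roughly like these (exact keys depend on your Spring Boot / Spring Kafka version):

    spring.kafka.producer.acks=all
    spring.kafka.producer.properties.enable.idempotence=true

Idempotence only deduplicates retries within a single producer session; it does not prevent re-sending rows after an application restart, so the DB flag or consumer-side deduplication discussed above is still needed for this question's scenario.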

For Kafka, I have seen implementations that store a pointer to the id to keep track of where you are in the topic, using some sort of distributed storage to track this at a cluster level. I haven't done much work there, so I will try to describe a solution we used with SQS for duplicate detection. It is likely that Kafka has a better solution than this one for dealing with duplication; I just want to add it here so that you can look at alternative solutions as well.

I had the same problem while working with AWS SQS for point-to-point messaging use cases, as it provides an at-least-once delivery guarantee rather than once and only once.

We ended up using Redis with its distributed locking strategy to solve this problem. I have a write-up here: https://angularthinking.blogspot.com/

The high-level approach is to take a distributed lock and put an entry in the cache with a TTL appropriate for your use case. We use a LUA script to implement a putIfNotExists() method, as shown in the blog above. Scale was one of our concerns, and with the above implementation we were able to process tens of thousands of messages per second through SQS without any problems, and Redis scaled very well. We had to tune the TTL to an optimum value based on throughput and cache growth. We had the benefit that the duplication window was 24 hours or less, so relying on Redis for this decision was OK. If you have longer windows, where duplicates could occur across several days or months, the Redis option might not be suitable.
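
A minimal sketch of that duplicate check with Spring Data Redis (2.1+), using setIfAbsent, which maps to an atomic Redis SET NX EX, in place of the LUA-scripted putIfNotExists from the blog; the key prefix and TTL are assumptions to be tuned to your deduplication window:

    import java.time.Duration;
    import org.springframework.data.redis.core.StringRedisTemplate;
    import org.springframework.stereotype.Component;

    @Component
    public class RedisDeduplicator {

        // TTL defines the deduplication window; 24 hours matches the case described above.
        private static final Duration DEDUP_TTL = Duration.ofHours(24);

        private final StringRedisTemplate redisTemplate;

        public RedisDeduplicator(StringRedisTemplate redisTemplate) {
            this.redisTemplate = redisTemplate;
        }

        /** Returns true only for the first caller that claims this message id within the TTL window. */
        public boolean claim(String messageId) {
            Boolean firstTime = redisTemplate.opsForValue()
                    .setIfAbsent("dedup:" + messageId, "1", DEDUP_TTL); // atomic SET key value NX EX
            return Boolean.TRUE.equals(firstTime);
        }
    }

The consumer calls claim(messageId) before doing any work and simply drops the message when it returns false.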

We also looked at DynamoDB to implement putIfNotExists(), but Redis seemed more performant for this use case, especially with its native putIfNotExists-style implementation via a LUA script.

Good luck with your search.
