
What is the most efficient way of creating a unique list of incoming documents through Kafka when compared with those in ElasticSearch?

In ElasticSearch, I will have an index of RSS documents, each with its own hash.

Next, I have a scheduler that retrieves a list of RSS documents from a feed through Kafka Connect, which acts as a broker between microservices.

Using BulkRequestBuilder or BulkProcessor, which option is best? (I have also read that the latter is preferable for performance reasons.)

  1. Add all incoming RSS documents to a list, each with a hash based on the title; then iterate through the list and remove any documents whose hash matches one already in ES
  2. Before adding a document to the list, check whether its hash already exists in the ES index, and only add it to the list if it does not
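Option 1 above amounts to a set difference between incoming hashes and the hashes already stored in ES. A minimal sketch, assuming the existing hashes have already been fetched from ES (e.g. via a `terms` query or `mget`) into a set, and that `title_hash` is a hypothetical helper that defines the dedup key:

```python
import hashlib

def title_hash(title: str) -> str:
    """Deterministic hash of an RSS title, used as the dedup key."""
    return hashlib.sha256(title.encode("utf-8")).hexdigest()

def filter_new_docs(incoming, existing_hashes):
    """Keep only documents whose title hash is not already in ES.

    `existing_hashes` is assumed to be a set of hash strings fetched
    from Elasticsearch beforehand; how it is fetched is up to you.
    """
    return [doc for doc in incoming
            if title_hash(doc["title"]) not in existing_hashes]
```

This keeps the comparison to one round trip to ES (fetch the hashes once) instead of one existence check per document, which is the main cost difference between the two options.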

There may be a better way as well, which I welcome.

Documents will be removed from Kafka once they have been consumed, so would using Kafka Streams come into play here? And rather than doing the comparison through a query of sorts, would we use exactly-once semantics in the Kafka producer code, or does that belong in the consumer code?

If I'm on the right track with this, can someone please elaborate?

With the bulk option, an existing document is completely replaced by an incoming one that has the same ID. If that is acceptable for your use case, you don't have to do anything extra.
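One way to exploit this overwrite behavior is to derive the document `_id` from the title hash, so re-indexing the same feed item is idempotent and no pre-check against ES is needed at all. A sketch using the action format of the `elasticsearch-py` bulk helpers; the index name `rss` is an assumption:

```python
import hashlib

def bulk_actions(docs, index="rss"):
    """Yield bulk-index actions whose _id is the title hash.

    Because the _id is deterministic, indexing the same feed item
    twice overwrites the existing document instead of duplicating it.
    The dict keys follow the elasticsearch-py helpers.bulk() format.
    """
    for doc in docs:
        yield {
            "_op_type": "index",  # index = create-or-overwrite
            "_index": index,
            "_id": hashlib.sha256(doc["title"].encode("utf-8")).hexdigest(),
            "_source": doc,
        }
```

These actions would then be passed to something like `helpers.bulk(es_client, bulk_actions(docs))`; the same idea carries over to the Java BulkProcessor by setting the ID on each IndexRequest.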

Kafka can guarantee once-only delivery most of the time, but not always: provided your producers are not producing duplicate messages, the exception is that a few messages may be delivered twice during rebalance events on the Kafka cluster, and consumers should have a way to handle that.
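Handling potential redelivery on the consumer side usually means making processing idempotent. A minimal sketch that skips messages whose key has already been processed; the in-memory set is an assumption for illustration (a real consumer would back this with a persistent store, or simply rely on the deterministic ES document ID so a duplicate write is a harmless overwrite):

```python
def dedupe_stream(messages, seen=None):
    """Process (key, value) messages idempotently.

    A message redelivered after a rebalance has a key we have already
    seen, so it is skipped; the key is assumed to be the title hash.
    """
    seen = set() if seen is None else seen
    out = []
    for key, value in messages:
        if key in seen:
            continue  # duplicate delivery; already handled
        seen.add(key)
        out.append((key, value))
    return out
```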

Kafka, on the other hand, is different from conventional (JMS-based) brokers: a message is not deleted from Kafka on consumption. Removal is driven by the retention period setting, per topic or globally. The good thing about this is that you can always go back in time to consume old messages, or build new use cases that need to consume old messages.
