
Duplicates when using nutch -> elasticsearch solution

Original post: 2012-02-23 14:06:35 | Tags: indexing, nutch, elasticsearch, duplicate-removal, web-crawler

I have crawled some data using Nutch and managed to inject it into Elasticsearch. But I have one problem: if I inject the crawled data again, it creates duplicates. Is there any way to prevent this?

Has anyone managed to solve this, or does anyone have suggestions on how to solve it?

/Samus

2 Answers

One way is to keep an index of checksums for all the data you have sent to Elasticsearch in some database, and check new data against it before attempting to send it to Elasticsearch. Alternatively, you can run a "more like this" query to find similar documents and decide based on the results.

LINK - http://www.elasticsearch.org/guide/reference/query-dsl/mlt-field-query.html
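The checksum approach described above could be sketched roughly as follows. This is only an illustration: the SQLite table `seen`, the helper names `open_checksum_store` and `is_new_content`, and the choice of SHA-256 are all assumptions, not something the answer specifies.

```python
import hashlib
import sqlite3


def open_checksum_store(path=":memory:"):
    """Open (or create) a small SQLite database holding seen checksums.

    The table name and schema are hypothetical; any key-value store would do.
    """
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (checksum TEXT PRIMARY KEY)")
    return conn


def is_new_content(conn, content):
    """Record the checksum of `content`; return True only the first time it is seen."""
    checksum = hashlib.sha256(content.encode("utf-8")).hexdigest()
    # INSERT OR IGNORE leaves the table unchanged when the checksum already exists,
    # so rowcount is 1 for new content and 0 for a repeat.
    cur = conn.execute(
        "INSERT OR IGNORE INTO seen (checksum) VALUES (?)", (checksum,)
    )
    conn.commit()
    return cur.rowcount == 1
```

A crawler would call `is_new_content` for each page and only send the page to Elasticsearch when it returns `True`. Note that this detects identical content, not identical URLs, so a page whose content changed would be sent again.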

If you index each crawled page/document under the same ID every time, Elasticsearch won't duplicate it; indexing again with that ID replaces the existing document. You could use a checksum/hash function to turn the page's URL into a distinct ID.
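A minimal sketch of deriving a document ID from a page URL, assuming SHA-1 as the hash and a hypothetical helper name `url_to_doc_id` (neither is prescribed by the answer):

```python
import hashlib


def url_to_doc_id(url):
    """Return a stable hex ID for a URL, so re-crawls map to the same document."""
    # Stripping whitespace is the only normalization done here; a real crawler
    # may also want to normalize case, trailing slashes, or query parameters.
    return hashlib.sha1(url.strip().encode("utf-8")).hexdigest()
```

Because the same URL always hashes to the same ID, sending a page twice targets the same document instead of creating a second one.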

You can also use the operation type (op_type) to ensure that a document whose ID is already indexed is not reindexed:

The index operation also accepts an op_type that can be used to force a create operation, allowing for "put-if-absent" behavior. When create is used, the index operation will fail if a document by that id already exists in the index.

ElasticSearch index API
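As a rough illustration of the "put-if-absent" behavior quoted above, the sketch below builds (but does not send) an index request with `op_type=create`. The helper name `build_create_request`, the host, and the index/type names are assumptions, and the `/{index}/{type}/{id}` path reflects the URL layout of Elasticsearch releases from around the time of this post; newer versions use a different path.

```python
import json
import urllib.parse
import urllib.request


def build_create_request(base_url, index, doc_type, doc_id, document):
    """Build a PUT request that asks Elasticsearch to create, not overwrite."""
    # Quote each path segment so unusual IDs cannot break the URL.
    path = "/".join(
        urllib.parse.quote(part, safe="") for part in (index, doc_type, doc_id)
    )
    url = f"{base_url.rstrip('/')}/{path}?op_type=create"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical usage: the request is only constructed here, not sent.
req = build_create_request(
    "http://localhost:9200", "crawl", "page", "abc123",
    {"url": "http://example.com/"},
)
```

Passing `req` to `urllib.request.urlopen` would send it; when a document with that ID already exists, the server is expected to reject the request with an error response, which `urlopen` raises as `urllib.error.HTTPError`, so the caller can catch it and skip the page.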