Elasticsearch: Search for similar products

Original post: 2017-01-11 17:25:32 · Tags: elasticsearch, elasticsearch-query

I have a list of 50 million products. Each product has a list of 200 features. I want to find similar products by matching product features, ranking highest the products that have the maximum overlap across the 200 features.

Currently I concatenate the 200 words with spaces to form one long string. When I want to find similar products for a selected product, I retrieve its stored 200-word string and search Elasticsearch with it.

This gives the expected results, but each search takes around 7 seconds, because the search phrase is so long. Is there a better way to do this and find the best overlap in Elasticsearch?

1 Answer

I suggest that you check/try a few things:

"I have a list of 50 million products. Each product has a list of 200 features. I want to find similar products by matching product features, ranking highest the products that have the maximum overlap across the 200 features. Currently I concatenate the 200 words with spaces to form one long string."

Assuming Product is a doc type, you could try saving the features as a normal array of values and enabling fielddata on that field. It would then be easy to use aggregations to group them, applying the maximum overlap you mention, and get what you want. I strongly believe it would be much faster.
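As a rough sketch of the "array of values" idea, the Python below only builds request bodies as plain dicts; it does not call Elasticsearch. Several things here are assumptions rather than part of the answer: the field name `features`, the helper names, and the choice of a `keyword` field (which can be aggregated and matched exactly without enabling fielddata, unlike an analyzed text field). The query is also a swapped-in technique, not the aggregation approach described above: a `bool` query with one `constant_score` term clause per feature, so that products sharing more features rank higher.

```python
def features_mapping():
    """Mapping fragment (assumed field name: 'features').

    Any Elasticsearch field can hold several values, so an array of
    feature strings needs no special array type.
    """
    return {"properties": {"features": {"type": "keyword"}}}


def overlap_query(features, size=10, exclude_id=None):
    """Build a search body that ranks products by shared features.

    Each feature becomes one constant-score term clause, so a product's
    score grows with the number of features it has in common.
    """
    should = [
        {"constant_score": {"filter": {"term": {"features": f}}, "boost": 1.0}}
        for f in features
    ]
    bool_query = {"should": should, "minimum_should_match": 1}
    if exclude_id is not None:
        # Leave the selected product itself out of its own results.
        bool_query["must_not"] = [{"ids": {"values": [exclude_id]}}]
    return {"size": size, "query": {"bool": bool_query}}
```

For example, `overlap_query(stored_features, size=20, exclude_id="p1")` would produce a body for the 20 closest products to a hypothetical product `p1`. Whether this is faster than the long-string search on 50 million documents would need to be measured.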

"I retrieve its stored 200-word string and search Elasticsearch with it."

There may be situations where all you want is the aggregation result and not the full response with every product or other doc type. In those cases, set the search type to count (older versions) or use query_then_fetch with size 0 (newer versions). This avoids the initial fetch of all the documents and returns only the aggregations. Whether it applies depends on your requirements.
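A minimal sketch of an aggregation-only request, again built as a plain Python dict. The field name `features`, the aggregation name `feature_counts`, and the helper name are assumptions for illustration. Setting `size` to 0 asks for no hits, so the response carries only the aggregation buckets.

```python
def feature_counts_request(features, top_n=200):
    """Build a search body that returns only aggregation results.

    The terms query keeps documents having at least one of the given
    features; the terms aggregation counts documents per feature value.
    """
    return {
        "size": 0,  # no hits are returned, only the aggregations
        "query": {"terms": {"features": list(features)}},
        "aggs": {
            "feature_counts": {
                "terms": {"field": "features", "size": top_n}
            }
        },
    }
```

Note that this counts documents per feature value; it does not by itself rank products by overlap, so it fits only the cases where the aggregated numbers are what you need.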

Make sure your Elasticsearch environment is properly prepared.

Finally, with this number of docs, there's a chance you will find a shard/replica configuration that suits your case better than the default one.
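To make the shard/replica point concrete, here is a small sketch that builds an index-creation settings body. The helper name and the example numbers are hypothetical; suitable values depend on your hardware and data and have to be found by testing. Keep in mind that `number_of_shards` is fixed when the index is created, while `number_of_replicas` can be changed later.

```python
def index_settings(shards, replicas):
    """Build the settings body used when creating an index."""
    if shards < 1:
        raise ValueError("an index needs at least one primary shard")
    if replicas < 0:
        raise ValueError("replica count cannot be negative")
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,      # fixed at index creation
                "number_of_replicas": replicas,  # can be updated later
            }
        }
    }
```

For instance, `index_settings(10, 1)` describes an index with 10 primary shards and one replica of each; these numbers are placeholders, not a recommendation.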