
Elasticsearch autocomplete using nGram, results order

Original post: 2014-06-06 06:12:23 4 1 elasticsearch

I implemented autocomplete with the nGram filter, and everything works fine.

My problem is that the suggestions returned seem to be in arbitrary order.

For example, I have a field called "id" whose values look like numbers, such as "1000" or "45100231", but are stored as strings. When I type "10", I hope to see "1000" first, then maybe "102000", and so on. So the ideal suggestion order is: values where the match is a prefix come first, then values where it is in the middle, then values where it is a suffix, e.g. "1000" > "2101" > "1110". If the matches are all at the beginning, just sort by the following digits, e.g. "1000" > "1011" > "10200".

I've been reading lots of posts about Elasticsearch sorting but found no strategy that really works. Does anyone have an idea? Thanks!

1 answer

One way I see is to keep the autocomplete tokens in 3 fields:

1st field keeps prefixes (using edgeNGram)
2nd field keeps only the middle n-gram parts of the word (but I think this requires a custom filter)
3rd field keeps only suffixes

So for the value 12345 it generates the following set of tokens:

prefixes: 12, 123, 1234, 12345
middle: 23, 34, 234
suffixes: 2345, 345, 45
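
To make the three token sets concrete, here is a minimal Python sketch of how a value could be split into prefix, middle and suffix n-grams. It only illustrates the idea and is not an Elasticsearch analyzer; the function name `split_ngrams` and the minimum gram length of 2 are assumptions taken from the example above.

```python
def split_ngrams(value, min_len=2):
    """Split all n-grams of `value` into prefix, middle and suffix groups.

    Illustrative only: `split_ngrams` and `min_len` are hypothetical names.
    The full value counts as a prefix, as in the 12345 example.
    """
    n = len(value)
    prefixes, middle, suffixes = [], [], []
    for start in range(n):
        for end in range(start + min_len, n + 1):
            gram = value[start:end]
            if start == 0:
                prefixes.append(gram)   # starts at the first character
            elif end == n:
                suffixes.append(gram)   # ends at the last character
            else:
                middle.append(gram)     # touches neither end
    return prefixes, middle, suffixes


prefixes, middle, suffixes = split_ngrams("12345")
print(prefixes, middle, suffixes)
```

In a real index each group would go into its own field, produced by an analyzer configured for that field.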

Once you have such an index, you can use a bool query that matches against these 3 fields with different boost factors (boosts only influence scoring in a query, not in a filter); for example, boost prefixes ^10, middle ^1 and suffixes ^0.1.

I believe the results should be acceptable.
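
As a rough sketch of the boosted three-field search, the request body could be built like this in Python. The field names `id.prefix`, `id.middle` and `id.suffix` are hypothetical (the answer does not name them), and the boosts are the example values 10, 1 and 0.1. A `term` query is used so that the typed text is not analyzed again at search time.

```python
import json

# Hypothetical field names; the boosts are the example values from the answer.
FIELD_BOOSTS = {"id.prefix": 10, "id.middle": 1, "id.suffix": 0.1}


def build_autocomplete_query(typed):
    """Build a bool query with one boosted term clause per token field."""
    should = [
        {"term": {field: {"value": typed, "boost": boost}}}
        for field, boost in FIELD_BOOSTS.items()
    ]
    return {"query": {"bool": {"should": should}}}


print(json.dumps(build_autocomplete_query("10"), indent=2))
```

Because the boosts are combined with the normal relevance score, the resulting order is likely to follow prefix, middle, suffix but is not strictly guaranteed.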

UPDATE

For your case, with numbers only, I think it's better to use script_score from the function_score query (http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/query-dsl-function-score-query.html) and check manually, in mvel or javascript, whether the match is a prefix, middle or suffix, but you should keep just the raw_id in a separate field.
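
The check such a script would perform can be sketched in plain Python. This is not mvel or Elasticsearch script syntax, only the logic; the function names and the scores 10, 1 and 0.1 are assumptions reused from the boost example, and ties are broken by the raw id string, which gives the "sort by the next digits" behaviour asked for in the question.

```python
def position_score(raw_id, typed):
    """Score a raw id by where the typed text first occurs in it.

    Hypothetical weights: prefix 10, middle 1, suffix 0.1, no match 0.
    """
    idx = raw_id.find(typed)
    if idx < 0:
        return 0.0
    if idx == 0:
        return 10.0                      # match at the beginning
    if idx + len(typed) == len(raw_id):
        return 0.1                       # match at the very end
    return 1.0                           # match somewhere in the middle


def rank(raw_ids, typed):
    """Order matching ids by descending score, then by the id string."""
    matches = [r for r in raw_ids if position_score(r, typed) > 0]
    return sorted(matches, key=lambda r: (-position_score(r, typed), r))


print(rank(["1110", "10200", "2101", "1011", "1000"], "10"))
# → ['1000', '1011', '10200', '2101', '1110']
```

Inside Elasticsearch the same comparison would run against the stored raw_id value for each matching document.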