
Elasticsearch using NEST: How to configure analyzers to find partial words?

I am trying to search by a partial word, ignoring case and ignoring accents on some letters. Is it possible? I think nGram with the default tokenizer should do the trick, but I don't understand how to do it with NEST.

Example: "musiic" should match records that have "music"

The version of Elasticsearch I am using is 1.9.

I am doing it like this, but it doesn't work...

var ix = new IndexSettings();
// Custom analyzers must be registered under 'analyzer' inside 'analysis';
// 'index_analyzer'/'search_analyzer' are mapping options, not analysis sections.
ix.Add("analysis",
    @"{
        'analyzer' : {
            'my_index_analyzer' : {
                'type' : 'custom',
                'tokenizer' : 'standard',
                'filter' : ['lowercase', 'mynGram']
            },
            'my_search_analyzer' : {
                'type' : 'custom',
                'tokenizer' : 'standard',
                'filter' : ['standard', 'lowercase', 'mynGram']
            }
        },
        'filter' : {
            'mynGram' : {
                'type' : 'nGram',
                'min_gram' : 2,
                'max_gram' : 50
            }
        }
    }");
client.CreateIndex("sample", ix);

Thanks,

David

Short Answer

I think what you're looking for is a fuzzy query, which uses the Levenshtein distance algorithm to match similar words.
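To see why a fuzzy query would match here: the Levenshtein distance between "musiic" and "music" is 1 (one deleted letter), which is within the default fuzziness. A minimal sketch of the distance computation (illustrative only, not Elasticsearch's internal implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("musiic", "music"))  # 1: delete one 'i'
```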

Long Answer on nGrams

The nGram filter splits the text into many smaller tokens based on the defined min/max range.

For example, from your 'music' query the filter will generate: 'mu', 'us', 'si', 'ic', 'mus', 'usi', 'sic', 'musi', 'usic', and 'music'.

As you can see, musiic does not match any of these nGram tokens.
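The filter's token generation can be emulated in a few lines (a sketch of the nGram logic, not NEST or Lucene code):

```python
def ngram_tokens(token: str, min_gram: int = 2, max_gram: int = 50) -> list:
    """Emulate Elasticsearch's nGram token filter on a single token."""
    return [token[i:i + n]
            for n in range(min_gram, min(max_gram, len(token)) + 1)
            for i in range(len(token) - n + 1)]

print(ngram_tokens("music"))
# ['mu', 'us', 'si', 'ic', 'mus', 'usi', 'sic', 'musi', 'usic', 'music']
print("musiic" in ngram_tokens("music"))  # False: no token matches
```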

Why nGrams

One benefit of nGrams is that they make wildcard queries significantly faster, because all potential substrings are pre-generated and indexed at insert time (I have seen queries speed up from multiple seconds to 15 milliseconds using nGrams).

Without nGrams, each string in the index must be scanned at query time for a substring match [O(n^2)] instead of being looked up directly in the index [O(1)]. As pseudocode:

hits = []
for string in index:
    if query in string:
        hits.append(string)
return hits

vs

return index[query]

Note that this comes at the expense of slower inserts, more storage, and heavier memory usage.
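The insert-time trade-off above can be sketched as a precomputed substring index (a toy model of what the nGram filter feeds into the inverted index, not the actual Lucene data structure):

```python
from collections import defaultdict

def build_ngram_index(strings, min_gram=2, max_gram=50):
    """Precompute every substring token at insert time, like the nGram filter.

    Insertion does the heavy work (and uses more memory), so lookups
    become single dictionary accesses instead of a scan over all strings.
    """
    index = defaultdict(set)
    for s in strings:
        for n in range(min_gram, min(max_gram, len(s)) + 1):
            for i in range(len(s) - n + 1):
                index[s[i:i + n]].add(s)
    return index

idx = build_ngram_index(["music", "musician", "mute"])
print(sorted(idx["usi"]))  # ['music', 'musician'] -- direct lookup, no scan
```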
