How to make an accent-insensitive search in Elasticsearch with the NEST C# client?

I'm an Elasticsearch newbie.

Let's say we have a class like this:

public class A
{
    public string name;
}

And we have two documents with the names "Ayşe" and "Ayse".

Now, I want to store names with their accents, but when I search I want an accent-insensitive query to return the results as they were stored, accents included.

For example: when I search for "Ayse" or "Ayşe", it should return both "Ayşe" and "Ayse" as they are stored (with accents).

Right now, when I search for "Ayse" it only returns "Ayse", but I want "Ayşe" in the results too.

When I checked the Elasticsearch documentation, I saw that folded properties need to be used to achieve that. But I couldn't understand how to do it with NEST attributes/functions.

BTW, I'm using AutoMap to create mappings right now, and if possible I'd like to keep using it.

I've been searching for an answer for two days and haven't figured it out yet.

What changes are required, and where? Can you provide code samples?

Thank you.

EDIT 1:

I figured out how to use analyzers to create sub-fields of a property and get results with a term-based query against the sub-fields.

Now, I know I can do a multi-field search, but is there a way to include sub-fields in a full-text search?

Thank you.

You can configure an analyzer to perform analysis on the text at index time, index the result into a multi_field to use at query time, and keep the original source to return in the results. Based on what you have in your question, it sounds like you want a custom analyzer that uses the asciifolding token filter to convert characters to their ASCII equivalents at index and search time.

Given the following document
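To make "folding" concrete, here is a minimal, self-contained sketch of the idea in plain C#. It is not the Elasticsearch filter itself: the class and method names (AsciiFoldSketch, Fold) are hypothetical, and the approach (Unicode decomposition followed by dropping combining marks) only approximates what asciifolding does.

```csharp
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical helper for illustration only; Elasticsearch's asciifolding
// filter uses its own character mapping table rather than this approach.
public static class AsciiFoldSketch
{
    public static string Fold(string input)
    {
        // Decompose accented letters into base letter + combining mark,
        // e.g. 'ş' (U+015F) becomes 's' followed by U+0327.
        var decomposed = input.Normalize(NormalizationForm.FormD);

        // Drop the combining marks, keeping only the base characters.
        var kept = decomposed.Where(ch =>
            CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark);

        return new string(kept.ToArray()).Normalize(NormalizationForm.FormC);
    }
}
```

With this sketch, AsciiFoldSketch.Fold("Ayşe") returns "Ayse", which is why both stored names can end up matching the same query term. Characters that have no canonical decomposition (the Turkish dotless ı, for example) pass through this sketch unchanged, whereas the real asciifolding filter maps them explicitly.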

public class Document
{
    public int Id { get; set;}
    public string Name { get; set; }
}

Setting up a custom analyzer can be done when an index is created; we can also specify the mapping at the same time

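// Index name used throughout; "documents" matches the _index in the search response
var documentsIndex = "documents";

client.CreateIndex(documentsIndex, ci => ci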
    .Settings(s => s
        .NumberOfShards(1)
        .NumberOfReplicas(0)
        .Analysis(analysis => analysis
            .TokenFilters(tokenfilters => tokenfilters
                .AsciiFolding("folding-preserve", ft => ft
                    .PreserveOriginal()
                )
            )
            .Analyzers(analyzers => analyzers
                .Custom("folding-analyzer", c => c
                    .Tokenizer("standard")
                    .Filters("standard", "folding-preserve")
                )
            )
        )
    )
    .Mappings(m => m
        .Map<Document>(mm => mm
            .AutoMap()
            .Properties(p => p
                .String(s => s
                    .Name(n => n.Name)
                    .Fields(f => f
                        .String(ss => ss
                            .Name("folding")
                            .Analyzer("folding-analyzer")
                        )
                    )
                    .NotAnalyzed()
                )
            )
        )
    )
);

Here I've created an index with one shard and no replicas (you may want to change this for your environment), and have created a custom analyzer, folding-analyzer, that uses the standard tokenizer in conjunction with the standard token filter and a folding-preserve token filter that performs ASCII folding, storing the original tokens in addition to the folded tokens (more on why this may be useful in a minute).

I've also mapped the Document type, mapping the Name property as a multi_field, with the default field not_analyzed (useful for aggregations) and a .folding sub-field that will be analyzed with the folding-analyzer. The original source document will also be stored by Elasticsearch by default.

Now let's index some documents
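One way to check what the custom analyzer actually emits is the Analyze API. The following is a sketch that assumes a running cluster, the index created above, and a NEST 2.x client; the wrapper class and method names (AnalyzerCheck, PrintFoldingTokens) are hypothetical.

```csharp
using System;
using System.Linq;
using Nest;

public static class AnalyzerCheck
{
    // Assumes `client` is connected to a cluster where `indexName` was
    // created with the "folding-analyzer" defined in its settings.
    public static void PrintFoldingTokens(IElasticClient client, string indexName, string text)
    {
        var analyzeResponse = client.Analyze(a => a
            .Index(indexName)              // the analyzer lives in this index's settings
            .Analyzer("folding-analyzer")
            .Text(text)
        );

        // Each AnalyzeToken carries the emitted token text
        Console.WriteLine(string.Join(", ", analyzeResponse.Tokens.Select(t => t.Token)));
    }
}
```

Calling AnalyzerCheck.PrintFoldingTokens(client, documentsIndex, "Ayşe") should list both the folded and the original token, since the filter preserves the original. Note that the analyzer as defined lists no lowercase filter, so tokens keep their case; if case-insensitive matching is also wanted, adding "lowercase" to the filter list would be a separate change.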

client.Index<Document>(new Document { Id = 1, Name = "Ayse" });
client.Index<Document>(new Document { Id = 2, Name = "Ayşe" });

// refresh the index after indexing to ensure the documents just indexed are
// available to be searched
client.Refresh(documentsIndex);

Finally, searching for Ayşe

var response = client.Search<Document>(s => s
    .Query(q => q
        .QueryString(qs => qs
            .Fields(f => f
                .Field(c => c.Name.Suffix("folding"))
            )
            .Query("Ayşe")
        )
    )
);

yields

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.163388,
    "hits" : [ {
      "_index" : "documents",
      "_type" : "document",
      "_id" : "2",
      "_score" : 1.163388,
      "_source" : {
        "id" : 2,
        "name" : "Ayşe"
      }
    }, {
      "_index" : "documents",
      "_type" : "document",
      "_id" : "1",
      "_score" : 0.3038296,
      "_source" : {
        "id" : 1,
        "name" : "Ayse"
      }
    } ]
  }
}

Two things to highlight here:

Firstly, the _source contains the original text that was sent to Elasticsearch, so by using response.Documents you will get the original names; for example

string.Join(",", response.Documents.Select(d => d.Name));

would give you "Ayşe,Ayse"

Secondly, remember that we preserved the original tokens in the asciifolding token filter? Doing so means that we can perform queries that undergo analysis to match accent-insensitively but still take accent sensitivity into account when it comes to scoring; in the example above, the score for Ayşe matching Ayşe is higher than for Ayse matching Ayşe because the tokens Ayşe and Ayse are indexed for the former whilst only Ayse is indexed for the latter. When a query that undergoes analysis is performed against the name.folding sub-field, the query is analyzed with the folding-analyzer and a search for matches is performed:

Index time
----------

document 1 name: Ayse --analysis--> Ayse

document 2 name: Ayşe --analysis--> Ayşe, Ayse  


Query time
-----------

query_string query input: Ayşe --analysis--> Ayşe, Ayse

search for documents with tokens for name field matching Ayşe or Ayse 
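Regarding the question's EDIT 1 (including sub-fields in a full-text search), one possible approach is a multi_match query across the main field and its folding sub-field. This is a sketch assuming NEST 2.x and the mapping above; the wrapper class and method names (FoldingSearch, SearchByName) are hypothetical.

```csharp
using Nest;

// Same shape as the Document class defined earlier
public class Document
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class FoldingSearch
{
    // Assumes `client` has a default index pointing at the index created above.
    public static ISearchResponse<Document> SearchByName(IElasticClient client, string text)
    {
        return client.Search<Document>(s => s
            .Query(q => q
                .MultiMatch(mm => mm
                    .Fields(f => f
                        .Field(c => c.Name)                   // not_analyzed: exact matches only
                        .Field(c => c.Name.Suffix("folding")) // analyzed with folding-analyzer
                    )
                    .Query(text)
                )
            )
        );
    }
}
```

Because name is not_analyzed, it only contributes when the input matches the stored value exactly; the accent-insensitive matching comes from name.folding. How the two fields combine into a final score depends on the multi_match type and the Elasticsearch version, so it is worth checking the scores against your own data.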
