How to make an accent-insensitive search in Elasticsearch with the NEST C# client?
I'm an Elasticsearch newbie.
Let's say we have a class like this:
public class A
{
    public string name;
}
And we have 2 documents which have names like "Ayşe" and "Ayse".
Now, I want to store names with their accents, but when I search I want an accent-insensitive query to return the stored, accented values.
For example, when I search for "Ayse" or "Ayşe", it should return both "Ayşe" and "Ayse" as they are stored (with accents).
Right now when I search for "Ayse" it only returns "Ayse", but I want to have "Ayşe" as a result too.
When I checked the Elasticsearch documentation, I saw that folded properties need to be used to achieve that. But I couldn't understand how to do it with NEST attributes / functions.
BTW I'm using AutoMap to create mappings right now and, if possible, I want to be able to continue to use it.
I have been searching for an answer for 2 days now and haven't figured it out yet.
What changes are required, and where? Can you provide me with code sample(s)?
Thank you.
EDIT 1:
I figured out how to use analyzers to create sub fields of a property and achieve results with a term-based query against the sub fields.
Now, I know I can do a multi field search, but is there a way to include sub fields in a full text search?
Thank you.
You can configure an analyzer to perform analysis on the text at index time, index this into a multi_field to use at query time, and keep the original source to return in the result. Based on what you have in your question, it sounds like you want a custom analyzer that uses the asciifolding token filter to convert to ASCII characters at index and search time.
Given the following document
public class Document
{
    public int Id { get; set; }
    public string Name { get; set; }
}
Setting up a custom analyzer can be done when an index is created; we can also specify the mapping at the same time
client.CreateIndex(documentsIndex, ci => ci
    .Settings(s => s
        .NumberOfShards(1)
        .NumberOfReplicas(0)
        .Analysis(analysis => analysis
            .TokenFilters(tokenfilters => tokenfilters
                .AsciiFolding("folding-preserve", ft => ft
                    .PreserveOriginal()
                )
            )
            .Analyzers(analyzers => analyzers
                .Custom("folding-analyzer", c => c
                    .Tokenizer("standard")
                    .Filters("standard", "folding-preserve")
                )
            )
        )
    )
    .Mappings(m => m
        .Map<Document>(mm => mm
            .AutoMap()
            .Properties(p => p
                .String(s => s
                    .Name(n => n.Name)
                    .Fields(f => f
                        .String(ss => ss
                            .Name("folding")
                            .Analyzer("folding-analyzer")
                        )
                    )
                    .NotAnalyzed()
                )
            )
        )
    )
);
Here I've created an index with one shard and no replicas (you may want to change this for your environment), and have created a custom analyzer, folding-analyzer, that uses the standard tokenizer in conjunction with the standard token filter and a folding-preserve token filter that performs ASCII folding, storing the original tokens in addition to the folded tokens (more on why this may be useful in a minute).
I've also mapped the Document type, mapping the Name property as a multi_field, with the default field not_analyzed (useful for aggregations) and a .folding sub-field that will be analyzed with the folding-analyzer. The original source document will also be stored by Elasticsearch by default.
Now let's index some documents
client.Index<Document>(new Document { Id = 1, Name = "Ayse" });
client.Index<Document>(new Document { Id = 2, Name = "Ayşe" });
// refresh the index after indexing to ensure the documents just indexed are
// available to be searched
client.Refresh(documentsIndex);
Finally, searching for Ayşe
var response = client.Search<Document>(s => s
    .Query(q => q
        .QueryString(qs => qs
            .Fields(f => f
                .Field(c => c.Name.Suffix("folding"))
            )
            .Query("Ayşe")
        )
    )
);
yields
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.163388,
    "hits" : [ {
      "_index" : "documents",
      "_type" : "document",
      "_id" : "2",
      "_score" : 1.163388,
      "_source" : {
        "id" : 2,
        "name" : "Ayşe"
      }
    }, {
      "_index" : "documents",
      "_type" : "document",
      "_id" : "1",
      "_score" : 0.3038296,
      "_source" : {
        "id" : 1,
        "name" : "Ayse"
      }
    } ]
  }
}
Two things to highlight here:
Firstly, the _source contains the original text that was sent to Elasticsearch, so by using response.Documents you will get the original names. For example,
string.Join(",", response.Documents.Select(d => d.Name));
would give you "Ayşe,Ayse"
Secondly, remember that we preserved the original tokens in the asciifolding token filter? Doing so means that we can perform queries that undergo analysis to match accent insensitively but also take accent sensitivity into account when it comes to scoring; in the example above, the score for Ayşe matching Ayşe is higher than for Ayse matching Ayşe because the tokens Ayşe and Ayse are indexed for the former whilst only Ayse is indexed for the latter. When a query that undergoes analysis is performed against the folding sub-field of the Name property, the query is analyzed with the folding-analyzer and a search for matches is performed
Index time
----------
document 1 name: Ayse --analysis--> Ayse
document 2 name: Ayşe --analysis--> Ayşe, Ayse
Query time
-----------
query_string query input: Ayşe --analysis--> Ayşe, Ayse
search for documents with tokens for name field matching Ayşe or Ayse
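To make the index-time and query-time flow above concrete without a running cluster, here is a minimal, self-contained sketch in plain C#. It does not use NEST or Elasticsearch at all: FoldingSketch, Fold and Analyze are hypothetical names made up for this illustration, whitespace splitting stands in for the standard tokenizer, and the folding is approximated with Unicode decomposition, which happens to agree with asciifolding for ş but is not the same algorithm in general.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical illustration only; not part of NEST or Elasticsearch.
public static class FoldingSketch
{
    // Approximates ASCII folding: decompose each character, then drop
    // the combining marks (e.g. the cedilla of "ş").
    public static string Fold(string token)
    {
        var sb = new StringBuilder();
        foreach (var ch in token.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(ch);
            }
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    // Mimics preserve_original: keep the original token and add its folded form.
    // Splitting on spaces is a stand-in for the standard tokenizer.
    public static ISet<string> Analyze(string text)
    {
        var tokens = new HashSet<string>();
        foreach (var word in text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            tokens.Add(word);
            tokens.Add(Fold(word));
        }
        return tokens;
    }

    public static void Main()
    {
        // "Index time": analyze each document's name once.
        var index = new Dictionary<int, ISet<string>>
        {
            { 1, Analyze("Ayse") },
            { 2, Analyze("Ayşe") }
        };

        // "Query time": analyze the query the same way and match on any shared token.
        foreach (var query in new[] { "Ayse", "Ayşe" })
        {
            var queryTokens = Analyze(query);
            var hits = index
                .Where(d => d.Value.Overlaps(queryTokens))
                .Select(d => d.Key)
                .OrderBy(id => id);
            Console.WriteLine(query + " -> documents " + string.Join(",", hits));
        }
    }
}
```

In this sketch both queries match both documents, because every analyzed form contains the folded token Ayse. Only document 2 additionally carries the original token Ayşe, which is the extra match that lets Elasticsearch score it higher for the accented query.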