
Azure Cognitive Search - When would you use different search and index analyzers?

2022-11-12 00:52:34 | Tags: azure, lucene, full-text-search, azure-cognitive-search, lexical-analysis

I'm trying to understand the purpose of configuring different analyzers for searching and indexing in Azure Search. See: https://learn.microsoft.com/en-us/rest/api/searchservice/create-index#-field-definitions-

As I understand it, the job of the indexing analyzer is to break up the input document into individual tokens. In the process, it might apply several transformations, such as lower-casing the content, removing punctuation and whitespace, and even removing entire words.

If the tokens are already processed, what is the use of the search analyzer?

Initially, I thought it would apply a similar process to the search query itself, but wouldn't using a different analyzer at this stage than the one used to index the document completely break the search results? If the indexing analyzer lower-cased everything but the search analyzer doesn't lower-case the query, wouldn't that mean you'd never get matches for queries containing upper-case characters? What if the search analyzer doesn't split tokens on whitespace? Wouldn't you fail to get a match the moment the query includes a space?
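The mismatch described above can be reproduced with a toy inverted index. This is a minimal sketch in plain Python, not Azure's implementation: the analyzers, `build_index`, and `search` are hypothetical helpers invented for illustration, and the query logic assumes "match any term" semantics.

```python
import re

def lowercasing_analyzer(text):
    """Hypothetical analyzer: split on non-alphanumerics, then lower-case."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def case_preserving_analyzer(text):
    """Hypothetical analyzer: same tokenization, but keeps the original case."""
    return re.findall(r"[A-Za-z0-9]+", text)

def build_index(docs, analyzer):
    """Build a toy inverted index mapping token -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyzer(text):
            index.setdefault(token, set()).add(doc_id)
    return index

def search(index, query, analyzer):
    """Return ids of documents that contain any analyzed query token."""
    hits = set()
    for token in analyzer(query):
        hits |= index.get(token, set())
    return hits

docs = {1: "The Quick Brown Fox", 2: "lazy dogs"}
index = build_index(docs, lowercasing_analyzer)

# Same analyzer on both sides: "Quick" becomes "quick" and matches document 1.
matched = search(index, "Quick", lowercasing_analyzer)

# Mismatched search analyzer: "Quick" stays capitalized and finds nothing,
# because the index only contains lower-cased tokens.
mismatched = search(index, "Quick", case_preserving_analyzer)

print(matched, mismatched)
```

In this sketch the two analyzers must agree on the final token form for a match to occur, which is exactly the concern raised in the question.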

Assuming that this is indeed how the two analyzers work together, why would you ever want to set two different ones?

1 Answer

Your understanding of the difference between the index and search analyzers is correct. One scenario where using different analyzers is valuable is applying n-grams at indexing time but not to search terms. A document containing "cat" would then be indexed as "c", "ca", "cat", so a query for "ca" matches it directly. You wouldn't necessarily want to apply n-grams to the search term as well: that would make the query less performant, and it isn't necessary because the documents have already produced the n-grams. Hopefully that makes sense!
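The n-gram scenario can be sketched with a toy inverted index. This is an illustrative simulation in plain Python, not Azure's implementation: `edge_ngrams`, both analyzers, `build_index`, and `search` are hypothetical helpers, the n-grams are prefix ("edge") n-grams as in the "cat" example, and queries are assumed to match on any term.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the prefixes of token with lengths min_gram..max_gram."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]

def ngram_index_analyzer(text):
    """Hypothetical index analyzer: lower-case, split on whitespace, emit edge n-grams."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(edge_ngrams(word))
    return tokens

def plain_search_analyzer(text):
    """Hypothetical search analyzer: lower-case and split on whitespace only."""
    return text.lower().split()

def build_index(docs, analyzer):
    """Build a toy inverted index mapping token -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyzer(text):
            index.setdefault(token, set()).add(doc_id)
    return index

def search(index, query, analyzer):
    """Return ids of documents that contain any analyzed query token."""
    hits = set()
    for token in analyzer(query):
        hits |= index.get(token, set())
    return hits

docs = {1: "cat", 2: "dog"}
index = build_index(docs, ngram_index_analyzer)

# The index already holds "c", "ca", "cat", so a plain query for "ca" matches.
prefix_hits = search(index, "ca", plain_search_analyzer)

# A plain query for "cow" produces the single term "cow", which is not indexed.
plain_cow_hits = search(index, "cow", plain_search_analyzer)

# N-gramming the query too expands "cow" into "c", "co", "cow": three lookups
# instead of one, and in this toy model "c" now matches "cat" as well.
ngram_cow_hits = search(index, "cow", ngram_index_analyzer)

print(prefix_hits, plain_cow_hits, ngram_cow_hits)
```

In this sketch, n-gramming the query multiplies the number of term lookups and, under "match any term" semantics, also lets the unrelated query "cow" hit the document "cat". Whether a real service behaves the same way depends on its query settings.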