Tokenize result of NGramFilterFactory in Solr (Query Analyzer)

Question

I'm using the NGramFilterFactory for indexing and querying.

So if I search for "overflow", it creates a query like this:

mySearchField:"ov ve ... erflow overflo verflow overflow"

But if I misspell "overflow", e.g. as "owerflow", there are no matches because of the quotes around the query:

mySearchField:"ow we ... erflow owerflo werflow owerflow"

Is it possible to tokenize the result of the NGramFilterFactory so that it creates a query like this:

mySearchField:"ow"
mySearchField:"we"
mySearchField:"erflow"
mySearchField:"owerflo"
mySearchField:"werflow"
mySearchField:"owerflow"

In this case Solr would also find results, because the token "erflow" exists.

Answer 1

You don't need to tokenize your query the way you wrote. Check that your schema.xml applies the NGramFilterFactory at both index time and query time. Beyond that, the query parser you're using makes the difference: with LuceneQParser you'd get the result you're looking for, but not with DisMax or eDisMax.

I checked the query mySearchField:owerflow with eDisMax and debugQuery=on:

<str name="querystring">text:owerflow</str>
<str name="parsedquery">
+((text:o text:w text:e text:r text:f text:l text:o text:w text:ow text:we text:er text:rf text:fl text:lo text:ow text:owe text:wer text:erf text:rfl text:flo text:low text:ower text:werf text:erfl text:rflo text:flow text:owerf text:werfl text:erflo text:rflow text:owerfl text:werflo text:erflow text:owerflo text:werflow text:owerflow)~36)
</str>

If you look at the end of the generated query you'll see ~36, where 36 is the number of n-grams generated from your query. That ~36 is why you don't get any results, but you can change it through the mm (minimum should match) parameter.
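The count of 36 can be checked by hand. Here is a minimal Python sketch, assuming minGramSize=1 and a maxGramSize of at least 8; the helper name ngrams is made up for illustration and is not part of Solr:

```python
def ngrams(term, min_size=1, max_size=25):
    """Return all character n-grams of term, shortest first, duplicates kept."""
    grams = []
    for size in range(min_size, min(max_size, len(term)) + 1):
        for start in range(len(term) - size + 1):
            grams.append(term[start:start + size])
    return grams


query_grams = ngrams("owerflow")
print(query_grams[:8])   # the eight 1-grams: o w e r f l o w
print(len(query_grams))  # 36, the number after the ~ in parsedquery
```

An 8-character term yields 8 + 7 + ... + 1 = 36 n-grams, which is the value shown after the ~.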

If you change the query to mySearchField:owerflow&mm=1, or use any mm value lower than 25, you'll get the result you're looking for.
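The bound of 25 fits a simple count of how many query n-grams also occur in the correctly spelled word. The sketch below is a simplified model, not Solr's matching code: it assumes a single indexed document containing only "overflow", minGramSize=1, and that duplicate clauses such as the two text:o clauses each count toward mm. The helper name ngrams is hypothetical.

```python
def ngrams(term, min_size=1, max_size=25):
    """Return all character n-grams of term, shortest first, duplicates kept."""
    grams = []
    for size in range(min_size, min(max_size, len(term)) + 1):
        for start in range(len(term) - size + 1):
            grams.append(term[start:start + size])
    return grams


indexed = set(ngrams("overflow"))   # terms stored for the indexed word
clauses = ngrams("owerflow")        # one optional clause per query n-gram
matching = [gram for gram in clauses if gram in indexed]

print(len(clauses), len(matching))  # 36 24
```

Under these assumptions 24 of the 36 clauses match, so an mm of up to 24 can be satisfied while the default requirement of all 36 cannot.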

The difference between this answer and yours is that with EdgeNGramFilterFactory an infix query like mySearchField:werflow doesn't return any result, while it does with NGramFilterFactory.
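To see why the infix query behaves differently, compare the terms each filter would produce. This is a rough Python sketch that treats a query as a hit when at least one term overlaps (roughly mm=1); the helper names ngrams and edge_ngrams are made up for illustration.

```python
def ngrams(term):
    """All character n-grams (every substring) of term."""
    return [term[start:start + size]
            for size in range(1, len(term) + 1)
            for start in range(len(term) - size + 1)]


def edge_ngrams(term):
    """Only the n-grams anchored at the start of term (its prefixes)."""
    return [term[:size] for size in range(1, len(term) + 1)]


indexed_word, query = "overflow", "werflow"

edge_hits = set(edge_ngrams(query)) & set(edge_ngrams(indexed_word))
full_hits = set(ngrams(query)) & set(ngrams(indexed_word))

print(sorted(edge_hits))      # [] because no prefix of "werflow" is a prefix of "overflow"
print("erflow" in full_hits)  # True
```

With edge n-grams every term starts at the first character, so a query that begins in the middle of the indexed word shares nothing with it.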

Anyway, if you're using the NGramFilterFactory for spelling correction, I'd strongly recommend having a look at the SpellCheckComponent as well, which is made exactly for that purpose.

Answer 2

OK, I found a quick and easy way to solve the problem.

The fieldType has an optional attribute autoGeneratePhraseQueries (default=true). If I set autoGeneratePhraseQueries to false, everything works fine.

Explanation:

fieldType used in schema.xml:

<fieldType name="edgytext" class="solr.TextField" autoGeneratePhraseQueries="false">
 <analyzer type="index">
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
 </analyzer>
 <analyzer type="query">
   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
 </analyzer>
</fieldType>

If you index the word "surprise", the following tokens are in the index:

s, su, sur, surp, surpr, surpri, surpris, surprise

If you search for "surpriese" (misspelled), Solr creates the following tokens (the first six, s through surpri, match indexed tokens):

s, su, sur, surp, surpr, surpri, surprie, surpries, surpriese

The actual query that gets created looks like:

mySearchField:s, mySearchField:su, mySearchField:sur .. and so on

But if you set autoGeneratePhraseQueries=true, the following query is created:

mySearchField:"s su sur surp surpr surpri surprie surpries surpriese"

This is a phrase query and does not match the indexed terms.
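The two behaviours can be compared with a small Python sketch. It is a simplification: the phrase query is modelled only as "every query token must be present" (real phrase matching also checks positions), and the non-phrase case as "any one token is enough". The helper name edge_ngrams is made up for illustration.

```python
def edge_ngrams(term, min_size=1, max_size=25):
    """Prefixes of term, mimicking minGramSize=1 and maxGramSize=25."""
    return [term[:size] for size in range(min_size, min(max_size, len(term)) + 1)]


indexed = set(edge_ngrams("surprise"))
query_tokens = edge_ngrams("surpriese")

shared = [token for token in query_tokens if token in indexed]

# autoGeneratePhraseQueries="false": tokens become separate optional clauses.
any_match = any(token in indexed for token in query_tokens)
# autoGeneratePhraseQueries="true": simplified to "all tokens must be present".
phrase_match = all(token in indexed for token in query_tokens)

print(shared)                   # ['s', 'su', 'sur', 'surp', 'surpr', 'surpri']
print(any_match, phrase_match)  # True False
```

The tokens surprie, surpries and surpriese never occur in the index, which is enough to make the phrase form fail even though six shorter prefixes match.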

Answer 1: score 4, posted 2012-02-10 13:13:47

Answer 2: score 1, accepted, posted 2012-02-10 16:34:05