
Tokenize result of a NGramFilterFactory in Solr (query analyzer)

I'm using the NGramFilterFactory for indexing and querying.

So if I'm searching for "overflow", it creates a query like this:

mySearchField:"ov ve ... erflow overflo verflow overflow"

But if I misspell "overflow", e.g. as "owerflow", there are no matches because of the quotes around the query:

mySearchField:"ow we ... erflow owerflo werflow owerflow"

Is it possible to tokenize the output of the NGramFilterFactory so that it creates a query like this:

mySearchField:"ow"
mySearchField:"we"
mySearchField:"erflow"
mySearchField:"owerflo"
mySearchField:"werflow"
mySearchField:"owerflow"

In this case Solr would also find results, because the token "erflow" exists.

You don't need to tokenize your query the way you describe. Check that your schema.xml applies the NGramFilterFactory at both index time and query time. After that, the query parser you're using makes the difference: with LuceneQParser you'd get the result you're looking for, but not with DisMax and eDisMax.

I checked the query mySearchField:owerflow with eDisMax and debugQuery=on:

<str name="querystring">text:owerflow</str>
<str name="parsedquery">
+((text:o text:w text:e text:r text:f text:l text:o text:w text:ow text:we text:er text:rf text:fl text:lo text:ow text:owe text:wer text:erf text:rfl text:flo text:low text:ower text:werf text:erfl text:rflo text:flow text:owerf text:werfl text:erflo text:rflow text:owerfl text:werflo text:erflow text:owerflo text:werflow text:owerflow)~36)
</str>

If you look at the end of the generated query you'll see ~36, where 36 is the number of n-grams generated from your query. You don't get any results because of that ~36, but you can change it through the mm (minimum should match) parameter.
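The figure of 36 can be reproduced with a minimal Python sketch of n-gram generation. It assumes gram sizes from 1 to 8, which is inferred from the parsed query above rather than taken from the actual schema. The helper name ngrams is made up for illustration and is not Solr's implementation.

```python
def ngrams(text, min_size, max_size):
    """Return all substrings of text with lengths min_size..max_size,
    ordered by length and then by start position."""
    grams = []
    for size in range(min_size, max_size + 1):
        for start in range(len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams

# Assumed gram sizes (1..8), inferred from the debug output above.
query_grams = ngrams("owerflow", 1, 8)
indexed_grams = set(ngrams("overflow", 1, 8))

# Query clauses whose term would also be indexed for the correct spelling.
matching = [g for g in query_grams if g in indexed_grams]

print(len(query_grams))  # 36
print(len(matching))     # 24
print("erflow" in matching, "owerflow" in matching)  # True False
```

Under these assumptions only 24 of the 36 clauses can match a document containing "overflow", so a query that requires all 36 to match finds nothing.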

If you change the query to mySearchField:owerflow&mm=1, or set mm to any value lower than 25, you'll get the result you're looking for.
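As a rough sketch, such a request could be assembled in Python as follows. The host, port and core name (localhost:8983, mycore) are placeholders that are not part of the original post; q, defType, mm and debugQuery are the request parameters discussed above.

```python
from urllib.parse import urlencode

# Hypothetical host and core name; adjust to your installation.
base_url = "http://localhost:8983/solr/mycore/select"

params = {
    "q": "mySearchField:owerflow",
    "defType": "edismax",   # use the eDisMax query parser
    "mm": "1",              # at least one n-gram clause must match
    "debugQuery": "on",     # include the parsed query in the response
}

# urlencode percent-escapes the colon in the q value.
url = base_url + "?" + urlencode(params)
print(url)
```

Fetching this URL and inspecting the parsedquery entry in the debug section should show the lowered minimum-should-match value.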

The difference between this answer and yours is that with EdgeNGramFilterFactory an infix query like mySearchField:werflow doesn't return any result, while it does with NGramFilterFactory.
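The difference can be illustrated with a small Python sketch that compares the terms both filters would produce for the indexed word "overflow" and the infix query "werflow". The helper names and the gram sizes (1 to 25) are assumptions for illustration. This only mimics the token output and is not Solr code.

```python
def ngrams(text, min_size, max_size):
    """All substrings of text with lengths min_size..max_size."""
    return [text[start:start + size]
            for size in range(min_size, max_size + 1)
            for start in range(len(text) - size + 1)]

def edge_ngrams(text, min_size, max_size):
    """Only the prefixes of text with lengths min_size..max_size."""
    longest = min(max_size, len(text))
    return [text[:size] for size in range(min_size, longest + 1)]

indexed_word = "overflow"
infix_query = "werflow"

# Terms shared between index side and query side for each filter.
shared_edge = set(edge_ngrams(indexed_word, 1, 25)) & set(edge_ngrams(infix_query, 1, 25))
shared_full = set(ngrams(indexed_word, 1, 25)) & set(ngrams(infix_query, 1, 25))

print(sorted(shared_edge))      # []
print("erflow" in shared_full)  # True
```

With edge n-grams every term starts at the first character, so "o..." and "w..." prefixes never overlap. Full n-grams also cover the interior of the word, which is why "erflow" is shared.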

Anyway, if you're using the NGramFilterFactory for spelling correction, I'd strongly recommend having a look at the SpellCheckComponent as well, which is made exactly for that purpose.

OK, I found a quick and easy way to solve the problem.

The fieldType has an optional attribute autoGeneratePhraseQueries (Default=true; note that this depends on the schema version, and schemas declaring version 1.4 or later default to false). If I set autoGeneratePhraseQueries to false, everything works fine.

Explanation:

fieldType used in schema.xml:

<fieldType name="edgytext" class="solr.TextField" autoGeneratePhraseQueries="false">
 <analyzer type="index">
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
 </analyzer>
 <analyzer type="query">
   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
 </analyzer>
</fieldType>

If you index the word "surprise", the following tokens are in the index:

s, su, sur, surp, surpr, surpri, surpris, surprise

If you search for "surpriese" (misspelled), Solr creates the following tokens (the first six, s through surpri, match indexed tokens):

s, su, sur, surp, surpr, surpri, surprie, surpries, surpriese

The real query that will be created looks like:

mySearchField:s, mySearchField:su, mySearchField:sur ... and so on

But if you set autoGeneratePhraseQueries=true, the following query will be created:

mySearchField:"s su sur surp surpr surpri surprie surpries surpriese"

This is a phrase query and does not match the indexed terms.
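The "surprise" example can be traced with a short Python sketch. It is a simplification under stated assumptions: edge_ngrams is a hypothetical helper that mimics the filter's output for minGramSize=1 and maxGramSize=25, and the phrase check only tests the necessary condition that every phrase term exists in the index (real phrase matching also considers term positions).

```python
def edge_ngrams(text, min_size=1, max_size=25):
    """Prefixes of text, mimicking the edge n-gram token output."""
    longest = min(max_size, len(text))
    return [text[:size] for size in range(min_size, longest + 1)]

indexed_terms = set(edge_ngrams("surprise"))
query_terms = edge_ngrams("surpriese")  # misspelled query

matching = [t for t in query_terms if t in indexed_terms]
missing = [t for t in query_terms if t not in indexed_terms]

# Separate term queries: one matching term is enough to find the document.
term_queries_match = len(matching) > 0
# Phrase query: cannot match if any of its terms is absent from the index.
phrase_can_match = len(missing) == 0

print(matching)            # ['s', 'su', 'sur', 'surp', 'surpr', 'surpri']
print(missing)             # ['surprie', 'surpries', 'surpriese']
print(term_queries_match)  # True
print(phrase_can_match)    # False
```

This matches the behaviour described above: with autoGeneratePhraseQueries set to false the six shared prefixes are enough to produce a hit, while the single phrase fails on the three terms that were never indexed.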
