SOLR WordDelimiterFilterFactory

I use WordDelimiterFilterFactory to split words that contain numbers into separate Solr tokens. For example, the word Php5 is split into two tokens, "PHP" and "5". When searching, the request that Solr executes is q="php" and q="5". But this request also matches documents that contain only "5". What I want is to find only documents that contain "PHP5" or "PHP 5".

Does anyone have an idea how to get around this?

I hope this is clear.

Thanks.

In addition to indexing "php5", you need to get Solr to index "php 5" as a single token. That way a search for "php 5" will match, but a search for "blah 5", for example, will not.

The only way I was able to get this to work well was to use the Auto Phrasing filter by LucidWorks.

    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
        />
        <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory" phrases="autophrases.txt" includeTokens="true" replaceWhitespaceWith="_" />  
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
        />
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>

synonyms.txt

php5,php_5

protwords.txt (so the delimiter doesn't break it)

php5
php_5
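
As a possible alternative, offered as a sketch rather than as part of the setup above: `KeywordMarkerFilterFactory` is documented chiefly as a way to protect tokens from stemming, while `WordDelimiterFilterFactory` accepts its own `protected` attribute that names a file of words to pass through without splitting. Assuming the same protwords.txt is reused (one word per line), the delimiter filter line might look like this:

```xml
<!-- Sketch only: same options as the filter above, plus the delimiter filter's own protected-words file. -->
<!-- Reusing protwords.txt here is an assumption; any word-list file in the conf directory would do. -->
<filter class="solr.WordDelimiterFilterFactory"
        protected="protwords.txt"
        splitOnNumerics="1"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="0"
        catenateNumbers="0"
        catenateAll="0"
        splitOnCaseChange="1"
        preserveOriginal="1"/>
```

If this variant is used, the same attribute would need to be set in both the index and the query analyzer so that the two sides produce the same tokens.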

You also have to change the query parser to use the LucidWorks parser.

solrconfig.xml

<queryParser name="autophrasingParser" class="com.lucidworks.analysis.AutoPhrasingQParserPlugin" >
  <str name="phrases">autophrases.txt</str>
  <str name="replaceWhitespaceWith">_</str>
  <str name="ignoreCase">false</str>
</queryParser> 
<requestHandler name="/searchp" class="solr.SearchHandler">
    <lst name="defaults">
         <str name="echoParams">explicit</str>
         <int name="rows">10</int>
         <str name="df">Keywords</str>
         <str name="defType">autophrasingParser</str>
    </lst>
</requestHandler>  

autophrases.txt

php 5

The filter can be found here: https://github.com/LucidWorks/auto-phrase-tokenfilter

This article was also very helpful: http://lucidworks.com/2014/07/02/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/

This filter splits tokens at word delimiters.

In your case you can set splitOnNumerics="0", so it won't split on numbers.

splitOnNumerics:

(integer, default 1) If 0, don't split words on transitions from alpha to numeric: "FemBot3000" -> "Fem", "Bot3000"
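
A minimal sketch of how this setting could be applied is shown below. The field type name `text_nosplit_numeric` is hypothetical, and the remaining attributes are simply carried over from the configuration shown earlier in this thread. The lower-case filter is placed after the delimiter filter here so that `splitOnCaseChange` still sees the original casing.

```xml
<!-- Sketch only: the field type name is an assumption, adjust it to your schema. -->
<fieldType name="text_nosplit_numeric" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- splitOnNumerics="0": "Php5" is kept as one token instead of "Php" + "5" -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnNumerics="0"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="1"
            preserveOriginal="1"/>
    <!-- Lower-case after delimiting, so "Php5" is indexed and queried as "php5" -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that the delimiter filter works on one token at a time, so this keeps "php5" intact but does not by itself join "php 5" (two tokens after tokenization) into a single term.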

The rules for determining delimiters are described at the link below:

https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter
