SOLR WordDelimiterFilterFactory

Question

I use WordDelimiterFilterFactory to split words that contain numbers into separate Solr tokens. For example, the word "Php5" is split into two tokens, "PHP" and "5". When searching, the request that Solr executes is q="php" and q="5". But this request also matches documents that contain only "5". What I want is to find only documents containing "PHP5" or "PHP 5".

Does anyone have an idea how to get around this?

I hope that is clear.

Thanks.

Answer 1

In addition to indexing "php5", you need to get Solr to index "php 5" as a single token. That way a search for "php 5" will match, but a search for "blah 5", for example, will not.

The only way I was able to get this to work well was to use the Auto Phrasing filter by Lucidworks.

    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
        />
        <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory" phrases="autophrases.txt" includeTokens="true" replaceWhitespaceWith="_" />  
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
        />
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>

synonyms.txt

php5,php_5

protwords.txt (so the delimiter doesn't break it)

php5
php_5

You also have to change the query parser to use the Lucidworks parser.

solrconfig.xml

    <queryParser name="autophrasingParser" class="com.lucidworks.analysis.AutoPhrasingQParserPlugin">
        <str name="phrases">autophrases.txt</str>
        <str name="replaceWhitespaceWith">_</str>
        <str name="ignoreCase">false</str>
    </queryParser>
    <requestHandler name="/searchp" class="solr.SearchHandler">
        <lst name="defaults">
            <str name="echoParams">explicit</str>
            <int name="rows">10</int>
            <str name="df">Keywords</str>
            <str name="defType">autophrasingParser</str>
        </lst>
    </requestHandler>

autophrases.txt

php 5

The filter can be found here: https://github.com/LucidWorks/auto-phrase-tokenfilter

This article was also very helpful: http://lucidworks.com/2014/07/02/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/

Answer 2

This filter splits tokens at word delimiters.

In your case you can set splitOnNumerics="0", so it won't split on numbers.

splitOnNumerics:

(integer, default 1) If 0, don't split words on transitions from alpha to numeric: "FemBot3000" -> "Fem", "Bot3000"
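As a rough sketch, the change amounts to flipping one attribute on the filter. The field type name `text_nosplit` and the choice of whitespace tokenizer are assumptions for illustration only; the remaining attributes are copied from the configuration shown in Answer 1. A single `<analyzer>` without a `type` is used so that index and query analysis agree.

```xml
<!-- Hypothetical field type: the name "text_nosplit" and the tokenizer are assumptions. -->
<fieldType name="text_nosplit" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- splitOnNumerics="0": "Php5" stays one token instead of "Php" + "5" -->
        <filter class="solr.WordDelimiterFilterFactory"
            splitOnNumerics="0"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="1"
            preserveOriginal="1"/>
        <!-- Lowercase after the delimiter filter so case-change splitting still sees the original case -->
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
```

With this setting a query for "php5" should no longer produce a separate "5" term. Text written as "PHP 5" with a space is still tokenized into two terms, though, so matching that form would need a phrase query or the auto-phrasing approach from Answer 1.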

The rules for determining delimiters are described at the link below:

https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter
