
Solr stemming, stop words and shingles not giving expected output

I am trying to remove unwanted words, apply stemming, and finally create shingles. However, after the stop words are removed, I get shingles with "_" in place of the stop words. I tried using PatternReplaceFilterFactory to remove the _, but it is not working. My field type is defined as follows:

<fieldType name="common_shingle" class="solr.TextField">
    <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern=".*_.*" replacement=""/>
        <filter class="solr.ShingleFilterFactory" outputUnigrams="false" minShingleSize="3" maxShingleSize="3"/>
    </analyzer>
</fieldType>

When I analyse "A brown fox quickly jumps over the lazy dog", it gives me the following result:

  1. _ brown fox
  2. brown fox quickli
  3. fox quickli jump
  4. quickli jump _
  5. jump _ _
  6. _ _ lazi
  7. _ lazi dog

How do I remove the _ from the shingle tokens? Also, is there a way to create shingles only from stop words?

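That's because of the stop words: StopFilterFactory leaves a position gap for each word it removes, and those gaps show up as _ in the shingles. Set enablePositionIncrements to false and luceneMatchVersion to 4.3 (the option is not accepted for match versions 4.4 and later).

Replace your StopFilterFactory line with this: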

  <filter class="solr.StopFilterFactory" luceneMatchVersion="4.3" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
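For context, here is a sketch of how the whole field type might look with that line swapped in. It assumes a Solr 4.x installation (the 4.3 match version is what allows `enablePositionIncrements="false"`), and it leaves out the PatternReplaceFilterFactory line from the question, on the assumption that it is no longer needed once no position gaps reach the shingle filter.

```xml
<fieldType name="common_shingle" class="solr.TextField">
    <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <!-- No position gaps are left behind for removed stop words (Solr 4.x only) -->
        <filter class="solr.StopFilterFactory" luceneMatchVersion="4.3" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
        <filter class="solr.ShingleFilterFactory" outputUnigrams="false" minShingleSize="3" maxShingleSize="3"/>
    </analyzer>
</fieldType>
```

With the gaps gone, the sample sentence should produce shingles such as `quickli jump lazi` instead of `quickli jump _`. It is worth confirming this on the Analysis screen of your own Solr version.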

There is an improvement request with an available patch in Solr's Jira: https://issues.apache.org/jira/browse/SOLR-11604

Compile a new lucene-analyzers-common.jar with this patch and use the skipFillerTokens="true" option in your schema.xml:

<filter class="solr.ShingleFilterFactory" ... skipFillerTokens="true"/>

If you want this patch to be included in a future Solr release, vote for this Jira issue.

The _ is inserted by the ShingleFilter, which fills empty position increments with the token _.

If you want to remove it, you'll have to apply the PatternReplaceFilter after the ShingleFilter, because the _ doesn't exist in the token stream before that point.
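A minimal sketch of that ordering follows. The regular expression and the extra TrimFilterFactory are illustrative assumptions, not part of the original question. The pattern `.*_.*` from the question would blank out every shingle containing a filler, so this version only strips standalone `_` placeholders together with the space in front of them.

```xml
<filter class="solr.ShingleFilterFactory" outputUnigrams="false" minShingleSize="3" maxShingleSize="3"/>
<!-- Runs after shingling: removes a standalone "_" at the start of a shingle or after a space -->
<filter class="solr.PatternReplaceFilterFactory" pattern="(?:^| )_(?= |$)" replacement="" replace="all"/>
<!-- Drops the leading space left behind when the filler was the first word -->
<filter class="solr.TrimFilterFactory"/>
```

With this chain, `_ brown fox` should become `brown fox` and `jump _ _` should become `jump`. Note that the cleaned tokens are shorter than three words, so this hides the filler rather than rebuilding the shingles across the gap.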

Elasticsearch exposes an option to select the replacement character as "filler_token". Solr's implementation seems to use the Lucene implementation directly, so you should be able to set this yourself with fillerToken. Try fillerToken="" in your ShingleFilterFactory definition instead of using the PatternReplaceFilterFactory.
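As a sketch, the shingle filter line from the question would then look like this, assuming a Solr version whose ShingleFilterFactory accepts the `fillerToken` argument:

```xml
<!-- An empty fillerToken replaces the default "_" placeholder with nothing -->
<filter class="solr.ShingleFilterFactory" outputUnigrams="false" minShingleSize="3" maxShingleSize="3" fillerToken=""/>
```

The separator spaces around the now-empty filler may still be present in the output, so check the resulting tokens on the Analysis screen and add a trimming step if needed.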
