
SOLR: SynonymFilterFactory with stemming

If I understand correctly, SynonymFilterFactory does not stem synonyms in any way. To get good recall regardless of plural or tense, one would therefore have to list plurals and tenses exhaustively in the synonym file.

I see that the SynonymFilterFactory has an optional argument where it can accept an analyzer.

analyzer: (optional; default: WhitespaceTokenizerFactory) The name of the analyzer class to use when parsing the synonyms file. If analyzer is specified, then tokenizerFactory may not be, and vice versa.

I doubt that nesting the desired analyzer like so is valid:

<analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SnowballPorterFilterFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" > 
        <analyzer>
            <tokenizer class="solr.WhitespaceTokenizerFactory" />
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.SnowballPorterFilterFactory" />
            <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
        </analyzer>
    </filter>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>

I suspect that compiling an extension analyzer .jar and placing it in SOLR's lib folder might be the only way to do this. Is there a way to define a named analyzer in configuration, or another method to accomplish this goal?

This does not answer my original question (how to do this via configuration only), but it is the solution I ended up using, in case anyone else wants to do the same.

First, a custom analyzer that will be used to pre-process the synonyms coming in from the Synonym filter (most importantly, stemming them with Snowball):

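// Assumed import locations: LowerCaseFilter has lived in both
// org.apache.lucene.analysis and org.apache.lucene.analysis.core depending
// on the Lucene version, so adjust the import to match yours.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.snowball.SnowballFilter;

public class SnowballAnalyzer extends Analyzer {
    /**
     * Creates a
     * {@link org.apache.lucene.analysis.Analyzer.TokenStreamComponents} which
     * tokenizes text when given a reader.
     * 
     * @return A
     *         {@link org.apache.lucene.analysis.Analyzer.TokenStreamComponents}
     *         built from a {@link WhitespaceTokenizer} filtered with
     *         {@link LowerCaseFilter} and English {@link SnowballFilter}.
     */
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Split on whitespace only, like the field type's own tokenizer.
        Tokenizer source = new WhitespaceTokenizer();
        // Lowercase before stemming so the stemmer sees normalized input.
        TokenStream filter = new LowerCaseFilter(source);
        // Stem each synonym with the English Snowball stemmer.
        filter = new SnowballFilter(filter, "English");
        return new TokenStreamComponents(source, filter);
    }

}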

Package this class as a .jar and deploy it into your SOLR home's lib directory. Next, tell SOLR to use this analyzer in your synonym filter (usually in schema.xml or managed-schema), replacing your.package with the analyzer's actual package name:

<fieldType name="stemmedText" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" generateWordParts="1" catenateAll="0" catenateWords="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="0" generateNumberParts="1" generateWordParts="1" catenateAll="0" catenateWords="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" expand="true" analyzer="your.package.SnowballAnalyzer" ignoreCase="true" synonyms="synonyms.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>

And finally, using the above type on whatever fields you want:

<field name="keywords" type="stemmedText" indexed="true" stored="false"/>

With this example, each document's keywords field is stemmed at index time. When a query runs against that field, the term is stemmed first and then used to look up synonyms, which the custom analyzer has already stemmed. As a result, a synonym file with an incomplete list of plurals and tenses has a much better chance of producing a match.

Specific example

Synonym file entry: [dog,doggy,dogs,canids,canid,puppy,pups,pup]

Search term: puppies (notice that it's not in the synonym list)

Parsed query: SynonymQuery(Synonym(keywords:canid keywords:dog keywords:doggi keywords:pup keywords:puppi))
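
To make the mechanism behind that parsed query easier to follow, here is a minimal, self-contained sketch that needs no Lucene or SOLR on the classpath. It is only an illustration under stated assumptions: toyStem is a hypothetical suffix stripper that stands in for Snowball (it is not the real algorithm), and StemmedSynonymSketch, buildSynonymMap and expand are made-up names, not SOLR APIs. It shows that stemming both the synonym entries and the query term lets "puppies" find the group, while an unstemmed synonym list does not.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class StemmedSynonymSketch {

    // Hypothetical stand-in for Snowball: a crude suffix stripper that is
    // just good enough for the words in this example.
    static String toyStem(String term) {
        String t = term.toLowerCase(Locale.ROOT);
        if (t.endsWith("ies")) {
            return t.substring(0, t.length() - 3) + "i"; // puppies -> puppi
        }
        if (t.endsWith("y") && t.length() > 2) {
            return t.substring(0, t.length() - 1) + "i"; // doggy -> doggi
        }
        if (t.endsWith("s") && !t.endsWith("ss")) {
            return t.substring(0, t.length() - 1);       // dogs -> dog
        }
        return t;
    }

    // Maps every term of one synonym group to the whole group. With
    // stem == true the entries are stemmed first, which mimics passing a
    // stemming analyzer to the synonym filter.
    static Map<String, Set<String>> buildSynonymMap(List<String> group, boolean stem) {
        Set<String> normalized = new TreeSet<>();
        for (String s : group) {
            normalized.add(stem ? toyStem(s) : s.toLowerCase(Locale.ROOT));
        }
        Map<String, Set<String>> map = new HashMap<>();
        for (String s : normalized) {
            map.put(s, normalized);
        }
        return map;
    }

    // Mimics the query chain: stem the query term, then look up synonyms.
    // Falls back to the stemmed term alone when no group matches.
    static Set<String> expand(String queryTerm, Map<String, Set<String>> synonyms) {
        String stemmed = toyStem(queryTerm);
        return synonyms.getOrDefault(stemmed, Collections.singleton(stemmed));
    }

    public static void main(String[] args) {
        List<String> entry = Arrays.asList(
                "dog", "doggy", "dogs", "canids", "canid", "puppy", "pups", "pup");
        // Synonyms stemmed: the stemmed query term matches the group.
        System.out.println(expand("puppies", buildSynonymMap(entry, true)));
        // Synonyms left as written: the stemmed query term matches nothing.
        System.out.println(expand("puppies", buildSynonymMap(entry, false)));
    }
}
```

With stemmed synonyms the first call prints [canid, dog, doggi, pup, puppi], the same five terms as the parsed query above; with the raw synonym list the second call prints only [puppi], because "puppi" never appears in the unstemmed file.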
