How to use a pattern tokenizer to index only words that start with a capital letter in Lucene

Question

I'm using Lucene 5.1.0, and I want my index writer to index only the terms that start with a capital letter. I looked into custom analyzers and the pattern tokenizer, but I couldn't work out how to use them to index only the words that start with a capital letter (or that consist entirely of capital letters). Any help would be appreciated.

Answer 1

I found this link helpful for wrapping my head around custom tokenizers/analyzers/filters: http://www.citrine.io/blog/2015/2/14/building-a-custom-analyzer-in-lucene

However, in your case I think it's easier to extend org.apache.lucene.analysis.util.FilteringTokenFilter instead of TokenFilter, because then the only method you have to implement is accept():

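import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.FilteringTokenFilter;

public class StartsWithCapitalTokenFilter extends FilteringTokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public StartsWithCapitalTokenFilter(TokenStream tokenStream) {
        super(tokenStream);
    }

    @Override
    protected boolean accept() throws IOException {
        // When accept() is called, my understanding is that termAtt.buffer() will
        // contain the particular string (in char[] form) of whichever token
        // is under consideration. Only the first termAtt.length() chars belong to
        // the token; the rest of the buffer may hold stale data, so reject empty
        // terms and pass the length as the limit.
        if (termAtt.length() == 0) {
            return false;
        }
        // This call gets the Unicode code point of the first character and
        // checks if it's uppercase.
        return Character.isUpperCase(
                Character.codePointAt(termAtt.buffer(), 0, termAtt.length()));

        // Or if you don't care about Unicode above U+FFFF, use the below.
        //return Character.isUpperCase(termAtt.buffer()[0]);
    }
}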
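If you want to try the check itself without Lucene on the classpath, here is a minimal plain-Java sketch. The class name CapitalCheck and both method names are made up for illustration; the (buffer, length) pair stands in for termAtt.buffer() and termAtt.length(). The second method is one possible reading of the "all of the letters" variant from the question, and it assumes digits and punctuation should count as "not capital".

```java
// Plain-Java sketch of the acceptance test; CapitalCheck is a hypothetical
// helper for experimenting, not a Lucene class.
final class CapitalCheck {

    private CapitalCheck() {}

    /** True if the first code point of the term is uppercase. */
    static boolean startsWithCapital(char[] buffer, int length) {
        if (length == 0) {
            return false; // empty term: nothing to check
        }
        // The three-argument overload stops at `length`, so chars beyond
        // the end of the term are never read.
        return Character.isUpperCase(Character.codePointAt(buffer, 0, length));
    }

    /** True if the term is non-empty and every code point in it is uppercase. */
    static boolean isAllCapitals(char[] buffer, int length) {
        if (length == 0) {
            return false;
        }
        for (int i = 0; i < length; ) {
            int cp = Character.codePointAt(buffer, i, length);
            if (!Character.isUpperCase(cp)) {
                return false;
            }
            i += Character.charCount(cp); // step over surrogate pairs as one unit
        }
        return true;
    }

    public static void main(String[] args) {
        char[] term = "Lucene".toCharArray();
        System.out.println(startsWithCapital(term, term.length)); // true
        System.out.println(isAllCapitals(term, term.length));     // false
    }
}
```

To index only all-capital terms, the body of accept() could presumably call the second check instead of the first.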

Then you'll need some kind of custom Analyzer to make use of the filter. This one runs a StandardTokenizer followed by only the new filter:

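import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class StartswithCapitalAnalyzer extends Analyzer {
    // In Lucene 5.x, createComponents takes only the field name; the Reader
    // parameter from 4.x is gone and the Analyzer sets the reader itself.
    @Override
    protected TokenStreamComponents createComponents(String field) {
        Tokenizer tokenizer = new StandardTokenizer();
        TokenStream filter = new StartsWithCapitalTokenFilter(tokenizer);

        // chain any other filters you want in here, like so (keep anything that
        // changes case after the capital check, or no token will pass it):
        //filter = new LowerCaseFilter(filter);

        return new TokenStreamComponents(tokenizer, filter);
    }
}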

That should all be functional, though I don't have an environment to test it out right now. Good luck!
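Since the question asked about the pattern tokenizer specifically: the same selection can also be written as a regular expression. Below is a minimal sketch using only java.util.regex, so it runs without Lucene. The class name CapitalizedWords and the method name extract are made up for illustration. Lucene's PatternTokenizer takes a java.util.regex.Pattern plus a group index, so a pattern like this one could in principle be passed to it with group 0, but that wiring is an assumption and is not shown or tested here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Regex sketch for picking out capitalized words; CapitalizedWords is a
// hypothetical helper, not a Lucene class.
final class CapitalizedWords {

    // An uppercase letter followed by any run of letters. The lookbehind
    // stops a match from starting in the middle of a word (e.g. "iPhone").
    static final Pattern CAPITALIZED = Pattern.compile("(?<!\\p{L})\\p{Lu}\\p{L}*");

    private CapitalizedWords() {}

    /** Returns every capitalized word in the text, in order of appearance. */
    static List<String> extract(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = CAPITALIZED.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [Quick, Fox, Paris]
        System.out.println(extract("the Quick brown Fox met iPhone users in Paris"));
    }
}
```

Note that this pattern treats an apostrophe as a word break, so "Don't" would yield "Don". The filter approach avoids that because it leaves tokenization to StandardTokenizer.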
