I have worked out how to use Lucene's Porter Stemmer but would like to also retrieve the original, un-stemmed word. So, to this end, I added a CharTermAttribute to the TokenStream before creating the PorterStemFilter, as follows:
Analyzer analyzer = new StandardAnalyzer();
TokenStream original = analyzer.tokenStream("StandardTokenStream", new StringReader(inputText));
TokenStream stemmed = new PorterStemFilter(original);
CharTermAttribute originalWordAttribute = original.addAttribute(CharTermAttribute.class);
CharTermAttribute stemmedWordAttribute = stemmed.addAttribute(CharTermAttribute.class);
stemmed.reset();
while (stemmed.incrementToken()) {
System.out.println(stemmedWordAttribute+" "+originalWordAttribute);
}
Unfortunately, both attributes return the stemmed word. Is there a way to get the original word as well?
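Both of your attributes show the stemmed word because a TokenFilter shares its attributes with the stream it wraps, so original.addAttribute(...) and stemmed.addAttribute(...) return the same CharTermAttribute instance. To get both forms, combine Lucene's PorterStemFilter with Lucene's KeywordRepeatFilter. KeywordRepeatFilter emits each token twice, once flagged as a keyword and once not. PorterStemFilter leaves keyword-flagged tokens untouched, so the stream ends up containing both the unstemmed and the stemmed form of every word.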
Modifying your approach:
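Analyzer analyzer = new StandardAnalyzer();
TokenStream original = analyzer.tokenStream("StandardTokenStream", new StringReader(inputText));
// KeywordRepeatFilter emits every token twice: first flagged as a keyword, then unflagged.
TokenStream repeated = new KeywordRepeatFilter(original);
// PorterStemFilter skips keyword-flagged tokens, so only the second copy is stemmed.
TokenStream stemmed = new PorterStemFilter(repeated);
CharTermAttribute stemmedWordAttribute = stemmed.addAttribute(CharTermAttribute.class);
stemmed.reset();
while (stemmed.incrementToken()) {
    // First token of the pair: the original (keyword) form.
    String originalWord = stemmedWordAttribute.toString();
    // Second token of the pair: the stemmed form. Stop if the stream ends early.
    if (!stemmed.incrementToken()) {
        break;
    }
    String stemmedWord = stemmedWordAttribute.toString();
    System.out.println(originalWord + " " + stemmedWord);
}
// Finish the TokenStream workflow so the analyzer can reuse the stream.
stemmed.end();
stemmed.close();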
This is fairly crude, but shows the approach.
Example input:
testing giraffe book passing
Resulting output:
testing test
giraffe giraff
book book
passing pass
For each pair of tokens, if the second matches the first (book book), then there was no stemming.
Normally, you would use this with RemoveDuplicatesTokenFilter to remove the duplicate book term, but if you do that I think it becomes much harder to track the stemmed/unstemmed pairs. So for your specific scenario, I did not use that de-duplication filter.
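If you collect the terms into a list instead of printing them, the pair check described above (a pair whose two terms are equal was not stemmed) could be sketched roughly as follows. This is only an illustration in plain Java with no Lucene dependency: StemPairs, Pair, pair and wasStemmed are hypothetical names, not Lucene API, and the sketch assumes the terms strictly alternate original, stemmed, original, stemmed, which only holds when no de-duplication filter is in the chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Lucene): pairs up the alternating
// original/stemmed terms collected from the token stream.
final class StemPairs {

    // One original term, its stem, and a check for whether stemming changed it.
    static final class Pair {
        final String original;
        final String stemmed;

        Pair(String original, String stemmed) {
            this.original = original;
            this.stemmed = stemmed;
        }

        // True when the stemmer actually altered the term.
        boolean wasStemmed() {
            return !original.equals(stemmed);
        }
    }

    // Assumes terms alternate: original, stemmed, original, stemmed, ...
    static List<Pair> pair(List<String> terms) {
        if (terms.size() % 2 != 0) {
            throw new IllegalArgumentException("expected an even number of terms");
        }
        List<Pair> pairs = new ArrayList<>();
        for (int i = 0; i < terms.size(); i += 2) {
            pairs.add(new Pair(terms.get(i), terms.get(i + 1)));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Terms as they would arrive for the example input used earlier.
        List<String> terms = Arrays.asList(
                "testing", "test", "giraffe", "giraff", "book", "book", "passing", "pass");
        for (Pair p : pair(terms)) {
            System.out.println(p.original + " -> " + p.stemmed
                    + (p.wasStemmed() ? "" : " (unchanged)"));
        }
    }
}
```

Throwing on an odd number of terms is a deliberate choice here: it makes a broken pairing assumption fail loudly rather than silently shifting every later pair by one.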