Lucene Porter Stemmer - get original unstemmed word
I have worked out how to use Lucene's Porter Stemmer, but I would also like to retrieve the original, unstemmed word. To this end, I added a CharTermAttribute to the TokenStream before creating the PorterStemFilter, as follows:
Analyzer analyzer = new StandardAnalyzer();
TokenStream original = analyzer.tokenStream("StandardTokenStream", new StringReader(inputText));
TokenStream stemmed = new PorterStemFilter(original);
CharTermAttribute originalWordAttribute = original.addAttribute(CharTermAttribute.class);
CharTermAttribute stemmedWordAttribute = stemmed.addAttribute(CharTermAttribute.class);
stemmed.reset();
while (stemmed.incrementToken()) {
System.out.println(stemmedWordAttribute+" "+originalWordAttribute);
}
Unfortunately, both attributes return the stemmed word. Is there a way to get the original word as well?
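Both of your variables refer to the same CharTermAttribute instance, because a TokenFilter shares its attributes with the stream it wraps. That is why they always print the same text.
Lucene's PorterStemFilter can be combined with Lucene's KeywordRepeatFilter. KeywordRepeatFilter emits each token twice, once flagged as a keyword and once not. PorterStemFilter leaves keyword tokens untouched, so the stream ends up carrying both the unstemmed and the stemmed form of every word.
Modifying your approach: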
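Analyzer analyzer = new StandardAnalyzer();
TokenStream original = analyzer.tokenStream("StandardTokenStream", new StringReader(inputText));
// Emit every token twice: first flagged as a keyword, then as a normal token.
TokenStream repeated = new KeywordRepeatFilter(original);
TokenStream stemmed = new PorterStemFilter(repeated);
CharTermAttribute stemmedWordAttribute = stemmed.addAttribute(CharTermAttribute.class);
stemmed.reset();
while (stemmed.incrementToken()) {
    // First token of the pair: the keyword copy, which the stemmer skips.
    String originalWord = stemmedWordAttribute.toString();
    // Second token of the pair: the stemmed copy.
    stemmed.incrementToken();
    String stemmedWord = stemmedWordAttribute.toString();
    System.out.println(originalWord + " " + stemmedWord);
}
// Finish and release the stream, as the TokenStream contract expects.
stemmed.end();
stemmed.close();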
This is fairly crude, but it shows the approach.
Example input:
testing giraffe book passing
Resulting output:
testing test
giraffe giraff
book book
passing pass
For each pair of tokens, if the second matches the first (book book), then there was no stemming.
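To make that comparison concrete, here is a minimal sketch in plain Java (no Lucene needed) that groups a flat list of terms, collected in the order the stream emitted them, into original/stem pairs and flags the ones stemming left alone. StemPairReport, Pair and toPairs are hypothetical names made up for this illustration; they are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

public class StemPairReport {

    /** Hypothetical holder for one original/stem pair. */
    public static final class Pair {
        public final String original;
        public final String stem;

        Pair(String original, String stem) {
            this.original = original;
            this.stem = stem;
        }

        /** True when stemming actually altered the word. */
        public boolean changed() {
            return !original.equals(stem);
        }
    }

    /**
     * Groups terms two at a time, assuming the order original, stem,
     * original, stem, ... with no de-duplication applied to the stream.
     */
    public static List<Pair> toPairs(List<String> terms) {
        if (terms.size() % 2 != 0) {
            throw new IllegalArgumentException("expected an even number of terms");
        }
        List<Pair> pairs = new ArrayList<>();
        for (int i = 0; i < terms.size(); i += 2) {
            pairs.add(new Pair(terms.get(i), terms.get(i + 1)));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // The eight terms from the example output above, in emission order.
        List<String> terms = List.of(
            "testing", "test", "giraffe", "giraff", "book", "book", "passing", "pass");
        for (Pair p : toPairs(terms)) {
            System.out.println(p.original + " -> " + p.stem
                + (p.changed() ? "" : " (unchanged)"));
        }
    }
}
```

The even-length check is there because this grouping only works while every original is followed by exactly one stem.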
Normally, you would use KeywordRepeatFilter together with RemoveDuplicatesTokenFilter to remove the duplicate book term. If you do that, though, I think it becomes much harder to track the stemmed/unstemmed pairs, so for your specific scenario I did not use that de-duplication filter.
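If counting tokens in twos feels too fragile, one possible refinement is to read KeywordAttribute alongside the term: KeywordRepeatFilter flags the first copy of each token as a keyword, so the flag tells you which copy is the original. The following is only a sketch, assuming a recent Lucene with lucene-core and the analysis-common module on the classpath. The class name StemPairs, the method stemPairs and the field name "body" are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

public class StemPairs {

    /** Returns {original, stem} pairs for every token in inputText. */
    public static List<String[]> stemPairs(String inputText) throws IOException {
        List<String[]> pairs = new ArrayList<>();
        try (Analyzer analyzer = new StandardAnalyzer();
             TokenStream tokens = new PorterStemFilter(new KeywordRepeatFilter(
                 analyzer.tokenStream("body", new StringReader(inputText))))) {
            CharTermAttribute term = tokens.addAttribute(CharTermAttribute.class);
            KeywordAttribute keyword = tokens.addAttribute(KeywordAttribute.class);
            tokens.reset();
            String original = null;
            while (tokens.incrementToken()) {
                if (keyword.isKeyword()) {
                    // Keyword copy: the stemmer skipped it, so this is the original word.
                    original = term.toString();
                } else if (original != null) {
                    // Non-keyword copy: the stemmed form of the word just seen.
                    pairs.add(new String[] {original, term.toString()});
                    original = null;
                }
            }
            tokens.end();
        }
        return pairs;
    }

    public static void main(String[] args) throws IOException {
        for (String[] pair : stemPairs("testing giraffe book passing")) {
            System.out.println(pair[0] + " " + pair[1]);
        }
    }
}
```

For the sample input above this should give the same four pairs as the simpler loop. The difference is that each token is identified by its keyword flag rather than by its position in the stream.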