Apache Lucene TokenStream contract violation
Using the Apache Lucene TokenStream to remove stop words causes an error:
TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
I use this code:
    public static String removeStopWords(String string) throws IOException {
        TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_47, new StringReader(string));
        TokenFilter tokenFilter = new StandardFilter(Version.LUCENE_47, tokenStream);
        TokenStream stopFilter = new StopFilter(Version.LUCENE_47, tokenFilter, StandardAnalyzer.STOP_WORDS_SET);
        StringBuilder stringBuilder = new StringBuilder();
        CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
        while (stopFilter.incrementToken()) {
            if (stringBuilder.length() > 0) {
                stringBuilder.append(" ");
            }
            stringBuilder.append(token.toString());
        }
        stopFilter.end();
        stopFilter.close();
        return stringBuilder.toString();
    }
But as you can see, I never call reset() or close(). So why am I getting this error?
I never call reset() or close().

Well, that is your problem. If you care to read the TokenStream javadoc, you would find the following:

The workflow of the new TokenStream API is as follows:
- Instantiation of TokenStream/TokenFilters which add/get attributes to/from the AttributeSource.
- The consumer calls TokenStream#reset().
- ...
I only had to add one line with reset() to your code and it worked.
    ...
    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    tokenStream.reset(); // I added this
    while (stopFilter.incrementToken()) {
    ...
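For reference, a sketch of the full corrected method, assuming Lucene 4.7 (the class and wrapper names are the ones from the question; the attribute is fetched via addAttribute on the outermost filter, which is the usual idiom, though getAttribute on the shared chain also works):

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class StopWordRemover {
    public static String removeStopWords(String string) throws IOException {
        TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_47, new StringReader(string));
        tokenStream = new StandardFilter(Version.LUCENE_47, tokenStream);
        TokenStream stopFilter = new StopFilter(Version.LUCENE_47, tokenStream, StandardAnalyzer.STOP_WORDS_SET);
        // All filters in the chain share one AttributeSource, so this
        // attribute is updated by every incrementToken() call below.
        CharTermAttribute token = stopFilter.addAttribute(CharTermAttribute.class);
        StringBuilder sb = new StringBuilder();
        stopFilter.reset();               // required before the first incrementToken()
        while (stopFilter.incrementToken()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(token.toString());
        }
        stopFilter.end();                 // finalize end-of-stream state
        stopFilter.close();               // release the underlying Reader
        return sb.toString();
    }
}
```

reset(), end(), and close() are called on the outermost stream (stopFilter); TokenFilter forwards each call down the chain, so the Tokenizer at the bottom is reset and closed as well.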
I met this error when reusing the same Tokenizer. The reason is in the comments of the code below. The solution is to set a new reader or create a new Tokenizer.
    /** Expert: Set a new reader on the Tokenizer. Typically, an
     *  analyzer (in its tokenStream method) will use
     *  this to re-use a previously created tokenizer. */
    public final void setReader(Reader input) {
        if (input == null) {
            throw new NullPointerException("input must not be null");
        } else if (this.input != ILLEGAL_STATE_READER) {
            throw new IllegalStateException("TokenStream contract violation: close() call missing");
        }
        this.inputPending = input;
        setReaderTestPoint();
    }