
Apache Lucene TokenStream contract violation

Using the Apache Lucene TokenStream to remove stopwords causes an error:

TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.

I use this code:

public static String removeStopWords(String string) throws IOException {
    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_47, new StringReader(string));
    TokenFilter tokenFilter = new StandardFilter(Version.LUCENE_47, tokenStream);
    TokenStream stopFilter = new StopFilter(Version.LUCENE_47, tokenFilter, StandardAnalyzer.STOP_WORDS_SET);
    StringBuilder stringBuilder = new StringBuilder();

    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);

    while(stopFilter.incrementToken()) {
        if(stringBuilder.length() > 0 ) {
            stringBuilder.append(" ");
        }

        stringBuilder.append(token.toString());
    }

    stopFilter.end();
    stopFilter.close();

    return stringBuilder.toString();
}

But as you can see, I never call reset() or close().

So why am I getting this error?

i never call reset() or close().

Well, that is your problem. If you care to read the TokenStream javadoc, you will find the following:

The workflow of the new TokenStream API is as follows:

  1. Instantiation of TokenStream / TokenFilter s which add/get attributes to/from the AttributeSource.
  2. The consumer calls TokenStream#reset().
  3. ...
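The full consuming workflow from that javadoc can be sketched as follows (a sketch only, assuming Lucene 4.x on the classpath; the analyzer, field name, and sample text are placeholders, not from the question):

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ConsumeWorkflow {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
        TokenStream stream = analyzer.tokenStream("field", new StringReader("some example text"));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class); // step 1: get attribute
        stream.reset();                       // step 2: mandatory before the first incrementToken()
        while (stream.incrementToken()) {     // step 3: consume the tokens
            System.out.println(term.toString());
        }
        stream.end();                         // step 4: finish consuming
        stream.close();                       // step 5: release resources
        analyzer.close();
    }
}
```

Every consumer has to follow these steps in this order; skipping reset() is exactly what triggers the "contract violation" exception from the question.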

I only had to add one line calling reset() to your code and it worked.

...    
CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
tokenStream.reset();   // I added this 
while(stopFilter.incrementToken()) {
...
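For completeness, the whole method with the missing reset() added might look like this (a sketch using the same Lucene 4.7 classes as the question; it also reads the attribute from stopFilter, the end of the chain, which is the more conventional spot, though the attributes are shared across the chain either way):

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class StopWords {
    public static String removeStopWords(String string) throws IOException {
        Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_47, new StringReader(string));
        TokenStream stopFilter = new StopFilter(Version.LUCENE_47,
                new StandardFilter(Version.LUCENE_47, tokenizer),
                StandardAnalyzer.STOP_WORDS_SET);
        CharTermAttribute token = stopFilter.getAttribute(CharTermAttribute.class);
        StringBuilder stringBuilder = new StringBuilder();

        stopFilter.reset();                   // the missing call
        while (stopFilter.incrementToken()) {
            if (stringBuilder.length() > 0) {
                stringBuilder.append(" ");
            }
            stringBuilder.append(token.toString());
        }
        stopFilter.end();
        stopFilter.close();

        return stringBuilder.toString();
    }
}
```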

I met this error when reusing the same Tokenizer. The reason is explained in the comments of the Lucene source below. The solution is to set a new reader on the Tokenizer, or to create a new Tokenizer.

  /** Expert: Set a new reader on the Tokenizer.  Typically, an
   *  analyzer (in its tokenStream method) will use
   *  this to re-use a previously created tokenizer. */
  public final void setReader(Reader input) {
    if (input == null) {
      throw new NullPointerException("input must not be null");
    } else if (this.input != ILLEGAL_STATE_READER) {
      throw new IllegalStateException("TokenStream contract violation: close() call missing");
    }
    this.inputPending = input;
    setReaderTestPoint();
  }
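In other words, when reusing a Tokenizer, the previous stream must be fully consumed and close()d before setReader() is called, and reset() must be called again before consuming the new input. A sketch of the reuse pattern (assuming Lucene 4.x on the classpath; the sample strings are placeholders):

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ReuseTokenizer {
    public static void main(String[] args) throws IOException {
        Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_47, new StringReader("first text"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);

        consume(tokenizer);                                  // first use
        tokenizer.close();                                   // required before reuse
        tokenizer.setReader(new StringReader("second text")); // throws IllegalStateException if close() was skipped
        consume(tokenizer);                                  // second use: reset() is mandatory again
        tokenizer.close();
    }

    private static void consume(Tokenizer tokenizer) throws IOException {
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            // process the current token via its attributes
        }
        tokenizer.end();
    }
}
```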
