
Lucene TokenStream Exception

I've got exactly the same problem as this thread, so I'm opening a new question. Sorry to everyone for having answered in the linked thread, by the way.

So: I'm trying to avoid the java.lang.IllegalStateException: TokenStream contract violation.

My code is very similar to the one linked above:

protected TokenStreamComponents createComponents( String fieldName, Reader reader ) {

    String token;
    CharArraySet stopWords = new CharArraySet( Version.LUCENE_48, 0, false );
    stopWords.addAll( StopAnalyzer.ENGLISH_STOP_WORDS_SET );
    keepWords.addAll( getKeepWordList() );

    Tokenizer source = new StandardTokenizer( Version.LUCENE_48, reader );
    TokenStream filter = new StandardFilter( Version.LUCENE_48, source );
    filter = new StopFilter( Version.LUCENE_48, filter, stopWords );
    ShingleFilter shiFilter = new ShingleFilter( filter, 2, 3 );
    CharTermAttribute cta = shiFilter.addAttribute( CharTermAttribute.class );

    try {
        shiFilter.reset();
        while( shiFilter.incrementToken() ) {
            token = cta.toString();
            System.out.println( token );
        }
        shiFilter.end();
        shiFilter.close();
    }
    catch ( IOException ioe ) {
        ioe.printStackTrace();
    }
    return new TokenStreamComponents( source, filter );
}

I don't understand the proposed solution: what does "simply construct a new TokenStream" or "resetting the reader" mean? I've tried both solutions, for example adding:

source.setReader( reader );

Or changing to:

filter = new StopFilter( Version.LUCENE_48, filter, stopWords );
ShingleFilter shiFilter = new ShingleFilter( filter, 2, 3 );

But the error persists. Any suggestions?

I didn't understand exactly what you are trying to do. I believe you want to get the bigrams and trigrams from the token stream in addition to the unigrams. The following code fragment (which I developed after a bit of cleaning of your code) runs for me and is the standard way of doing this.

import java.io.*;
import org.apache.lucene.analysis.core.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.util.*;
import org.apache.lucene.analysis.util.*;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

class TestAnalyzer extends Analyzer {

    TestAnalyzer() {
        super();
    }

    protected TokenStreamComponents createComponents( String fieldName, Reader reader ) {
        Tokenizer source = new StandardTokenizer( Version.LUCENE_CURRENT, reader );
        TokenStream result = new ShingleFilter( source, 2, 3 );

        return new TokenStreamComponents( source, result );
    }
}

public class LuceneTest {

    public static void main(String[] args) {

        TestAnalyzer analyzer = new TestAnalyzer();

        try {
            TokenStream stream = analyzer.tokenStream("field", new StringReader("This is a damn test."));
            CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);

            stream.reset();

            // print all tokens until stream is exhausted
            while (stream.incrementToken()) {
                System.out.println(termAtt.toString());
            }

            stream.end();
            stream.close();
        }
        catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}

Your problem is that you are consuming everything in the filter and closing it before passing it back from the createComponents call.

You are, I assume, trying to debug how analysis works with this:

try {
    shiFilter.reset();
    while( shiFilter.incrementToken() ) {
        token = cta.toString();
        System.out.println( token );
    }
    shiFilter.end();
    shiFilter.close();
} 
catch ( IOException ioe ) {
    ioe.printStackTrace();
}

Note, though, that when you are done with it, shiFilter is at the end of the stream and has been closed. You then pass it back from the method in the TokenStreamComponents, and Lucene will attempt to use it to index the documents. It will call reset() first, and will throw the indicated exception for trying to use an already closed resource.
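The lifecycle that the exception enforces can be sketched with a minimal stand-in (this is a simplified mock, not Lucene's actual TokenStream class): the consumer contract is reset() → incrementToken()* → end() → close(), and calling reset() on an already closed stream must fail, which is exactly what happens when the indexer receives the stream your debug loop has consumed and closed.

```java
import java.util.Iterator;
import java.util.List;

// Simplified mock of the TokenStream consumer lifecycle (assumption: not Lucene's real class).
class MockTokenStream {
    private enum State { CREATED, RESET, CLOSED }
    private State state = State.CREATED;
    private final List<String> tokens;
    private Iterator<String> it;
    private String current;

    MockTokenStream(List<String> tokens) { this.tokens = tokens; }

    void reset() {
        // Reusing a closed stream violates the contract, as in the question's stack trace.
        if (state == State.CLOSED)
            throw new IllegalStateException("TokenStream contract violation: reset() after close()");
        it = tokens.iterator();
        state = State.RESET;
    }

    boolean incrementToken() {
        if (state != State.RESET)
            throw new IllegalStateException("incrementToken() called before reset()");
        if (!it.hasNext()) return false;
        current = it.next();
        return true;
    }

    String current() { return current; }
    void end()   { /* Lucene would record end-of-stream attributes here */ }
    void close() { state = State.CLOSED; }
}

public class ContractDemo {
    public static void main(String[] args) {
        MockTokenStream ts = new MockTokenStream(List.of("this", "is", "a", "test"));

        // First (legal) pass: reset, consume, end, close -- what the debug loop did.
        ts.reset();
        while (ts.incrementToken()) System.out.println(ts.current());
        ts.end();
        ts.close();

        // Second pass, as Lucene's indexer would attempt: reset() now throws.
        try {
            ts.reset();
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

The same stream object cannot serve both the debug loop and the indexer; whichever consumer runs second hits the closed stream.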

If you want to debug this, I'd recommend just creating an instance of your custom analyzer and calling analyzer.tokenStream to get the stream for debugging output. If you really need to debug by iterating through the filter instance, rather than the analyzer, you'll need to build a separate one, rather than consuming the stream in createComponents.

Posts on this site follow the CC BY-SA 4.0 license; please credit the source when reposting.

© 2020-2024 STACKOOM.COM