Java | Lucene | TokenStream fields cannot be stored
In the application I receive a text on which I apply filters, and I would like to store this filtered result into a Lucene Document object. I do not care about the original text.
String stringToProcess = "...";
TokenStream stream = analyzer.tokenStream(null, new StringReader(stringToProcess));
TokenStream procStream = new CustomFilter(stream, opts);
Document luceneDocument = new Document();
FieldType ft = new FieldType(TextField.TYPE_STORED);
ft.setOmitNorms(false);
ft.setStoreTermVectors(true);
luceneDocument.add(new Field("content", procStream, ft));
This throws:
Exception in thread "main" java.lang.IllegalArgumentException: TokenStream fields cannot be stored
If I change the TextField.TYPE_STORED to TYPE_NOT_STORED, there's no exception. However, the content of the field is null.
There's a constructor for Field which clearly accepts a TokenStream object.
I can manually extract the tokens from the procStream with .incrementToken() and .getAttribute(CharTermAttribute.class).
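For reference, that manual extraction loop might look like the following sketch (it assumes the procStream from the code above, and follows the usual TokenStream contract: reset() before the first incrementToken(), then end() and close() when done):

```java
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Consume the filtered stream token by token and rebuild a plain string.
CharTermAttribute termAtt = procStream.addAttribute(CharTermAttribute.class);
StringBuilder filtered = new StringBuilder();
procStream.reset();                      // required before the first incrementToken()
while (procStream.incrementToken()) {
    if (filtered.length() > 0) {
        filtered.append(' ');
    }
    filtered.append(termAtt.toString()); // the text of the current token
}
procStream.end();
procStream.close();
String processedText = filtered.toString();
```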
My question: how can I pass the TokenStream to the Field object?
You can't just pass in a TokenStream and store the field. A TokenStream is a stream of analyzed, indexable tokens; the stored content of a field is the pre-analysis string. You are not providing that string to the field, so it has nothing suitable to store, hence the exception.
Instead, it would be more typical to set the Analyzer in the IndexWriterConfig, and let it handle analyzing the field for you. I'm guessing the reason you are doing it this way, instead of letting the IndexWriter handle it, is that you want to add that CustomFilter to an out-of-the-box analyzer. Instead, just create your own custom Analyzer. Analyzers are easy: just copy the source of the analyzer you want to use, and add your custom filter to the chain in createComponents. Say you're using StandardAnalyzer; then you'd change the createComponents method you copied to look like this:
@Override
protected TokenStreamComponents createComponents(final String fieldName) {
    final StandardTokenizer src = new StandardTokenizer();
    src.setMaxTokenLength(maxTokenLength);
    TokenStream tok = new StandardFilter(src);
    tok = new LowerCaseFilter(tok);
    tok = new StopFilter(tok, stopwords);
    tok = new CustomFilter(tok, opts); // just adding this line
    return new TokenStreamComponents(src, tok) {
        @Override
        protected void setReader(final Reader reader) {
            src.setMaxTokenLength(maxTokenLength);
            super.setReader(reader);
        }
    };
}
Then you can create your field like:
new Field("content", stringToProcess, ft);
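For completeness, a sketch of what the wrapped analyzer might look like. The class name CustomAnalyzer and the FilterOptions type are made up here for illustration, and the imports correspond roughly to Lucene 6.x/7.x (some filter packages moved in later versions, and StandardFilter was removed in Lucene 8):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class CustomAnalyzer extends Analyzer {
    private final FilterOptions opts; // placeholder: whatever type your CustomFilter expects

    public CustomAnalyzer(FilterOptions opts) {
        this.opts = opts;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer src = new StandardTokenizer();
        TokenStream tok = new StandardFilter(src);
        tok = new LowerCaseFilter(tok);
        tok = new CustomFilter(tok, opts); // your filter goes at the end of the chain
        return new TokenStreamComponents(src, tok);
    }
}
```

Then set it on the writer, e.g. `IndexWriterConfig config = new IndexWriterConfig(new CustomAnalyzer(opts));`, and analysis of the field happens automatically at indexing time.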
Okay, so I've assumed this is a bit of an XY problem. With the caveat that creating a custom analyzer is very likely the better solution, you actually can pass a TokenStream to the Field and store it as well; you just need to provide the string to store in addition to the token stream. That would look something like this:
Field myField = new Field("content", stringToProcess, ft);
myField.setTokenStream(procStream);