
Java | Lucene | TokenStream fields cannot be stored

In the application I receive a text, apply filters to it, and would like to store the filtered result in a Lucene Document object. I do not care about the original text.

String stringToProcess = "...";
TokenStream stream = analyzer.tokenStream(null, new StringReader(stringToProcess));
TokenStream procStream = new CustomFilter(stream, opts);

Document luceneDocument = new Document();
FieldType ft = new FieldType(TextField.TYPE_STORED);
ft.setOmitNorms(false);
ft.setStoreTermVectors(true);
luceneDocument.add(new Field("content", procStream, ft));

This throws:

Exception in thread "main" java.lang.IllegalArgumentException: TokenStream fields cannot be stored

If I change TextField.TYPE_STORED to TYPE_NOT_STORED, there is no exception. However, the content of the field is then null. There is a constructor for Field which clearly accepts a TokenStream object.

I can manually extract the tokens from procStream with .incrementToken() and .getAttribute(CharTermAttribute.class).
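For reference, the manual extraction loop looks roughly like the sketch below (assuming a recent Lucene on the classpath; the `reset()`/`end()` calls are required by the TokenStream contract, and the class and method names here are illustrative):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ConsumeTokens {
    // Drains a TokenStream into a list of term strings.
    public static List<String> tokens(String text) throws IOException {
        List<String> result = new ArrayList<>();
        try (StandardAnalyzer analyzer = new StandardAnalyzer();
             TokenStream stream = analyzer.tokenStream("content", new StringReader(text))) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();                    // required before the first incrementToken()
            while (stream.incrementToken()) {
                result.add(term.toString());
            }
            stream.end();                      // required by the TokenStream contract
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // StandardAnalyzer lowercases, so this prints [hello, lucene, world]
        System.out.println(tokens("Hello Lucene World"));
    }
}
```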

My question: how can I pass the TokenStream to the Field object?

You can't just pass in a TokenStream and store the field.

A TokenStream is a stream of analyzed, indexable tokens. The stored content of a field is the pre-analysis string. You are not providing that string to the field, so there is nothing suitable to store, hence the exception.

Instead, it would be more typical to set the Analyzer in the IndexWriterConfig and let it handle analyzing the field for you. I'm guessing the reason you are doing it this way, instead of letting the IndexWriter handle it, is that you want to add that CustomFilter to an out-of-the-box analyzer. In that case, just create your own custom Analyzer. Analyzers are easy: copy the source of the analyzer you want to use and add your custom filter to the chain in createComponents. Say you're using StandardAnalyzer; then you'd change the createComponents method you copied to look like this:

@Override
protected TokenStreamComponents createComponents(final String fieldName) {
  final StandardTokenizer src = new StandardTokenizer();
  src.setMaxTokenLength(maxTokenLength);
  TokenStream tok = new StandardFilter(src);
  tok = new LowerCaseFilter(tok);
  tok = new StopFilter(tok, stopwords);
  tok = new CustomFilter(tok, opts); //Just adding this line
  return new TokenStreamComponents(src, tok) {
    @Override
    protected void setReader(final Reader reader) {
      src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
      super.setReader(reader);
    }
  };
}
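Wired up end to end, the custom-Analyzer route might look like the sketch below. Hedged: CustomFilter is your own class, so a LowerCaseFilter stands in for it here; ByteBuffersDirectory assumes Lucene 8+ (older versions used RAMDirectory), and exact filter packages vary between Lucene versions.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class CustomAnalyzerDemo {

    // A minimal custom Analyzer: StandardTokenizer plus whatever filters you need.
    static final class MyAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            StandardTokenizer src = new StandardTokenizer();
            TokenStream tok = new LowerCaseFilter(src);
            // tok = new CustomFilter(tok, opts); // your filter would slot in here
            return new TokenStreamComponents(src, tok);
        }
    }

    // Indexes one document with the custom analyzer and reads back the stored value.
    public static String indexAndReadBack(String text) throws IOException {
        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new MyAnalyzer()))) {
            FieldType ft = new FieldType(TextField.TYPE_STORED);
            ft.setStoreTermVectors(true);
            ft.freeze();

            Document doc = new Document();
            doc.add(new Field("content", text, ft));
            writer.addDocument(doc);           // the writer runs MyAnalyzer for us
            writer.commit();

            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                // the stored value is the raw, pre-analysis string
                return reader.document(0).get("content");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(indexAndReadBack("Some Text To Index"));
    }
}
```

Because the analyzer is set once on the IndexWriterConfig, every text field goes through the same chain and the stored value is handled for you, which is exactly what the TokenStream-based Field constructor cannot do.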

Then you can create your field like:

new Field("content", stringToProcess, ft);

Okay, so I've assumed this is a bit of an XY problem. With the caveat that creating a custom analyzer is very likely the better solution, you actually can pass a TokenStream to the Field and store it as well; you just need to provide the string to store in addition to the TokenStream. That would look something like this:

Field myField = new Field("content", stringToProcess, ft);
myField.setTokenStream(procStream);
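A fuller sketch of that approach is below. Hedged: since CustomFilter is your own class, a WhitespaceTokenizer plus LowerCaseFilter stands in for your chain, and ByteBuffersDirectory assumes Lucene 8+. Note that the field type must be indexed and tokenized for setTokenStream to be accepted.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class StoredTokenStreamDemo {

    // Stores the raw string, but indexes the pre-built TokenStream
    // instead of letting the IndexWriter's analyzer run on this field.
    public static String storeWithTokenStream(String stringToProcess) throws IOException {
        Tokenizer src = new WhitespaceTokenizer();
        src.setReader(new StringReader(stringToProcess));
        TokenStream procStream = new LowerCaseFilter(src); // stand-in for CustomFilter

        FieldType ft = new FieldType(TextField.TYPE_STORED);
        ft.setStoreTermVectors(true);
        ft.freeze();

        Field myField = new Field("content", stringToProcess, ft);
        myField.setTokenStream(procStream); // overrides analysis for this one field

        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(myField);
            writer.addDocument(doc);
            writer.commit();
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                return reader.document(0).get("content"); // the stored, pre-analysis string
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(storeWithTokenStream("Hello TokenStream World"));
    }
}
```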
