Lucene: Customize TokenStream
I am using Lucene to count words (see the example below).
My question is: how can I set my own filters in Lucene? For example, how do I add my own StopFilter, ShingleFilter, etc.?
I suppose some token stream filter is already being applied, since Hello, hello, and HELLO are all converted to "hello".
public class CountWordsExample {
    public static void main(String[] args) throws IOException {
        RAMDirectory directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(
                Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47)));
        Document document = new Document();
        document.add(new TextField("foo", "Hello hello how are you", Store.YES));
        document.add(new TextField("foo", "hello how are you", Store.YES));
        document.add(new TextField("foo", "HELLO", Store.YES));
        writer.addDocument(document);
        writer.commit();
        writer.close(true);

        // ShingleFilter shingle = new ShingleFilter(input);

        IndexReader indexReader = DirectoryReader.open(directory);
        Bits liveDocs = MultiFields.getLiveDocs(indexReader);
        Fields fields = MultiFields.getFields(indexReader);
        for (String field : fields) {
            TermsEnum termEnum = MultiFields.getTerms(indexReader, field)
                    .iterator(null);
            BytesRef bytesRef;
            while ((bytesRef = termEnum.next()) != null) {
                if (termEnum.seekExact(bytesRef)) {
                    DocsEnum docsEnum = termEnum.docs(liveDocs, null);
                    if (docsEnum != null) {
                        int doc;
                        while ((doc = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                            System.out.println(bytesRef.utf8ToString()
                                    + " in doc " + doc + ": "
                                    + docsEnum.freq());
                        }
                    }
                }
            }
        }
        for (String field : fields) {
            TermsEnum termEnum = MultiFields.getTerms(indexReader, field)
                    .iterator(null);
            BytesRef bytesRef;
            while ((bytesRef = termEnum.next()) != null) {
                int freq = indexReader.docFreq(new Term(field, bytesRef));
                System.out.println(bytesRef.utf8ToString() + " in " + freq
                        + " documents");
            }
        }
    }
}
Output:
hello in doc 0: 4
how in doc 0: 2
you in doc 0: 2
hello in 1 documents
how in 1 documents
you in 1 documents
So the answer was quite straightforward: the way to define my own token processing is to define my own Analyzer. For example:
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class NGramAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName,
            Reader reader) {
        // The Tokenizer is the source of the chain; each filter wraps
        // the stream below it. Note: the source passed to
        // TokenStreamComponents must be the same Tokenizer the filter
        // chain is built on, not a second one.
        Tokenizer source = new StandardTokenizer(Version.LUCENE_47, reader);
        TokenStream filter = new LowerCaseFilter(Version.LUCENE_47, source);
        return new TokenStreamComponents(source, filter);
    }
}
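To address the original question directly, the same pattern extends to any filter chain: each filter wraps the previous TokenStream, and the outermost stream is returned alongside the source Tokenizer. Below is a sketch against the same Lucene 4.7 API; the class name, the use of StandardAnalyzer's default stop-word set, and the shingle sizes are illustrative choices, not required ones.

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class ShingleStopAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName,
            Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_47, reader);
        TokenStream chain = new LowerCaseFilter(Version.LUCENE_47, source);
        // Drop stop words; the default English set here is an assumption.
        chain = new StopFilter(Version.LUCENE_47, chain,
                StandardAnalyzer.STOP_WORDS_SET);
        // Emit two-word shingles such as "hello how", "how are", ...
        chain = new ShingleFilter(chain, 2, 2);
        return new TokenStreamComponents(source, chain);
    }
}
```

The custom analyzer is then passed to the IndexWriterConfig in place of StandardAnalyzer, e.g. `new IndexWriterConfig(Version.LUCENE_47, new ShingleStopAnalyzer())`.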