Counting the word frequency in a Lucene index

Can someone help me find the frequency of a word across an entire Lucene index?
For example, if doc A contains word B 3 times and doc C contains it 2 times, I'd like a method that returns 5, the frequency of word B across the whole index.

Assuming you work with Lucene 3.x:

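IndexReader ir = IndexReader.open(dir);
TermDocs termDocs = ir.termDocs(new Term("your_field", "your_word"));
int count = 0;
try {
    // Visit every document that contains the term
    while (termDocs.next()) {
        count += termDocs.freq(); // occurrences of the term in the current document
    }
} finally {
    termDocs.close();
    ir.close();
}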

Some comments:

dir is an instance of the Lucene Directory class. Its creation differs for RAM and filesystem indexes; see the Lucene documentation for details.

"your_field" is the field in which to search for the term. If you have multiple fields, you can run the procedure for each of them or, alternatively, when you index your files, you can create a special field (e.g. "_content") that holds the concatenated values of all other fields.
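The "run the procedure for each field" option could be sketched roughly as below. This is only an illustration: MultiFieldSketch, totalAcrossFields and the perFieldCount callback are hypothetical names, and the callback stands in for the TermDocs loop shown above. A map-backed stand-in is used here so the sketch runs without Lucene on the classpath.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToLongBiFunction;

class MultiFieldSketch {
    /**
     * Sums the frequency of one word over several fields.
     * perFieldCount is assumed to return the total frequency of (field, word),
     * e.g. by running the TermDocs loop for new Term(field, word).
     */
    static long totalAcrossFields(List<String> fields, String word,
                                  ToLongBiFunction<String, String> perFieldCount) {
        long total = 0;
        for (String field : fields) {
            total += perFieldCount.applyAsLong(field, word);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical per-field totals standing in for a real index lookup.
        Map<String, Long> fake = new HashMap<String, Long>();
        fake.put("title:lucene", 1L);
        fake.put("body:lucene", 4L);
        ToLongBiFunction<String, String> counter =
                (field, word) -> fake.getOrDefault(field + ":" + word, 0L);

        long total = totalAcrossFields(Arrays.asList("title", "body"), "lucene", counter);
        System.out.println(total); // prints 5
    }
}
```

With a catch-all field such as "_content", the loop over fields is unnecessary and a single lookup on that field is enough.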

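Using Lucene 3.4:

A quick way to get the count with a bulk read, but you need two arrays :-/ Note that read() returns the number of entries it filled in (one per matching document), not the total frequency, so the total is the sum of the freqs entries. If more documents match than the arrays can hold, call read() again until it returns 0.

int[] docs = new int[1000];
int[] freqs = new int[1000];
int n = indexReader.termDocs(term).read(docs, freqs); // entries read, at most docs.length
int count = 0;
for (int i = 0; i < n; i++) count += freqs[i]; // sum the per-document frequencies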

Beware: read() advances the enumeration, so you cannot go back and iterate with next() afterwards. If read() consumed every entry, you are already at the end of the enumeration:

int[] docs = new int[1000];
int[] freqs = new int[1000];
TermDocs td = indexReader.termDocs(term);
int count = td.read(docs, freqs);
while (td.next()) { // false if read() already consumed every entry: we are at the end of the enumeration
}
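Because each read() call fills at most one buffer's worth of entries, a total over many documents needs a loop that keeps reading until 0 comes back. The sketch below shows that pattern only. FakeTermDocs is a hypothetical in-memory stand-in with the same read(int[], int[]) shape, not a Lucene class, so the example runs without an index; with Lucene 3.x the same loop would be applied to the TermDocs returned by indexReader.termDocs(term).

```java
/** Hypothetical in-memory stand-in for a Lucene 3.x TermDocs; not part of Lucene. */
class FakeTermDocs {
    private final int[] docIds;
    private final int[] termFreqs;
    private int pos = 0;

    FakeTermDocs(int[] docIds, int[] termFreqs) {
        this.docIds = docIds;
        this.termFreqs = termFreqs;
    }

    /** Copies up to docs.length entries into the buffers and returns how many were copied. */
    int read(int[] docs, int[] freqs) {
        int n = Math.min(docs.length, docIds.length - pos);
        System.arraycopy(docIds, pos, docs, 0, n);
        System.arraycopy(termFreqs, pos, freqs, 0, n);
        pos += n;
        return n;
    }
}

class BatchedReadSketch {
    /** Drains the enumeration in batches and sums the per-document frequencies. */
    static long totalFrequency(FakeTermDocs td, int bufferSize) {
        int[] docs = new int[bufferSize];
        int[] freqs = new int[bufferSize];
        long total = 0;
        int n;
        while ((n = td.read(docs, freqs)) > 0) {
            for (int i = 0; i < n; i++) {
                total += freqs[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Doc 0 contains the term 3 times, doc 2 contains it 2 times.
        FakeTermDocs td = new FakeTermDocs(new int[] {0, 2}, new int[] {3, 2});
        // A buffer of size 1 forces two read() calls before the enumeration is exhausted.
        System.out.println(totalFrequency(td, 1)); // prints 5
    }
}
```

The total is kept in a long here so that very frequent terms in large indexes do not overflow an int.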
