Counting term frequency in a Lucene index

Question

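Can someone help me find the frequency of a word across all documents in a Lucene index?
For example, if document A contains the word B 3 times and document C contains it 2 times, I want a method that returns 5, i.e. the frequency of word B across the whole Lucene index.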

Answer 1

This has been asked many times:

Answer 2

Assuming you are using Lucene 3.x:

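import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

IndexReader ir = IndexReader.open(dir);
TermDocs termDocs = ir.termDocs(new Term("your_field", "your_word"));
int count = 0;
while (termDocs.next()) { // one iteration per document containing the term
   count += termDocs.freq(); // occurrences of the term in that document
}
termDocs.close();
ir.close();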
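
As a library-free illustration of what the loop above computes, here is a minimal sketch in plain Java. It is not Lucene code: the class `TermFrequencySketch`, its methods, and the map-based postings are all hypothetical stand-ins, assuming a term's postings can be modeled as a map from document name to the number of occurrences in that document.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a postings list; none of these names are Lucene API.
final class TermFrequencySketch {

    // Sums the per-document frequencies of one term, mirroring the termDocs loop.
    static int totalFrequency(Map<String, Integer> freqByDoc) {
        int count = 0;
        for (int freq : freqByDoc.values()) {
            count += freq;
        }
        return count;
    }

    // Repeats the same sum for several fields and adds the results together.
    static int totalFrequencyAcrossFields(List<Map<String, Integer>> postingsPerField) {
        int count = 0;
        for (Map<String, Integer> postings : postingsPerField) {
            count += totalFrequency(postings);
        }
        return count;
    }

    // The example from the question: document A has the word 3 times, document C 2 times.
    static Map<String, Integer> questionExample() {
        Map<String, Integer> postings = new LinkedHashMap<>();
        postings.put("A", 3);
        postings.put("C", 2);
        return postings;
    }
}
```

Calling `TermFrequencySketch.totalFrequency(TermFrequencySketch.questionExample())` adds 3 and 2 and returns 5, the value the question asks for.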

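A few notes on the Lucene snippet above:

dir is an instance of Lucene's Directory class. How you create it differs for RAM-based and file-system indexes; see the Lucene documentation for details.

"your_field" is the field in which to look for the term. If you have multiple fields, you can run the procedure for each of them, or, when you index the documents, you can create a special field (e.g. "_content") that holds the concatenated values of all the other fields.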

Answer 3

Using Lucene 3.4

A simple way to get the count, but you need two arrays :-/

int[] docs = new int[1000];
int[] freqs = new int[1000];
// read() fills both arrays and returns the number of entries (documents) it read
int count = indexReader.termDocs(term).read(docs, freqs);

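Note: if you use read(), you can no longer use next(), because after read() you are already at the end of the enumeration (assuming the arrays were large enough to hold every entry):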

int[] docs = new int[1000];
int[] freqs = new int[1000];
TermDocs td = indexReader.termDocs(term);
int count = td.read(docs, freqs);
while (td.next()) { // always false, already at the end of the enumeration
}
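
Since read() reports how many entries it filled in rather than a summed frequency, the total for the term still has to be added up from the freqs array, and a term with more entries than the buffer holds needs repeated calls. The following is a minimal sketch of that loop in plain Java. `BatchPostings`, `BatchReadSketch`, and `inMemory` are hypothetical stand-ins, not Lucene types; the interface only assumes the behavior described above, namely that read() returns the number of entries written and returns 0 once the enumeration is exhausted.

```java
// Hypothetical stand-in for the batch-read part of a postings enumeration; not a Lucene type.
interface BatchPostings {
    // Fills docs and freqs, returns how many entries were written, 0 once exhausted.
    int read(int[] docs, int[] freqs);
}

final class BatchReadSketch {

    // Calls read() until it returns 0 and sums every frequency that was filled in.
    static long totalFrequency(BatchPostings postings, int bufferSize) {
        int[] docs = new int[bufferSize];
        int[] freqs = new int[bufferSize];
        long total = 0;
        int n;
        while ((n = postings.read(docs, freqs)) > 0) {
            for (int i = 0; i < n; i++) {
                total += freqs[i];
            }
        }
        return total;
    }

    // In-memory postings used only to exercise the loop above.
    static BatchPostings inMemory(int[] allDocs, int[] allFreqs) {
        return new BatchPostings() {
            private int pos = 0;

            @Override
            public int read(int[] docs, int[] freqs) {
                int n = Math.min(docs.length, allDocs.length - pos);
                for (int i = 0; i < n; i++) {
                    docs[i] = allDocs[pos + i];
                    freqs[i] = allFreqs[pos + i];
                }
                pos += n;
                return n;
            }
        };
    }
}
```

With a buffer of 1000 and only two matching documents, the loop body runs once; with a small buffer it simply takes more rounds and arrives at the same total.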

3 answers

Answer 1
9 2010-11-12 19:47:40

Answer 2
3 2010-11-12 19:48:21

Answer 3
1 2013-07-17 11:12:27
