
Get term frequencies in Lucene

Original post 2009-03-20 18:32:17 5 3 java/ full-text-search/ lucene

Is there a fast and easy way of getting term frequencies from a Lucene index, without doing it through the TermVectorFrequencies class, since that takes an awful lot of time for large collections?

What I mean is, is there something like TermEnum which has not just the document frequency but term frequency as well?

UPDATE: Using TermDocs is way too slow.

3 Answers

Use TermDocs to get the term frequency for a given document. As with the document frequency, you get the term documents from an IndexReader, using the term of interest.

You won't find a faster method than TermDocs without losing some generality. TermDocs reads directly from the ".frq" file in an index segment, where each term frequency is listed in document order.

If that's "too slow", make sure that you've optimized your index to merge multiple segments into a single segment. Iterate over the documents in order (skipping ahead is fine, but you can't jump back and forth in the document list efficiently).
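The forward-only access pattern described above can be sketched with a toy cursor. This is plain Java and does not call Lucene: the class `PostingsCursorSketch` and the helper `freqsFor` are hypothetical names invented for illustration, and the two parallel arrays stand in for a term's entries in document order. Only the method shapes (`next()`, `doc()`, `freq()`, `skipTo(int)`) are borrowed from TermDocs.

```java
/** Toy stand-in for one term's postings. NOT Lucene's TermDocs, only the same shape. */
final class PostingsCursorSketch {
    private final int[] docs;   // document ids, ascending (document order)
    private final int[] freqs;  // freqs[i] = occurrences of the term in docs[i]
    private int pos = -1;       // index of the current entry; -1 before the first next()

    PostingsCursorSketch(int[] docs, int[] freqs) {
        if (docs.length != freqs.length) {
            throw new IllegalArgumentException("docs and freqs must have the same length");
        }
        this.docs = docs.clone();
        this.freqs = freqs.clone();
    }

    /** Moves to the next entry; returns false once the list is exhausted. */
    boolean next() {
        if (pos < docs.length) {
            pos++;
        }
        return pos < docs.length;
    }

    /** Moves forward to the first later entry with doc id >= target. Never moves back. */
    boolean skipTo(int target) {
        do {
            if (!next()) {
                return false;
            }
        } while (docs[pos] < target);
        return true;
    }

    int doc() { return docs[pos]; }

    int freq() { return freqs[pos]; }

    /**
     * Looks up the term frequency for each requested document in one forward pass.
     * The requested ids must be ascending; documents without the term get 0.
     */
    static int[] freqsFor(PostingsCursorSketch cursor, int[] sortedDocIds) {
        int[] out = new int[sortedDocIds.length];
        boolean started = false;
        boolean more = false;
        for (int i = 0; i < sortedDocIds.length; i++) {
            int target = sortedDocIds[i];
            // Advance only when the cursor is still behind the requested document.
            if (!started || (more && cursor.doc() < target)) {
                more = cursor.skipTo(target);
                started = true;
            }
            out[i] = (more && cursor.doc() == target) ? cursor.freq() : 0;
        }
        return out;
    }
}
```

Because the cursor only ever advances, sorting the document ids you care about before the lookup lets a single pass answer all of them. Asking for ids in random order would mean re-opening the cursor for each backward jump.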

Your next step might be additional processing to create an even more specialized file structure that leaves out the SkipData. Personally, I would look for a better algorithm to achieve my objective, or provide better hardware: lots of memory, either to hold a RAMDirectory or to give to the OS for use by its own file-caching system.

The trunk version of Lucene (to be 4.0, eventually) now exposes the totalTermFreq() for each term from the TermsEnum. This is the total number of times this term appeared in all content (but, like docFreq, does not take into account deletions).

TermDocs gives the TF of a given term in each document that contains the term. You can get the DF by iterating through each <document, frequency> pair and counting the number of pairs, although TermEnum should be faster. IndexReader has a termDocs(Term) method that returns a TermDocs for the given Term and index.
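To make the relationship between these numbers concrete, here is a minimal sketch in plain Java. It does not use Lucene: `TermStatsSketch` is a hypothetical class, and the `int[][]` of {document, frequency} rows is an assumed stand-in for the pairs you would read from TermDocs. Counting the pairs gives the DF, and summing the frequencies gives the collection-wide count for the term.

```java
/** Toy model of one term's <document, frequency> pairs. Not part of Lucene. */
final class TermStatsSketch {

    /** DF: the number of documents containing the term, i.e. the number of pairs. */
    static int docFreq(int[][] postings) {
        return postings.length;
    }

    /** Collection-wide count: the sum of the per-document term frequencies. */
    static long totalTermFreq(int[][] postings) {
        long total = 0;
        for (int[] pair : postings) {
            total += pair[1]; // pair[0] is the document id, pair[1] the TF in that document
        }
        return total;
    }
}
```

For example, a term with the pairs {0, 3}, {2, 1}, {5, 4} appears in three documents and eight times overall. Both figures need one pass over the term's pairs, which is why a precomputed per-term total is cheaper when it is available.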