
Get N terms with top TFIDF scores for each document in Lucene (PyLucene)

I am currently using PyLucene, but since there is no documentation for it, I guess a solution in Java for Lucene will also do (though if anyone has one in Python it would be even better).

I am working with scientific publications and, for now, I retrieve their keywords. However, some documents simply have no keywords. An alternative would be to take the N words (5-8) with the highest TFIDF scores.

I am not sure how to do this, and also when. By "when", I mean: do I have to tell Lucene at indexing time to compute these values, or is it possible to do it when searching the index?

What I would like to have for each query is something like this:

Query Ranking

Document1, top 5 TFIDF terms, Lucene score (default TFIDF)
Document2,     "       "    ,   "         "
... 

It would also be possible to first retrieve the ranking for the query, and then compute the top 5 TFIDF terms for each of those documents.

Does anyone have an idea how I should do this?

If a field is indexed, document frequencies can be retrieved with getTerms. If a field has stored term vectors, term frequencies can be retrieved with getTermVector.
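In raw PyLucene that maps onto calls like the following; a minimal sketch, assuming the Lucene 4.x API, an open IndexReader named reader, a document number docID, and a field named "contents":

from org.apache.lucene.index import MultiFields, Term

# All indexed terms of the field (with per-term document frequencies)
terms = MultiFields.getTerms(reader, "contents")
# Document frequency of a single term, e.g. "science"
df = reader.docFreq(Term("contents", "science"))
# Per-document term frequencies (requires stored term vectors)
tv = reader.getTermVector(docID, "contents")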

I also suggest looking at MoreLikeThis, which uses tf*idf to create a query similar to the document, from which you can extract the terms.
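A minimal sketch of that in PyLucene, assuming Lucene 4.x, an open IndexReader reader, the index-time Analyzer analyzer, and a document number docnum:

from org.apache.lucene.queries.mlt import MoreLikeThis

mlt = MoreLikeThis(reader)
mlt.setFieldNames(["contents"])   # fields to mine terms from
mlt.setAnalyzer(analyzer)         # should match the index-time analyzer
query = mlt.like(docnum)          # tf*idf-weighted query for the document
# The terms of this query are the document's most significant terms.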

And if you'd like a more pythonic interface, that was my motivation for lupyne:

from lupyne import engine

searcher = engine.IndexSearcher(<filepath>)
# Document frequency of every term in the field
df = dict(searcher.terms(<field>, counts=True))
# Term frequencies within one document (requires stored term vectors)
tf = dict(searcher.termvector(<docnum>, <field>, counts=True))
# tf*idf-weighted query similar to the document
query = searcher.morelikethis(<docnum>, <field>)
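From the df and tf dicts above, the top N terms could then be ranked by tf-idf; a sketch, assuming n_docs holds the total number of documents in the index (e.g. from the underlying IndexReader's numDocs()):

import math

tfidf = {t: tf[t] * math.log(float(n_docs) / df[t]) for t in tf if t in df}
top5 = sorted(tfidf, key=tfidf.get, reverse=True)[:5]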

After digging a bit in the mailing list, I ended up with what I was looking for.

Here is the method I came up with:

import math
import operator

from org.apache.lucene.index import Term, TermsEnum
from org.apache.lucene.util import BytesRefIterator

def getTopTFIDFTerms(docID, reader):
    termVector = reader.getTermVector(docID, "contents")
    termsEnumvar = termVector.iterator(None)
    termsref = BytesRefIterator.cast_(termsEnumvar)
    tc_dict = {}                     # Counts of each term in the doc
    dc_dict = {}                     # Number of docs containing each term
    tfidf_dict = {}                  # TF-IDF value of each term in the doc
    n_docs = reader.numDocs()        # Total number of docs in the index
    N_terms = 0
    try:
        while termsref.next():
            termval = TermsEnum.cast_(termsref)
            fg = termval.term().utf8ToString()       # Term as a unicode string
            tc = termval.totalTermFreq()             # Term count in the doc

            # Number of docs having this term in the index
            dc = reader.docFreq(Term("contents", termval.term()))
            N_terms += 1
            tc_dict[fg] = tc
            dc_dict[fg] = dc
    except Exception as e:
        print('error in term_dict: %s' % e)

    # Compute TF-IDF for each term
    for term in tc_dict:
        tf = tc_dict[term] / float(N_terms)
        idf = 1 + math.log(float(n_docs) / (dc_dict[term] + 1))
        tfidf_dict[term] = tf * idf

    # Sort the terms by descending TF-IDF
    sorted_x = sorted(tfidf_dict.items(), key=operator.itemgetter(1), reverse=True)

    # Get the top 5
    top5 = [i[0] for i in sorted_x[:5]]  # replace 5 by TOP N
    return top5
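To produce the ranking described in the question, the function can then be called on each hit after running the search; a minimal usage sketch, with searcher and query assumed to be an IndexSearcher and a parsed Query:

hits = searcher.search(query, 10)          # top 10 documents for the query
for scoreDoc in hits.scoreDocs:
    top_terms = getTopTFIDFTerms(scoreDoc.doc, reader)
    print('%s %s %s' % (scoreDoc.doc, top_terms, scoreDoc.score))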

I am not sure why I have to cast the termsEnum as a BytesRefIterator; I got this from a thread in the mailing list, which can be found here.

Hope this will help :)
