How to filter out words with low tf-idf in a corpus with gensim?

I am using gensim for an NLP task. I've created a corpus with dictionary.doc2bow, where dictionary is a corpora.Dictionary object. Now I want to filter out the terms with low tf-idf values before running an LDA model. I looked into the documentation of the corpus class but cannot find a way to access the terms. Any ideas? Thank you.

Say your corpus is the following:

corpus = [dictionary.doc2bow(doc) for doc in documents]

After running TFIDF you can retrieve the low-value words:

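from gensim.models import TfidfModel

tfidf = TfidfModel(corpus, id2word=dictionary)

low_value = 0.2
low_value_words = set()  # a set avoids storing the same id many times
for bow in corpus:
    # collect ids whose tf-idf score in this document is below the threshold
    low_value_words.update(term_id for term_id, value in tfidf[bow] if value < low_value)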

Then filter them out of the dictionary before running LDA:

dictionary.filter_tokens(bad_ids=low_value_words)

Now that the low-value words are filtered out, recompute the corpus:

new_corpus = [dictionary.doc2bow(doc) for doc in documents]
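
One thing to keep in mind with this approach is that the filter is global. A term whose tf-idf falls below the threshold in any single document is removed from the dictionary, so it disappears from every document, including those where it scored high. A minimal pure-Python sketch of that effect, using made-up tf-idf vectors (the ids and scores below are hypothetical, not gensim output):

```python
# Hypothetical tf-idf vectors as lists of (term_id, score) pairs,
# in the same shape that tfidf[bow] yields.
toy_tfidf = [
    [(0, 0.9), (1, 0.1)],   # term 1 scores low in this document
    [(1, 0.8), (2, 0.6)],   # but high in this one
]

low_value = 0.2

# Collect every id that falls below the threshold in at least one document.
low_value_ids = set()
for doc in toy_tfidf:
    low_value_ids.update(term_id for term_id, score in doc if score < low_value)

# Dropping those ids everywhere mimics filter_tokens plus recomputing the corpus.
filtered = [
    [(term_id, score) for term_id, score in doc if term_id not in low_value_ids]
    for doc in toy_tfidf
]

print(sorted(low_value_ids))  # [1]
print(filtered)               # [[(0, 0.9)], [(2, 0.6)]]
```

Term 1 is gone from the second document even though its score there was 0.8. If that is not what you want, filter each document separately instead.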

This is old, but if you want to filter on a per-document level, do something like this:

from gensim import corpora, models

# same as before
dictionary = corpora.Dictionary(doc_list)
corpus = [dictionary.doc2bow(doc) for doc in doc_list]
tfidf = models.TfidfModel(corpus, id2word=dictionary)

# filter low value words
low_value = 0.025

for i in range(len(corpus)):
    bow = corpus[i]
    # ids whose tf-idf score in this document is below the threshold
    low_value_words = {term_id for term_id, value in tfidf[bow] if value < low_value}
    # reassign the document without those ids
    corpus[i] = [b for b in bow if b[0] not in low_value_words]

This is essentially the same as the previous answers, but it additionally handles words that are missing from the tf-idf representation because their score is 0 (terms present in all documents). The previous answer did not filter such terms, so they still appeared in the final corpus.

from gensim import corpora, models

# Same as before
dictionary = corpora.Dictionary(doc_list)
corpus = [dictionary.doc2bow(doc) for doc in doc_list]
tfidf = models.TfidfModel(corpus, id2word=dictionary)

# Filter low value words and also words missing from the tf-idf model.
low_value = 0.025

for i in range(len(corpus)):
    bow = corpus[i]
    tfidf_ids = {term_id for term_id, value in tfidf[bow]}
    bow_ids = [term_id for term_id, value in bow]
    low_value_words = {term_id for term_id, value in tfidf[bow] if value < low_value}
    # Words with a tf-idf score of 0 are missing from tfidf[bow]
    words_missing_in_tfidf = {term_id for term_id in bow_ids if term_id not in tfidf_ids}

    new_bow = [b for b in bow if b[0] not in low_value_words and b[0] not in words_missing_in_tfidf]

    # reassign (inside the loop, so every document is updated)
    corpus[i] = new_bow
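
To see why such terms vanish from the tf-idf vector: as far as I know, with gensim's default weighting the global weight of a term is log2(N / df), where N is the number of documents and df is the number of documents containing the term. When df equals N the weight is exactly 0, and zero-weight entries are left out of the output. The sketch below reproduces only that idf formula in plain Python on a toy corpus; the helper name idf_by_term is hypothetical, and term-frequency scaling and normalization are omitted.

```python
import math
from collections import Counter

def idf_by_term(docs):
    """Return {term: log2(N / df)} for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log2(n_docs / count) for term, count in df.items()}

docs = [["cat", "sat", "the"], ["dog", "ran", "the"]]
idf = idf_by_term(docs)

print(idf["the"])  # 0.0, because "the" occurs in every document
print(idf["cat"])  # 1.0, because "cat" occurs in one of two documents
```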

Say you have a document tfidf_doc generated by gensim's TfidfModel(), with the corresponding bag-of-words document bow_doc, and you want to filter out the words whose tf-idf value falls in the lowest cut_percent of the words in this document (cut_percent is a fraction between 0 and 1). You can call tfidf_filter(tfidf_doc, cut_percent), and it will return a cut version of tfidf_doc:

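def tfidf_filter(tfidf_doc, cut_percent):
    # cut_percent is a fraction in [0, 1]; an empty document has nothing to cut
    if not tfidf_doc:
        return tfidf_doc

    sorted_by_tfidf = sorted(tfidf_doc, key=lambda tup: tup[1])
    # clamp the index so that cut_percent == 1 does not raise IndexError
    cut_index = min(int(len(sorted_by_tfidf) * cut_percent), len(sorted_by_tfidf) - 1)
    cut_value = sorted_by_tfidf[cut_index][1]

    # iterate backwards so that pop() does not shift unvisited items
    for i in range(len(tfidf_doc) - 1, -1, -1):
        if tfidf_doc[i][1] < cut_value:
            tfidf_doc.pop(i)

    # note: tfidf_doc is modified in place and also returned
    return tfidf_doc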

To then filter the document bow_doc by the resulting tfidf_doc, just call filter_bow_by_tfidf(bow_doc, tfidf_doc); it will return a cut version of bow_doc:

def filter_bow_by_tfidf(bow_doc, tfidf_doc):
    # Both lists are assumed to be sorted by term id.
    bow_idx = len(bow_doc) - 1
    tfidf_idx = len(tfidf_doc) - 1

    # Walk both lists from the end so that pop() does not shift unvisited items.
    while bow_idx >= 0:
        if tfidf_idx < 0 or bow_doc[bow_idx][0] > tfidf_doc[tfidf_idx][0]:
            # Term is absent from the filtered tf-idf document: drop it.
            bow_doc.pop(bow_idx)
            bow_idx -= 1
        elif bow_doc[bow_idx][0] == tfidf_doc[tfidf_idx][0]:
            # Term survived the tf-idf cut: keep it.
            bow_idx -= 1
            tfidf_idx -= 1
        else:
            # tf-idf id has no counterpart in the bag of words: skip it.
            tfidf_idx -= 1

    # note: bow_doc is modified in place and also returned
    return bow_doc
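
If mutating the inputs in place is not required, the same two steps can be written more compactly with a set lookup. This is an alternative sketch rather than the code above: the function name cut_bow_by_tfidf is hypothetical, it returns a new list, and it assumes cut_percent is a fraction between 0 and 1.

```python
def cut_bow_by_tfidf(bow_doc, tfidf_doc, cut_percent):
    """Return a copy of bow_doc without its lowest-scoring tf-idf terms."""
    if not tfidf_doc:
        return []
    scores = sorted(score for _, score in tfidf_doc)
    # Clamp the index so that cut_percent == 1.0 does not run off the end.
    cut_index = min(int(len(scores) * cut_percent), len(scores) - 1)
    cut_value = scores[cut_index]
    # Ids whose score reaches the cut value survive; terms absent from
    # tfidf_doc (zero score) are dropped as well.
    keep_ids = {term_id for term_id, score in tfidf_doc if score >= cut_value}
    return [(term_id, count) for term_id, count in bow_doc if term_id in keep_ids]

# Hypothetical data: term 3 has no tf-idf entry, term 0 has the lowest score.
bow = [(0, 2), (1, 1), (2, 4), (3, 1)]
tfidf_vec = [(0, 0.1), (1, 0.7), (2, 0.5)]

print(cut_bow_by_tfidf(bow, tfidf_vec, 0.5))  # [(1, 1), (2, 4)]
```

Because the lookup is by id rather than by position, this version does not depend on the two lists being sorted the same way.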
