Top terms in a gensim corpus

Original post 2018-06-14 22:53:55 9 2 python/ gensim/ counting/ corpus

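I am using the Python package Gensim for clustering. I first created a dictionary from the tokenized and lemmatized sentences of the given text, then used that dictionary to build the corpus with the following code: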

 from gensim import corpora

 mydict = corpora.Dictionary(LemWords)
 corpus = [mydict.doc2bow(text) for text in LemWords]

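I understand that the corpus holds the IDs of the words together with their frequency in each document. What I want is the frequency of a given word across the whole corpus, so that I can find the top terms in the corpus. Is there any method that returns the frequency of a term over the entire corpus?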

2 answers

You can try this:

import itertools
from collections import defaultdict

total_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_count[word_id] += word_count

# Top ten words
sorted(total_count.items(), key=lambda x: x[1], reverse=True)[:10]
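
The snippet above returns word IDs rather than the words themselves. As a minimal sketch of how the IDs could be mapped back to tokens, the example below uses hand-written stand-ins so that it runs without gensim: `id2word` is a hypothetical plain dict playing the role of `mydict`, `corpus` is a toy list in the `(token_id, count)` shape that `doc2bow` produces, and `top_terms` is an illustrative helper name, not part of gensim.

```python
import itertools
from collections import Counter

# Hypothetical stand-ins for the question's objects: `id2word` plays the role
# of `mydict`, and `corpus` holds documents as (token_id, count) pairs.
id2word = {0: "cat", 1: "dog", 2: "fish"}
corpus = [
    [(0, 2), (1, 1)],
    [(1, 3), (2, 1)],
    [(0, 1), (1, 1), (2, 3)],
]


def top_terms(corpus, id2word, n=10):
    """Return the n most frequent (word, total_count) pairs in the corpus."""
    totals = Counter()
    # Flatten all documents into one stream of (token_id, count) pairs.
    for word_id, word_count in itertools.chain.from_iterable(corpus):
        totals[word_id] += word_count
    # Translate IDs back to words only for the terms that are reported.
    return [(id2word[word_id], count) for word_id, count in totals.most_common(n)]


print(top_terms(corpus, id2word, n=2))
```

With a real gensim dictionary, `mydict[word_id]` should be usable in place of `id2word[word_id]`, as the second answer does.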

Following your code:

 from collections import Counter

 mydict = corpora.Dictionary(LemWords)
 corpus = [mydict.doc2bow(text) for text in LemWords]

 # per-document word frequency, keyed by the word itself, if you want it
 wordfreq_doc = [{mydict[idw]: freq for idw, freq in cp}
                 for cp in corpus]

 # word frequency over the whole corpus
 wordfreq_all = Counter()
 for fwd in wordfreq_doc:
     wordfreq_all.update(fwd)
 wordfreq_all = wordfreq_all.most_common()

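I use both. The first one I join to my dictionary DataFrame, so that I can check, for example, whether LSA is working well. The second one I use to find stop words and to check how balanced the texts are.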
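
For the stop-word use mentioned above, one possible approach is to look at how many documents a term appears in, rather than only at its total count. The sketch below is an assumption about how that could be done, not the answerer's actual method: `id2word` and `corpus` are toy stand-ins for `mydict` and the `doc2bow` output, and `stop_word_candidates` with its `min_doc_share` threshold is a hypothetical helper.

```python
from collections import Counter

# Hypothetical stand-ins: `id2word` plays the role of `mydict`, and `corpus`
# holds documents as (token_id, count) pairs, the shape doc2bow produces.
id2word = {0: "the", 1: "cat", 2: "dog", 3: "fish"}
corpus = [
    [(0, 3), (1, 1)],
    [(0, 2), (2, 2)],
    [(0, 4), (1, 1), (3, 1)],
    [(0, 1), (2, 1)],
]


def stop_word_candidates(corpus, id2word, min_doc_share=0.8):
    """Return words that occur in at least `min_doc_share` of the documents."""
    doc_freq = Counter()
    for doc in corpus:
        # Each token ID appears at most once per bag-of-words document,
        # so this adds one per document that contains the token.
        doc_freq.update(word_id for word_id, _ in doc)
    n_docs = len(corpus)
    return sorted(
        id2word[word_id]
        for word_id, df in doc_freq.items()
        if df / n_docs >= min_doc_share
    )


print(stop_word_candidates(corpus, id2word))
```

The 0.8 threshold is arbitrary and would need tuning for a real corpus.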
