Does gensim.corpora.Dictionary have term frequency saved?

Question

From gensim.corpora.Dictionary, it is possible to get the document frequency of a word (i.e. how many documents a particular word occurs in):

from nltk.corpus import brown
from gensim.corpora import Dictionary

documents = brown.sents()
brown_dict = Dictionary(documents)

# The 100th word in the dictionary: 'these'
print('The word "' + brown_dict[100] + '" appears in', brown_dict.dfs[100],'documents')

[out]:

The word "these" appears in 1213 documents

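There is also the filter_n_most_frequent(remove_n) function, which removes the n most frequent tokens:

filter_n_most_frequent(remove_n) Filter out the 'remove_n' most frequent tokens that appear in the documents.

After the pruning, shrink resulting gaps in word ids.

Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!

Does the filter_n_most_frequent function remove the n most frequent tokens based on the document frequency or on the term frequency?

If it is the latter, is there some way to access the term frequency of the words in the gensim.corpora.Dictionary object?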

Answer 1

No, gensim.corpora.Dictionary does not save the term frequency. You can see the source code here. The class only stores the following member variables:

    self.token2id = {}  # token -> tokenId
    self.id2token = {}  # reverse mapping for token2id; only formed on request, to save memory
    self.dfs = {}  # document frequencies: tokenId -> in how many documents this token appeared

    self.num_docs = 0  # number of documents processed
    self.num_pos = 0  # total number of corpus positions
    self.num_nnz = 0  # total number of non-zeroes in the BOW matrix

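This means that everything in the class defines frequency as document frequency, never term frequency, because the latter is never stored globally. This applies to filter_n_most_frequent(remove_n) as well as to every other method.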
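To make the distinction concrete, here is a minimal plain-Python sketch of the two counts. It does not call gensim; the toy documents and the variable names are made up for illustration. It only shows that "most frequent" can pick a different token depending on which frequency is used:

```python
from collections import Counter

# Toy corpus (hypothetical): each document is a list of tokens.
documents = [
    ["summit", "summit", "summit", "talks"],
    ["talks", "peace"],
    ["talks", "sumo", "sumo", "sumo", "sumo"],
]

# Term frequency: every occurrence counts.
term_freq = Counter(token for doc in documents for token in doc)

# Document frequency: each token counts at most once per document.
doc_freq = Counter(token for doc in documents for token in set(doc))

top_by_tf = term_freq.most_common(1)[0][0]
top_by_df = doc_freq.most_common(1)[0][0]

print("most frequent by term frequency:", top_by_tf)
print("most frequent by document frequency:", top_by_df)
```

Here "sumo" occurs most often overall, but "talks" appears in the most documents, so a filter that ranks by document frequency would drop "talks" first.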

Answer 2

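I had the same simple question. It appears that the frequency of the word is hidden and not accessible in the object. I am not sure why, since it makes testing and validation a pain. What I did was export the dictionary as text: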

dictionary.save_as_text('c:\\research\\gensimDictionary.txt')

In that text file there are three columns. For example, here are the words "summit", "summon" and "sumo":

Key     Word    Frequency
10      summit  1227
3658    summon  118
8477    sumo    40
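If you want to read such an export back into Python, the following is a minimal sketch. It assumes each entry line is tab-separated (id, word, count) and that the file may start with a line holding a single number, which recent gensim versions write for the document count. The function name parse_dictionary_text and the document count in the sample are hypothetical. Also note that, as far as I can tell from the gensim source, the count written by save_as_text is the document frequency.

```python
def parse_dictionary_text(text):
    """Parse id<TAB>word<TAB>count lines into {word: (token_id, count)}."""
    entries = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            # Skip anything that is not an entry line,
            # e.g. a leading line with only the number of documents.
            continue
        token_id, word, count = parts
        entries[word] = (int(token_id), int(count))
    return entries


# Sample content; the first line is a made-up document count.
sample = "5000\n10\tsummit\t1227\n3658\tsummon\t118\n8477\tsumo\t40\n"
entries = parse_dictionary_text(sample)
print(entries["summit"])
```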

I found a solution: .cfs holds the word frequencies. See https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary

print(str(dictionary[10]), str(dictionary.cfs[10]))

summit 1227

Simple.

Answer 3

Could you do something like this?

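import pandas as pd
from gensim import corpora

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(sent) for sent in documents]
# one row per document, one column per token id; missing tokens are NaN and skipped by sum()
tf_by_id = pd.DataFrame([dict(doc) for doc in corpus]).sum(axis=0).sort_index()
vocab = [dictionary[token_id] for token_id in tf_by_id.index] #list of terms, aligned with vocab_tf
vocab_tf = [int(count) for count in tf_by_id] #list of term frequencies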

Answer 4

The dictionary does not, but the corpus does.

from gensim import corpora

# Term frequency
# load dictionary
dictionary = corpora.Dictionary.load('YourDict.dict')
# load corpus
corpus = corpora.MmCorpus('YourCorpus.mm')
# per-document (term, frequency) pairs; a plain list, since documents differ in length
CorpusTermFrequency = [[(dictionary[id], freq) for id, freq in cp] for cp in corpus]

Answer 5

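An efficient way to compute term frequencies directly from the bag-of-words representation, rather than creating dense vectors: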

corpus = [dictionary.doc2bow(sent) for sent in documents]
vocab_tf={}
for i in corpus:
    for item,count in dict(i).items():
        if item in vocab_tf:
            vocab_tf[item]+=count
        else:
            vocab_tf[item] = count
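The same accumulation can be written more compactly with collections.Counter. This is only a sketch: the corpus below is a hypothetical stand-in shaped like the output of dictionary.doc2bow, i.e. a list of (token_id, count) pairs per document.

```python
from collections import Counter

# Hypothetical bag-of-words corpus, shaped like doc2bow output.
corpus = [
    [(0, 3), (1, 1)],
    [(1, 1), (2, 1)],
    [(1, 2), (3, 1)],
]

# Sum the per-document counts for each token id.
vocab_tf = Counter()
for doc in corpus:
    for token_id, count in doc:
        vocab_tf[token_id] += count

# Token ids with the highest total counts first.
print(vocab_tf.most_common(2))
```

Counter returns 0 for unseen keys, so the explicit membership check is not needed, and most_common gives a ranking for free.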

Answer 6

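gensim.corpora.Dictionary now stores the term frequency in its cfs attribute. You can see the documentation here.

cfs
Collection frequencies: token_id -> how many instances of this token are contained in the documents.
Type: dict of (int, int)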

6 answers

Answer 1    7    accepted    2017-10-17 05:51:36
Answer 2    3    2020-02-02 15:15:06
Answer 3    2    2017-12-28 17:01:34
Answer 4    0    2018-05-23 13:46:18
Answer 5    0    2018-08-28 11:00:05
Answer 6    0    2021-04-27 16:16:19
