NLTK word frequency in Python

Question

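Sorry, this is difficult for me: I have some basic code that computes word frequencies for a text and returns the result in "most_common" form. But the output is expressed in words.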

import nltk

def sym(senten):
    # senten is expected to be a list of word tokens
    stopwords = nltk.corpus.stopwords.words("english")
    # extra punctuation and words to ignore
    V = [",", ".", "'", "(", ")", '"', "'", ":", "it", "may", "I", "the", "but", "two", "this", "since", "for", "whether", "and", "?", "if", "even", "Mr.", "also", "at", "p."]
    content = [w for w in senten if w not in stopwords and w not in V]
    fd = nltk.FreqDist(content)
    fdc = fd.most_common(75)
    return fdc

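For further analysis I need frequency code whose output is expressed in sentences. That is, the output should show me sentences, selected on the basis of the frequency of the words they contain.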

I have an idea to do this by means of "tabulate". Here is some code (as an example):

S = ["proposition", "modus", "logic", "2"]  # the most frequent words (for example)
cfd = nltk.ConditionalFreqDist(
    (senten, S)
    for senten in senten
    for S in senten)
print cfd.tabulate(conditions = senten,
             samples=S)

It works, but it produces too much meaningless data about sentences that contain none of the frequent words.

I would appreciate any ideas that could solve my problem.

Answer 1

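Do it in two steps. You already have code that finds the most common words, which is good. Now build an index (a dictionary) that tells you, for each word, which sentences contain it. So the keys in this dictionary should be words, and the values will be lists of whole sentences, basically the reverse of what you tried. You will add each sentence multiple times (don't worry, it won't actually be copied, so this is efficient).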
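A minimal sketch of the index described above, assuming each sentence has already been tokenized into a list of words. The function name `build_index` and the sample sentences are hypothetical, chosen only for illustration.

```python
def build_index(sentences):
    """Map each word to the list of sentences that contain it."""
    index = {}
    for sentence in sentences:
        # set() so a sentence is listed once per word, even if the word repeats
        for word in set(sentence):
            index.setdefault(word, []).append(sentence)
    return index


# Hypothetical sample data: sentences already split into word tokens.
sentences = [
    ["modus", "ponens", "is", "a", "rule", "of", "logic"],
    ["a", "proposition", "is", "true", "or", "false"],
    ["logic", "studies", "valid", "inference"],
]

index = build_index(sentences)
for sentence in index["logic"]:
    print(" ".join(sentence))

# The index stores references to the sentence lists, not copies.
print(index["logic"][0] is sentences[0])
```

Because each entry is a reference to the original list, adding a sentence under many words costs one pointer per word rather than a copy of the sentence.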

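This dictionary doesn't need to count anything; you will just be looking words up. So you can use a plain dict or, for convenience, collections.defaultdict. The final step is to use your current function to get the list of most common words, and for each such word simply request all the sentences that contain it. Clear enough?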
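The two steps could be put together roughly as follows. This is only a sketch under a few assumptions: sentences are lists of word tokens, `collections.Counter` stands in for `nltk.FreqDist` (both provide `most_common`) so the example runs without NLTK data, and the `ignore` parameter stands in for the stopword list and `V` from the question. The function name `sentences_for_common_words` is hypothetical.

```python
from collections import Counter, defaultdict


def sentences_for_common_words(sentences, ignore=(), n=75):
    """Return (word, count, sentences containing it) for the n most common words."""
    ignore = set(ignore)
    # Step 1: count words; Counter stands in for nltk.FreqDist here.
    counts = Counter(w for s in sentences for w in s if w not in ignore)
    # Step 2: index from each word to the sentences that contain it.
    index = defaultdict(list)
    for s in sentences:
        for w in set(s):
            index[w].append(s)
    # Look up the sentences for each of the most common words.
    return [(w, c, index[w]) for w, c in counts.most_common(n)]


# Hypothetical sample data.
sentences = [
    ["logic", "studies", "valid", "inference"],
    ["a", "proposition", "is", "true", "or", "false"],
    ["modus", "ponens", "is", "a", "rule", "of", "logic"],
    ["every", "proposition", "in", "logic", "has", "a", "truth", "value"],
]
ignore = {"a", "is", "or", "of", "in"}

for word, count, found in sentences_for_common_words(sentences, ignore, n=2):
    print(word, count)
    for s in found:
        print("   ", " ".join(s))
```

Only sentences that contain at least one of the frequent words appear in the output, which avoids the empty rows that the tabulate approach produces.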

Solution 1
0 2015-09-03 16:30:04
