
Python's NLTK word frequency in sents

Sorry if this is basic, but I'm stuck: I have code that computes word frequencies for some text and returns the output of "most_common". But it reports results as individual words.

import nltk

def sym(senten):
    # senten: a flat list of word tokens
    stopwords = nltk.corpus.stopwords.words("english")
    # punctuation and extra words to ignore (duplicate "'" removed)
    V = [",", ".", "'", "(", ")", '"', ":", "?",
         "it", "may", "I", "the", "but", "two", "this", "since", "for",
         "whether", "and", "if", "even", "Mr.", "also", "at", "p."]
    content = [w for w in senten if w not in stopwords and w not in V]
    fd = nltk.FreqDist(content)
    return fd.most_common(75)

For further analysis I need code that reports frequencies per sentence: the output should show me the sentences, selected by the frequency of the words they contain.

I had an idea to do it with "tabulate". Here is the code (for example):

S = ["proposition", "modus", "logic", "2"]  # the most frequent words (for example)
# sents: a list of tokenized sentences; tuples are used because
# ConditionalFreqDist conditions must be hashable
cfd = nltk.ConditionalFreqDist(
    (tuple(sent), word)
    for sent in sents
    for word in sent)
cfd.tabulate(conditions=[tuple(sent) for sent in sents], samples=S)

It works, but it produces too many pointless rows for sentences that contain none of the frequent words.

I'd be grateful for any ideas that could resolve my problem.

Do it in two steps. You already have code that finds the most frequent words, so that's good. Now build an index (a dictionary) that tells you, for each word, which sentences contain it. So the keys in this dictionary should be words, and each value will be a list of whole sentences -- basically the opposite of how you tried to do it. You'll add each sentence multiple times (don't worry, it doesn't actually get copied, so it's quite efficient).
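A minimal sketch of such an index, using collections.defaultdict. The sents list here is made-up sample data standing in for your real tokenized sentences:

```python
from collections import defaultdict

# Example tokenized sentences (stand-ins for your real data)
sents = [
    ["modus", "ponens", "is", "a", "rule", "of", "logic"],
    ["a", "proposition", "may", "be", "true"],
    ["logic", "studies", "valid", "inference"],
]

# Map each word to the list of sentences that contain it.
index = defaultdict(list)
for sent in sents:
    for word in set(sent):   # set() so each sentence is added once per word
        index[word].append(sent)
```

Since lists are stored by reference in Python, each sentence object appears in many buckets without being copied.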

This dictionary doesn't need to count anything -- you'll just be looking words up. So you can use an ordinary dict, or use collections.defaultdict for convenience. The final step is to obtain the list of the most common words using your current function, and for each such word simply request all the sentences that contain it. Clear enough?
