Counting word and phrase frequencies with Python NLTK

Question

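I am using NLTK and trying to count the word phrases, up to a certain length, in a particular document, along with the frequency of each phrase. I tokenize the string to get a list of tokens.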

from nltk.util import ngrams
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.collocations import *


data = ["this", "is", "not", "a", "test", "this", "is", "real", "not", "a", "test", "this", "is", "this", "is", "real", "not", "a", "test"]

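# ngrams() yields each run of 2 consecutive tokens as a tuple
bigrams = ngrams(data, 2)

# Tally how many times each bigram occurs
bigrams_c = {}
for b in bigrams:
    if b not in bigrams_c:
        bigrams_c[b] = 1
    else:
        bigrams_c[b] += 1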

The code above produces the following counts:

(('is', 'this'), 1)
(('test', 'this'), 2)
(('a', 'test'), 3)
(('this', 'is'), 4)
(('is', 'not'), 1)
(('real', 'not'), 2)
(('is', 'real'), 2)
(('not', 'a'), 3)

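This is part of what I am looking for.

My question is: is there a more convenient way to do this for phrases of length 4 or 5, without duplicating this code and changing only the count variable?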

Answer 1

Since you tagged this nltk, here is how to do it with nltk's own methods, which have more features than the ones in the standard Python collections.

from nltk import ngrams, FreqDist
all_counts = dict()
for size in 2, 3, 4, 5:
    all_counts[size] = FreqDist(ngrams(data, size))

Each element of the dictionary all_counts is a dictionary of n-gram frequencies. For example, you can get the five most common trigrams like this:

all_counts[3].most_common(5)
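
To make the snippet above easier to try out, here is a minimal, self-contained sketch of the same idea. The helper name count_phrases and its sizes parameter are made up for illustration and are not part of nltk; the only nltk calls used are ngrams and FreqDist, applied to the data list from the question.

```python
from nltk import ngrams, FreqDist

data = ["this", "is", "not", "a", "test", "this", "is", "real", "not", "a",
        "test", "this", "is", "this", "is", "real", "not", "a", "test"]


def count_phrases(tokens, sizes=(2, 3, 4, 5)):
    """Return {n: FreqDist of n-grams} for each n in sizes (hypothetical helper)."""
    return {n: FreqDist(ngrams(tokens, n)) for n in sizes}


all_counts = count_phrases(data)

# A FreqDist is indexed like a dict: count of one specific bigram.
print(all_counts[2][("this", "is")])   # 4

# N() is the total number of n-grams counted for that size.
print(all_counts[3].N())               # 17

# The single most frequent trigram and its count.
print(all_counts[3].most_common(1))    # [(('not', 'a', 'test'), 3)]
```

Keeping one FreqDist per phrase length means each length can be queried on its own, and adding a length is just another entry in sizes.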

Answer 2

Yes, don't run this loop. Use collections.Counter(bigrams) or pandas.Series(bigrams).value_counts() to compute the counts.
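
As a rough illustration of the collections.Counter route, the sketch below counts phrases of several lengths using only the standard library. The helper ngram_counts and the zip-based way of building n-grams are assumptions made for this example, not something the answer prescribes; with nltk installed, Counter(ngrams(data, n)) should give the same counts. One thing to keep in mind: ngrams() returns an iterator, so if a loop has already run over bigrams it has to be rebuilt before being passed to Counter.

```python
from collections import Counter

data = ["this", "is", "not", "a", "test", "this", "is", "real", "not", "a",
        "test", "this", "is", "this", "is", "real", "not", "a", "test"]


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    # Zipping n staggered slices of the list yields each n-gram as a tuple;
    # zip stops at the shortest slice, so no partial n-grams are produced.
    return Counter(zip(*(tokens[i:] for i in range(n))))


# One Counter per phrase length, built without repeating the counting loop.
counts = {n: ngram_counts(data, n) for n in (2, 3, 4, 5)}

print(counts[2].most_common(1))            # [(('this', 'is'), 4)]
print(counts[3][("not", "a", "test")])     # 3
```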

Answer 1
12, accepted, 2016-11-19 13:22:26

Answer 2
2, 2016-11-18 04:14:01
