How do I count the 10 most common words in multiple lists of tokenized words?
How do I find the most common words in multiple separate texts?
A fairly simple question, really, but I can't seem to crack it. I have a string formatted in the following way:
["category1",("data","data","data")]
["category2", ("data","data","data")]
I call the collection of these categories posts, and I want to get the most frequently used words from the data part. So I tried:
from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
freq_dict = defaultdict(int)
for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        if token in freq_dict:
            freq_dict[token] += 1
        else:
            freq_dict[token] = 1
    # sorting and printing happen inside the loop, once per post
    top = sorted(freq_dict, key=freq_dict.get, reverse=True)
    top = top[:50]
    print(top)
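However, this gives me the top words PER post.
I need one general list of top words.
But if I move the print of top out of the for loop, it only gives me the results of the last post.
Does anyone have an idea?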
This is a scope problem. Also, you don't need to initialize the elements of a defaultdict, which simplifies your code.
Try it like this:
posts = [["category1",("data1 data2 data3")],["category2", ("data1 data3 data5")]]
from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
freq_dict = defaultdict(int)
for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        # defaultdict(int) starts missing keys at 0
        freq_dict[token] += 1
# sort once, after every post has been counted
top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print(top)
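As expected, this outputs
['data1', 'data3', 'data5', 'data2']
as the result. (The relative order of words with equal counts, here data5 and data2, is not guaranteed and may differ between Python versions.)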
If you really have something like
posts = [["category1",("data1","data2","data3")],["category2", ("data1","data3","data5")]]
as input, you don't need wordpunct_tokenize(), because the input data is already tokenized. In that case, the following works:
posts = [["category1",("data1","data2","data3")],["category2", ("data1","data3","data5")]]
from collections import defaultdict
freq_dict = defaultdict(int)
for cat, tokens in posts:
    # tokens is already a tuple of words, so count them directly
    for token in tokens:
        freq_dict[token] += 1
top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print(top)
It also outputs the expected result:
['data1', 'data3', 'data5', 'data2']
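Words with equal counts can come out in a different order depending on the Python version, because older versions do not guarantee dict iteration order. A minimal sketch of one way to make the ranking deterministic, assuming ties should be broken alphabetically (that tie-breaking rule and the name ranked are choices made for this illustration, not something the question asks for):

```python
from collections import defaultdict

posts = [["category1", ("data1", "data2", "data3")],
         ["category2", ("data1", "data3", "data5")]]

freq_dict = defaultdict(int)
for _cat, tokens in posts:
    for token in tokens:
        freq_dict[token] += 1

# Sort by count (descending), then alphabetically to break ties.
ranked = sorted(freq_dict.items(), key=lambda kv: (-kv[1], kv[0]))
top = [word for word, _count in ranked[:50]]
print(ranked)
print(top)
```

Sorting the (word, count) pairs instead of just the keys also keeps the counts available next to each word.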
Why not just use Counter?
In [30]: from collections import Counter
In [31]: data=["category1",("data","data","data")]
In [32]: Counter(data[1])
Out[32]: Counter({'data': 3})
In [33]: Counter(data[1]).most_common()
Out[33]: [('data', 3)]
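The session above counts a single post. A minimal sketch of extending it to several posts, assuming posts holds already-tokenized tuples as in the question (the sample data here is made up for illustration): one shared Counter is updated with each post's tokens, and most_common is called once at the end.

```python
from collections import Counter

# Hypothetical sample data in the question's [category, tokens] format.
posts = [
    ["category1", ("data1", "data2", "data3")],
    ["category2", ("data1", "data3", "data5")],
]

counts = Counter()
for _category, tokens in posts:
    # update() adds the counts of this post to the running totals
    counts.update(tokens)

# Ask for the top words only after every post has been counted.
top = counts.most_common(2)
print(top)
```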
from itertools import chain
from collections import Counter
from nltk.tokenize import wordpunct_tokenize
texts=["a quick brown car", "a fast yellow rose", "a quick night rider", "a yellow officer"]
# tokenize every text, flatten the tokens into one stream, and count them
print(Counter(chain.from_iterable(wordpunct_tokenize(x) for x in texts)).most_common(3))
Output (the order of the two tied words may differ between Python versions):
[('a', 4), ('yellow', 2), ('quick', 2)]
As you can see in the documentation for Counter.most_common, the returned list is already sorted.
To use this with your code, you can do
texts = (x[1] for x in posts)
or you can do
... wordpunct_tokenize(x[1]) for x in texts ...
If your posts actually look like this:
posts=[("category1",["a quick brown car", "a fast yellow rose"]), ("category2",["a quick night rider", "a yellow officer"])]
you can get rid of the categories:
texts = list(chain.from_iterable(x[1] for x in posts))
(texts will be ['a quick brown car', 'a fast yellow rose', 'a quick night rider', 'a yellow officer'].)
Then you can use it in the snippet at the top of this answer.
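Putting these pieces together might look like the sketch below. To keep it runnable without NLTK, it splits on whitespace with str.split() as a stand-in for wordpunct_tokenize; that substitution is an assumption of this example and treats punctuation differently from the NLTK tokenizer.

```python
from collections import Counter
from itertools import chain

posts = [
    ("category1", ["a quick brown car", "a fast yellow rose"]),
    ("category2", ["a quick night rider", "a yellow officer"]),
]

# Drop the categories and flatten the per-category lists of texts.
texts = chain.from_iterable(post[1] for post in posts)

# Stand-in tokenizer: str.split() instead of wordpunct_tokenize.
counts = Counter(chain.from_iterable(text.split() for text in texts))

print(counts.most_common(3))
```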
Just change your code so that all the posts are processed first, and then get the top words:
from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
freq_dict = defaultdict(int)
for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        freq_dict[token] += 1
# get top after all posts have been processed.
top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print(top)