How do I count the 10 most common words across multiple lists of tokenized words?
I have a dataset containing many lists of tokenized words, for example:
['apple','banana','tomato']
['tomato','tree','pikachu']
I have around 40k lists like these, and I want to count the 10 most common words across all 40k lists combined.
Does anyone have any idea?
You could flatten the nested list with itertools.chain and take the most common words using Counter and its most_common method:
from itertools import chain
from collections import Counter
l = ['apple','banana','tomato'],['tomato','tree','pikachu']
Counter(chain(*l)).most_common(10)
# [('tomato', 2), ('apple', 1), ('banana', 1), ('tree', 1), ('pikachu', 1)]
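With around 40k lists, unpacking them all as arguments via chain(*l) works but builds one large argument tuple first. A minimal sketch of the same idea using chain.from_iterable, which walks the outer collection lazily, might look like this (the function name top_words and the variable sample are made up for illustration):

```python
from collections import Counter
from itertools import chain


def top_words(token_lists, n=10):
    """Return the n most frequent tokens across all lists as (word, count) pairs."""
    # chain.from_iterable iterates the outer collection lazily, so no large
    # argument tuple is built the way chain(*token_lists) would build one.
    return Counter(chain.from_iterable(token_lists)).most_common(n)


# Hypothetical sample data, mirroring the lists in the question
sample = [['apple', 'banana', 'tomato'], ['tomato', 'tree', 'pikachu']]
print(top_words(sample))
```

Because chain.from_iterable accepts any iterable, token_lists could also be a generator that yields one list at a time.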
A solution using a dictionary:
arrays = [['apple','banana','tomato'],['tomato','tree','pikachu']]
d = dict()
for array in arrays:
    for item in array:
        if item in d:
            d[item] += 1
        else:
            d[item] = 1
# Sort (word, count) pairs by count, highest first, and keep the top 10
print(sorted(d.items(), key=lambda kv: kv[1], reverse=True)[:10])
Output:
[('tomato', 2), ('apple', 1), ('banana', 1), ('tree', 1), ('pikachu', 1)]
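If only the top 10 are needed, sorting every distinct word is more work than necessary. As a rough sketch (not part of the answer above), the same plain-dictionary counting can be paired with heapq.nlargest, which only keeps track of the n best candidates. The names top_words_dict and sample are hypothetical:

```python
import heapq


def top_words_dict(token_lists, n=10):
    """Count tokens with a plain dict and return the n most frequent pairs."""
    counts = {}
    for tokens in token_lists:
        for token in tokens:
            # dict.get supplies 0 the first time a token is seen
            counts[token] = counts.get(token, 0) + 1
    # nlargest selects the n pairs with the highest count without
    # sorting the entire dictionary
    return heapq.nlargest(n, counts.items(), key=lambda pair: pair[1])


# Hypothetical sample data, mirroring the lists in the question
sample = [['apple', 'banana', 'tomato'], ['tomato', 'tree', 'pikachu']]
print(top_words_dict(sample))
```

For a vocabulary of a few thousand words the difference from a full sort is small; it matters more when the number of distinct tokens is large.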
I suggest merging your lists into a single list, e.g.:
list_of_lists = [['apple','banana','tomato'],['tomato','tree','pikachu']]
import itertools
flat_list = list(itertools.chain(*list_of_lists))
Then use Counter to count your tokens and pick just the top 10:
from collections import Counter
counter_of_flat_list = Counter(flat_list)
print(counter_of_flat_list.most_common(10)) # print top 10
[('tomato', 2), ('apple', 1), ('banana', 1), ('tree', 1), ('pikachu', 1)]
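Building flat_list copies every token into one new list. If that is a concern, one possible alternative is to skip the flattening step and feed each list to Counter.update, which adds the counts of its argument to the running totals. This is only a sketch; count_incrementally, sample, and stream are names invented for the example:

```python
from collections import Counter


def count_incrementally(token_lists):
    """Accumulate token counts one list at a time, without a flat copy."""
    totals = Counter()
    for tokens in token_lists:
        # Counter.update adds the counts of the given iterable to totals
        totals.update(tokens)
    return totals


# Hypothetical sample data, mirroring the lists in the question
sample = [['apple', 'banana', 'tomato'], ['tomato', 'tree', 'pikachu']]
# A generator stands in for lists that arrive one by one, e.g. from a file
stream = (tokens for tokens in sample)
totals = count_incrementally(stream)
print(totals.most_common(10))
```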