How do I count the 10 most common words across multiple lists of tokenized words?

I have a dataset containing many lists of tokenized words. For example:

['apple','banana','tomato']
['tomato','tree','pikachu']

I have around 40k lists like those, and I want to find the 10 most common words across all 40k lists combined.

Does anyone have any idea?

You could flatten the nested list with itertools.chain and take the most common words using Counter and its most_common method:

from itertools import chain
from collections import Counter

l = ['apple','banana','tomato'],['tomato','tree','pikachu']

Counter(chain(*l)).most_common(10)
# [('tomato', 2), ('apple', 1), ('banana', 1), ('tree', 1), ('pikachu', 1)]
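
With around 40k lists, unpacking them all as arguments with * can be avoided. The following is a minimal sketch rather than part of the answer above: the helper name top_words and the variable token_lists are hypothetical, and chain.from_iterable is used so the outer collection is walked lazily.

```python
from collections import Counter
from itertools import chain

def top_words(token_lists, n=10):
    """Return the n most common words across all token lists (hypothetical helper)."""
    # chain.from_iterable iterates the outer collection lazily,
    # so token_lists may also be a generator of lists.
    return Counter(chain.from_iterable(token_lists)).most_common(n)

token_lists = [['apple', 'banana', 'tomato'], ['tomato', 'tree', 'pikachu']]
print(top_words(token_lists, 1))
# [('tomato', 2)]
```

Calling top_words(token_lists) with the default n=10 returns at most ten (word, count) pairs, or fewer if the vocabulary is smaller.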

A solution using a dictionary:

arrays = [['apple','banana','tomato'],['tomato','tree','pikachu']]
d = dict()
for array in arrays:
    for item in array:
        if item in d:
            d[item] += 1
        else:
            d[item] = 1
# Sort (word, count) pairs by count, highest first, and keep the top 10
print(sorted(d.items(), key=lambda kv: kv[1], reverse=True)[:10])

Output:

[('tomato', 2), ('apple', 1), ('banana', 1), ('tree', 1), ('pikachu', 1)]
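
If the vocabulary is large, sorting every entry just to keep ten does more work than needed. As a sketch of one alternative (an assumption on my part, not something the answer above proposes), heapq.nlargest can select the top entries from the same kind of dictionary. The function name count_top is hypothetical.

```python
import heapq

def count_top(arrays, n=10):
    """Count words with a plain dict, then pick the n largest counts (hypothetical helper)."""
    counts = {}
    for array in arrays:
        for item in array:
            # dict.get supplies 0 for words not seen yet
            counts[item] = counts.get(item, 0) + 1
    # nlargest avoids sorting the whole vocabulary when n is small
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])

arrays = [['apple', 'banana', 'tomato'], ['tomato', 'tree', 'pikachu']]
print(count_top(arrays, 1))
# [('tomato', 2)]
```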

I suggest merging your lists into a single list, e.g.:

list_of_lists = [['apple','banana','tomato'],['tomato','tree','pikachu']]

import itertools
flat_list = list(itertools.chain(*list_of_lists))

Then use Counter to count your tokens and pick just the top 10:

from collections import Counter
counter_of_flat_list = Counter(flat_list)

print(counter_of_flat_list.most_common(10)) # print top 10

[('tomato', 2), ('apple', 1), ('banana', 1), ('tree', 1), ('pikachu', 1)]
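
If building the intermediate flat list is a concern, the counts can also be accumulated one list at a time. This is a hedged sketch that reuses the list_of_lists data from above; the variable name running_counter is hypothetical.

```python
from collections import Counter

list_of_lists = [['apple', 'banana', 'tomato'], ['tomato', 'tree', 'pikachu']]

running_counter = Counter()
for tokens in list_of_lists:
    # Counter.update adds the counts of the given iterable to the existing totals
    running_counter.update(tokens)

print(running_counter.most_common(1))
# [('tomato', 2)]
```

This form also suits data that arrives in batches, since the counter can be updated as each list becomes available.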
