How to count the 10 most common words across multiple lists of tokenized words

Question

I have a dataset containing a large number of lists of tokenized words. For example:

['apple','banana','tomato']
['tomato','tree','pikachu']

I have around 40k lists like these, and I want to find the 10 most common words across all 40k lists combined.

Does anyone have an idea?

Answer 1

You could flatten the nested list with itertools.chain and get the most common words using Counter and its most_common method:

from itertools import chain
from collections import Counter

l = ['apple','banana','tomato'],['tomato','tree','pikachu']

Counter(chain(*l)).most_common(10)
# [('tomato', 2), ('apple', 1), ('banana', 1), ('tree', 1), ('pikachu', 1)]
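With roughly 40k lists, unpacking them all as arguments via `chain(*l)` works, but `chain.from_iterable` does the same flattening lazily and also accepts a generator. A minimal sketch, assuming a hypothetical helper name `top_words` that is not part of the original answer:

```python
from collections import Counter
from itertools import chain


def top_words(token_lists, n=10):
    """Return the n most common tokens across an iterable of token lists."""
    # chain.from_iterable walks the outer iterable lazily, so the lists
    # are not unpacked as thousands of separate positional arguments.
    return Counter(chain.from_iterable(token_lists)).most_common(n)


token_lists = [['apple', 'banana', 'tomato'], ['tomato', 'tree', 'pikachu']]
print(top_words(token_lists))
```

The result has the same shape as above: a list of (word, count) tuples, most frequent first.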

Answer 2

A solution using a plain dictionary:

arrays = [['apple','banana','tomato'],['tomato','tree','pikachu']]
d = dict()
for array in arrays:
    for item in array:
        if item in d:
            d[item] += 1
        else:
            d[item] = 1
# Sort (word, count) pairs by count, highest first, and keep the top 10
print(sorted(d.items(), key=lambda kv: kv[1], reverse=True)[:10])

Output

[('tomato', 2), ('apple', 1), ('banana', 1), ('tree', 1), ('pikachu', 1)]
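If the vocabulary is large, sorting every distinct word just to keep ten of them is more work than needed. The dictionary approach can be paired with `heapq.nlargest`, which selects only the top entries. This is a sketch rather than part of the original answer, and the function name `top_words_dict` is made up for illustration:

```python
import heapq


def top_words_dict(token_lists, n=10):
    """Count tokens with a plain dict, then pick the n highest counts."""
    counts = {}
    for tokens in token_lists:
        for token in tokens:
            # dict.get supplies 0 the first time a token is seen
            counts[token] = counts.get(token, 0) + 1
    # nlargest compares the (word, count) pairs by their count only
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])


arrays = [['apple', 'banana', 'tomato'], ['tomato', 'tree', 'pikachu']]
print(top_words_dict(arrays))
```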

Answer 3

I suggest merging your lists into a single list, e.g.:

list_of_lists = [['apple','banana','tomato'],['tomato','tree','pikachu']]

import itertools
flat_list = list(itertools.chain(*list_of_lists))

Then use Counter to count your tokens and pick just the top 10:

from collections import Counter
counter_of_flat_list = Counter(flat_list)

print(counter_of_flat_list.most_common(10)) # print top 10

[('tomato', 2), ('apple', 1), ('banana', 1), ('tree', 1), ('pikachu', 1)]
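Building `flat_list` keeps a second copy of every token in memory. If that matters, a Counter can also be filled one list at a time with its `update` method, so no flat list is needed. A small sketch under that assumption; `count_incrementally` is a hypothetical name:

```python
from collections import Counter


def count_incrementally(token_lists):
    """Accumulate token counts one list at a time, without a flat copy."""
    total = Counter()
    for tokens in token_lists:
        # update() adds one to the count of each token in this list
        total.update(tokens)
    return total


list_of_lists = [['apple', 'banana', 'tomato'], ['tomato', 'tree', 'pikachu']]
total = count_incrementally(list_of_lists)
print(total.most_common(10))  # print top 10
```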

3 answers

Answer 1
4 2019-05-15 15:06:01

Answer 2
0 2019-05-15 15:08:01

Answer 3
0 2019-05-15 15:08:19
