Find the most common words in a list of dictionaries in Python

I want to know how to get the most common words from a list of dictionaries. An example of the structure is shown below.

listDict = [{'longDescription': 'In the demo, a hip designer, a sharply-dressed marketer, and a smiling, relaxed developer sip lattes and calmly discuss how Flex is going to make customers happy and shorten the workday.'},
{'longDescription': 'In the demo, a hip designer, a sharply-dressed marketer'},
{'longDescription': 'Is going to make customers happy and shorten the workday.'},
{'longDescription': 'In the demo, a hip designer, a sharply-dressed marketer, and a smiling.'}]

The desired result is something like the following, ordered by most common word:

[('word1', 7), 
('word2', 7), 
('word3', 3), 
('word4', 3), 
('word5', 3), 
('word6', 2), 
('word7', 2)]

Here's an interesting approach: you can count the words in each string with Counter and then sum the resulting counters.

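from collections import Counter
import re

# Build one Counter per description, then add them all together.
# The raw string r'\W+' avoids an invalid-escape warning for '\W'.
counts = sum((Counter(filter(None, re.split(r'\W+', v.lower())))
              for x in listDict for v in x.values()), Counter())

print(counts.most_common(5))
# Output (the order of words with equal counts may differ between Python versions):
# [('a', 8), ('and', 5), ('the', 5), ('marketer', 3), ('designer', 3)]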
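Summing counters creates a new Counter at every step. As a minimal sketch of an alternative, assuming the same list-of-dictionaries shape, a single Counter can be updated in place. Note that this swaps re.split plus filter for re.findall, and that `sample` is a shortened, hypothetical stand-in for listDict.

```python
import re
from collections import Counter

# Shortened sample in the same shape as listDict (illustration only).
sample = [
    {'longDescription': 'In the demo, a hip designer, a sharply-dressed marketer'},
    {'longDescription': 'In the demo, a hip designer, a sharply-dressed marketer, and a smiling.'},
]

counts = Counter()
for entry in sample:
    for text in entry.values():
        # re.findall(r'\w+') returns the word runs directly, so no empty
        # strings need to be filtered out afterwards.
        counts.update(re.findall(r'\w+', text.lower()))

print(counts.most_common(1))  # [('a', 5)]
```

Both versions should give the same totals; the in-place update simply avoids building intermediate Counter objects.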

Regex Details

\W+   # one or more non-word characters (anything other than letters, digits, or underscore)

re.split splits the text wherever the regex pattern matches. filter(None, ...) then removes the empty strings that the split can leave behind (this part thanks to Ajax1234).
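To see why the filter step matters, here is a small sketch on a single hypothetical sentence (`text` is made up for illustration): when the text ends with punctuation, re.split returns a trailing empty string.

```python
import re

text = "In the demo, a hip designer."

# Splitting on runs of non-word characters leaves an empty string
# at the end because the text finishes with a delimiter (".").
raw_tokens = re.split(r"\W+", text.lower())
print(raw_tokens)  # ['in', 'the', 'demo', 'a', 'hip', 'designer', '']

# filter(None, ...) keeps only truthy items, dropping the empty string.
tokens = list(filter(None, raw_tokens))
print(tokens)      # ['in', 'the', 'demo', 'a', 'hip', 'designer']
```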

If it is reasonable to expect that each dictionary in the list has the same key (i.e. 'longDescription' in the example you give), only a few steps are necessary. While looping through each item in the list, you need to normalize the string (str.lower()), split it into words (str.split()), and then add each word to a word-count dictionary. Fortunately, each of these steps can be accomplished with built-in functions in Python.

from collections import defaultdict

# A defaultdict is nice because if a key is not already defined, the key
# will be added to the dictionary, and the value will go to a default. 
# Because we specify the default type to be an integer, that default value
# will be 0.
word_count = defaultdict(int)
for dictionary in listDict:
    clean_str = dictionary['longDescription'].lower()
    # Splitting on whitespace keeps punctuation attached, e.g. 'demo,'.
    words = clean_str.split()
    for word in words:
        word_count[word] += 1
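The loop only fills the dictionary; it does not yet produce the sorted list of (word, count) pairs asked for in the question, and a whitespace split leaves punctuation attached to words. A minimal sketch of those two remaining steps follows, assuming that stripping leading and trailing punctuation with str.strip is acceptable (hyphenated words such as 'sharply-dressed' stay intact). `sample` is hypothetical data for illustration.

```python
import string
from collections import defaultdict

# Shortened sample data, for illustration only.
sample = [
    {'longDescription': 'In the demo, a hip designer.'},
    {'longDescription': 'The demo is a demo'},
]

word_count = defaultdict(int)
for dictionary in sample:
    for word in dictionary['longDescription'].lower().split():
        # Strip leading/trailing punctuation so "demo," and "demo" match.
        word = word.strip(string.punctuation)
        if word:
            word_count[word] += 1

# Sort by count (descending), then alphabetically so ties are deterministic.
ranked = sorted(word_count.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked[:3])  # [('demo', 3), ('a', 2), ('the', 2)]
```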
