Find the most common words in a list of dictionaries in Python
I would like to know how to get the most common words from a list of dictionaries. An example of the structure is shown below.
listDict = [{'longDescription': 'In the demo, a hip designer, a sharply-dressed marketer, and a smiling, relaxed developer sip lattes and calmly discuss how Flex is going to make customers happy and shorten the workday.'},
{'longDescription': 'In the demo, a hip designer, a sharply-dressed marketer'},
{'longDescription': 'Is going to make customers happy and shorten the workday.'},
{'longDescription': 'In the demo, a hip designer, a sharply-dressed marketer, and a smiling.'}]
The desired result is something like the following, sorted by the most frequent words:
[('word1', 7),
('word2', 7),
('word3', 3),
('word4', 3),
('word5', 3),
('word6', 2),
('word7', 2)]
Here is a fun way to do it: you can build a Counter for each individual item and then sum them.
from collections import Counter
import re
counts = sum((Counter(filter(None, re.split(r'\W+', v.lower())))
              for x in listDict for v in x.values()), Counter())
print(counts.most_common(5))
[('a', 8), ('and', 5), ('the', 5), ('marketer', 3), ('designer', 3)]
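As a variation on the same idea, and not the answer's own method, the counts can also be accumulated into a single Counter with Counter.update instead of summing many small counters. The sketch below swaps re.split plus filter for re.findall(r'\w+', ...), which collects the word runs directly; the `sample` list is a shortened, hypothetical stand-in for listDict.

```python
from collections import Counter
import re

# Hypothetical shortened sample with the same shape as listDict.
sample = [{'longDescription': 'In the demo, a hip designer'},
          {'longDescription': 'Is going to make customers happy'},
          {'longDescription': 'In the demo, a smiling developer'}]

counts = Counter()
for entry in sample:
    for text in entry.values():
        # findall(r'\w+') returns only runs of word characters,
        # so no empty strings need to be filtered out afterwards.
        counts.update(re.findall(r'\w+', text.lower()))

print(counts.most_common(3))
```

Updating one counter in place avoids creating an intermediate Counter for every addition, which may matter when the list is long.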
Regex details
\W+ # one or more non-word characters (anything other than letters, digits, or underscore)
re.split splits the text on the regex pattern. filter removes the empty strings (thanks to Ajax1234).
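To see why the filter step is needed, here is a small sketch on a single string. The `sample` text is a shortened example taken from the question's data; the variable names are illustrative only.

```python
import re

sample = 'In the demo, a hip designer, a sharply-dressed marketer.'

# Splitting on runs of non-word characters; the trailing '.' leaves
# an empty string as the last element of the result.
pieces = re.split(r'\W+', sample.lower())

# filter(None, ...) drops falsy items, which here means the empty string.
words = list(filter(None, pieces))

print(pieces[-1] == '')  # the empty string produced by the final '.'
print(words)
```

Note that the hyphen in 'sharply-dressed' is also a non-word character, so it yields two separate words.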
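If it is reasonable to expect that every dictionary in the list has the same key (for example 'longDescription', as in the example you gave), then only a few steps are needed. While iterating over each item in the list, you need to clean the string (str.lower()), split the string into words (str.split()), and then add each word to a word-count dictionary. Fortunately, each of these steps can be done with built-in functions in Python.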
from collections import defaultdict
# A defaultdict is nice because if a key is not already defined, the key
# will be added to the dictionary, and the value will go to a default.
# Because we specify the default type to be an integer, that default value
# will be 0.
word_count = defaultdict(int)
for dictionary in listDict:
    clean_str = dictionary['longDescription'].lower()
    words = clean_str.split(' ')
    for word in words:
        word_count[word] += 1
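The loop above leaves the counts in a dictionary, while the question asks for a list sorted by frequency. One possible way to finish, sketched under a few assumptions: `sample` is a shortened, hypothetical stand-in for listDict, and punctuation is stripped with str.strip('.,') so that 'demo,' and 'demo' are counted together (splitting on spaces alone would keep them apart).

```python
from collections import defaultdict

# Shortened stand-in for listDict (assumption: same 'longDescription' key).
sample = [{'longDescription': 'In the demo, a hip designer'},
          {'longDescription': 'In the demo, a smiling developer'}]

word_count = defaultdict(int)
for dictionary in sample:
    for word in dictionary['longDescription'].lower().split():
        # Strip surrounding periods and commas before counting.
        word_count[word.strip('.,')] += 1

# Sort by count, highest first. sorted() is stable, so words with equal
# counts stay in the order they were first seen (dicts preserve insertion
# order in Python 3.7+).
ranked = sorted(word_count.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[:4])
```

Slicing `ranked` gives the top N words, much like Counter.most_common(N) in the other answer.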