Find the most common words in a list of dictionaries in Python

I want to know how to get the most common words from a list of dictionaries. An example of the structure is shown below.

listDict = [{'longDescription': 'In the demo, a hip designer, a sharply-dressed marketer, and a smiling, relaxed developer sip lattes and calmly discuss how Flex is going to make customers happy and shorten the workday.'},
{'longDescription': 'In the demo, a hip designer, a sharply-dressed marketer'},
{'longDescription': 'Is going to make customers happy and shorten the workday.'},
{'longDescription': 'In the demo, a hip designer, a sharply-dressed marketer, and a smiling.'}]

The desired result is something like the following, ordered from most to least common word:

[('word1', 7), 
('word2', 7), 
('word3', 3), 
('word4', 3), 
('word5', 3), 
('word6', 2), 
('word7', 2)]

Here's one approach: build a Counter for each description, then sum the Counters.

from collections import Counter
import re

# Use a raw string for the pattern so that \W is not treated as a string escape.
counts = sum((Counter(filter(None, re.split(r'\W+', v.lower())))
              for x in listDict for v in x.values()), Counter())

print(counts.most_common(5))
[('a', 8), ('and', 5), ('the', 5), ('marketer', 3), ('designer', 3)]
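
A variant of the same idea, as a minimal sketch: instead of summing one Counter per description, a single Counter can be updated in place, which avoids creating a new Counter for every addition. The function name count_words and the sample list_dict are illustrative assumptions, not part of the original answer.

```python
from collections import Counter
import re

def count_words(dicts):
    """Count words across every string value in a list of dictionaries."""
    counts = Counter()
    for d in dicts:
        for text in d.values():
            # re.findall(r'\w+') returns only the word tokens, so there
            # are no empty strings to filter out afterwards.
            counts.update(re.findall(r'\w+', text.lower()))
    return counts

# Hypothetical sample data in the same shape as listDict.
list_dict = [{'longDescription': 'In the demo, a hip designer'},
             {'longDescription': 'A hip designer and a marketer'}]

print(count_words(list_dict).most_common(1))  # [('a', 3)]
```

Note that most_common orders words with equal counts by first appearance, so the order of ties can differ between Python versions.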

Regex Details

\W+   # one or more non-word characters (anything other than letters, digits, and underscore)

re.split splits the text wherever the regex pattern matches. filter(None, ...) removes the empty strings that re.split produces when the text starts or ends with a delimiter (this part thanks to Ajax1234).
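
To see why the filter step is needed, here is a small illustration on a single sentence taken from the example data. The variable names text, parts, and words are only for demonstration.

```python
import re

text = 'shorten the workday.'

# The trailing '.' is a delimiter, so re.split leaves an empty string
# after it.
parts = re.split(r'\W+', text.lower())
print(parts)  # ['shorten', 'the', 'workday', '']

# filter(None, ...) keeps only truthy items, dropping the empty string.
words = list(filter(None, parts))
print(words)  # ['shorten', 'the', 'workday']
```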

If it is reasonable to expect that each dictionary in the list has the same key (i.e. 'longDescription' in the example you give), only a few steps are necessary. While looping through each item in the list, you need to normalize the string (str.lower()), split it into words (str.split()), and then add each word to a word-count dictionary. Fortunately, each of these steps can be done with Python's built-in string methods and the standard library.

from collections import defaultdict

# A defaultdict is convenient here: if a key is not already defined, it is
# added to the dictionary with a default value. Because the default type
# is int, that default value is 0.
word_count = defaultdict(int)
for dictionary in listDict:
    clean_str = dictionary['longDescription'].lower()
    # split() with no argument splits on any run of whitespace.
    # Punctuation stays attached to the words (e.g. 'demo,').
    words = clean_str.split()
    for word in words:
        word_count[word] += 1
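
To get the result in the shape the question asks for, the counts still need to be sorted, and splitting on whitespace alone leaves punctuation attached (so 'demo,' and 'demo' would be counted separately). Below is a minimal sketch that handles both. The names list_dict, word_count, and ranked are illustrative, and stripping with string.punctuation is one possible cleaning choice rather than part of the original answer.

```python
from collections import defaultdict
import string

# Hypothetical sample data in the same shape as listDict.
list_dict = [{'longDescription': 'In the demo, a hip designer.'},
             {'longDescription': 'A demo for a designer'}]

word_count = defaultdict(int)
for dictionary in list_dict:
    for word in dictionary['longDescription'].lower().split():
        # Strip leading/trailing punctuation so 'demo,' counts as 'demo'.
        word = word.strip(string.punctuation)
        if word:
            word_count[word] += 1

# Sort (word, count) pairs by count, highest first. sorted() is stable,
# so words with equal counts keep their first-seen order.
ranked = sorted(word_count.items(), key=lambda item: item[1], reverse=True)
print(ranked[:3])  # [('a', 3), ('demo', 2), ('designer', 2)]
```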
