
Find the most common words in a list of dictionaries in Python

I would like to know how to get the most common words from a list of dictionaries. An example of the structure is shown below.

listDict = [{'longDescription': 'In the demo, a hip designer, a sharply-dressed marketer, and a smiling, relaxed developer sip lattes and calmly discuss how Flex is going to make customers happy and shorten the workday.'},
{'longDescription': 'In the demo, a hip designer, a sharply-dressed marketer'},
{'longDescription': 'Is going to make customers happy and shorten the workday.'},
{'longDescription': 'In the demo, a hip designer, a sharply-dressed marketer, and a smiling.'}]

The desired result looks something like the following, ordered from the most common word to the least common:

[('word1', 7), 
('word2', 7), 
('word3', 3), 
('word4', 3), 
('word5', 3), 
('word6', 2), 
('word7', 2)]

Here is a fun way to do it: build a Counter for each individual description, then sum the Counters.

from collections import Counter
import re

# Raw string (r'...') so the backslash in \W is not treated as a string escape.
counts = sum((Counter(filter(None, re.split(r'\W+', v.lower())))
                    for x in listDict for v in x.values()), Counter())

print(counts.most_common(5))
[('a', 8), ('and', 5), ('the', 5), ('marketer', 3), ('designer', 3)]

Regex details

\W+   # one or more non-word characters (anything other than letters, digits, and underscore)

re.split splits the text according to the regex pattern. filter removes the empty strings (credit to Ajax1234 for this).
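
To make the individual steps of the one-liner easier to follow, here is a minimal sketch that applies them to a single description from the question. The names sample, pieces, and words are introduced here for illustration only and are not part of the original answer.

```python
import re
from collections import Counter

# Illustrative sample: the third description from the question.
sample = 'Is going to make customers happy and shorten the workday.'

# Split on runs of non-word characters. Because the text ends with '.',
# the split leaves an empty string at the end of the list.
pieces = re.split(r'\W+', sample.lower())

# filter(None, ...) keeps only truthy items, so the empty string is dropped.
words = list(filter(None, pieces))

# Counter tallies each word; Counters can be added together, which is
# what sum(..., Counter()) relies on in the one-liner.
counts = Counter(words) + Counter(['the'])
print(counts['the'])  # 2
```

Without the filter step, the empty string would be counted as a word of its own.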

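If you can reasonably expect every dictionary in the list to have the same key (for example 'longDescription' in the sample you gave), this takes only a few steps. While iterating over each item in the list, you need to clean the string (str.lower()), split the string into words (str.split()), and then add each word to a word-count dictionary. Fortunately, each of these steps can be done with built-in functions in Python.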

from collections import defaultdict

# A defaultdict is nice because if a key is not already defined, the key
# will be added to the dictionary, and the value will go to a default. 
# Because we specify the default type to be an integer, that default value
# will be 0.
word_count = defaultdict(int)
for dictionary in listDict:
    clean_str = dictionary['longDescription'].lower()
    words = clean_str.split(' ')
    for word in words:
        word_count[word] += 1
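
The loop above leaves the counts unsorted and keeps punctuation attached to words (for example 'demo,'). The following is a minimal sketch of one way to finish the job, assuming it is acceptable to strip leading and trailing punctuation with str.strip and string.punctuation. The shortened listDict and the name ranked are illustrative only.

```python
import string
from collections import defaultdict

# Shortened, illustrative version of the data from the question.
listDict = [
    {'longDescription': 'In the demo, a hip designer, a sharply-dressed marketer'},
    {'longDescription': 'Is going to make customers happy and shorten the workday.'},
]

word_count = defaultdict(int)
for dictionary in listDict:
    for word in dictionary['longDescription'].lower().split():
        # Strip leading/trailing punctuation so 'demo,' is counted as 'demo'.
        word = word.strip(string.punctuation)
        if word:
            word_count[word] += 1

# Sort by count, highest first, to get a list of (word, count) tuples.
ranked = sorted(word_count.items(), key=lambda item: item[1], reverse=True)
print(ranked[:2])  # [('the', 2), ('a', 2)]
```

Note that this variant keeps 'sharply-dressed' as a single word, whereas the regex approach splits it in two; which behaviour is preferable depends on the use case.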

