
How to extract the 10 most frequent and 10 least frequent words in Python?

After running a few lines of code ending with the line vocabulary, I get an output. It gives me 46132 different words and shows how many times each word appears in the documents.

I attached a screenshot of the output below. I'm not sure what type vocabulary is. I need to extract the 10 most frequent and 10 least frequent words that appear in the documents. I'm not sure how to do that, probably because I don't know whether the output is a str or a tuple.

[enter image description here]

Can I just use max(vocabulary) to get the most frequent word in the documents? Or call sorted(vocabulary) and take the first 10 and last 10 entries as the 10 most frequent and 10 least frequent words?

Getting the k most common words is trivial using the collections.Counter class:

>>> vocabulary = { 'apple': 7, 'ball': 1, 'car': 3, 'dog': 6, 'elf': 2 }
>>> from collections import Counter
>>> vocabulary = Counter(vocabulary)
>>> vocabulary.most_common(2)
[('apple', 7), ('dog', 6)]
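On the question of whether max(vocabulary) or sorted(vocabulary) would work on their own: probably not, if vocabulary is a dict (or Counter) mapping each word to its count, which is an assumption here since the screenshot is not available. Iterating a dict yields its keys, so both functions compare the words alphabetically rather than by count. A minimal sketch, using the same made-up sample data as above:

```python
# Hypothetical sample data standing in for the real 46132-word vocabulary.
vocabulary = {'apple': 7, 'ball': 1, 'car': 3, 'dog': 6, 'elf': 2}

# Iterating a dict yields its keys, so these compare the words themselves.
alphabetically_last = max(vocabulary)      # 'elf', not the most frequent word
alphabetical_order = sorted(vocabulary)    # words in A-Z order, counts ignored

# Passing key=vocabulary.get makes the comparison use the counts instead.
most_frequent = max(vocabulary, key=vocabulary.get)    # 'apple'
least_frequent = min(vocabulary, key=vocabulary.get)   # 'ball'
```

This gives only a single word at each end; for the top or bottom 10 a sort or Counter is still needed.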

Getting the least common words is a bit trickier. The simplest way is probably to sort the dictionary's key/value pairs by value and take a slice:

>>> sorted(vocabulary.items(), key=lambda x: x[1])[:2]
[('ball', 1), ('elf', 2)]
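Two alternatives to the full sort may be worth knowing. The first is the reverse-slice idiom on Counter.most_common() that the collections documentation lists for the n least common elements; the second swaps in heapq.nsmallest, which avoids sorting the whole vocabulary. This is a sketch on the same made-up sample data, with n chosen as 2 for illustration:

```python
import heapq
from collections import Counter

# Hypothetical sample data; the real vocabulary would have 46132 entries.
vocabulary = Counter({'apple': 7, 'ball': 1, 'car': 3, 'dog': 6, 'elf': 2})
n = 2

# most_common() with no argument returns every pair, most frequent first;
# the reverse slice walks back from the end to pick the n least common.
least_by_slice = vocabulary.most_common()[:-n - 1:-1]

# heapq.nsmallest returns the n smallest items by count without a full sort.
least_by_heap = heapq.nsmallest(n, vocabulary.items(), key=lambda item: item[1])

print(least_by_slice)  # [('ball', 1), ('elf', 2)]
print(least_by_heap)   # [('ball', 1), ('elf', 2)]
```

When several words share the same count, which of them make the cut depends on their original order, so the two approaches can differ on ties.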

Since you need both, you might as well sort once and take two slices; this way you don't need a Counter:

>>> sorted_vocabulary = sorted(vocabulary.items(), key=lambda x: x[1])
>>> most_common = sorted_vocabulary[-2:][::-1]
>>> least_common = sorted_vocabulary[:2] 
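For the 10-and-10 case in the question, the sort-once approach can be wrapped in a small helper. The function name most_and_least_common and its parameters are made up for this sketch, and it assumes vocabulary behaves like a dict of word-to-count pairs:

```python
def most_and_least_common(counts, k=10):
    """Return (k most frequent, k least frequent) lists of (word, count) pairs."""
    # Sort once by count, ascending.
    ordered = sorted(counts.items(), key=lambda item: item[1])
    least = ordered[:k]
    # Reverse first, then slice, so that k=0 yields an empty list.
    most = ordered[::-1][:k]
    return most, least


# Hypothetical sample data; k=2 keeps the output short.
vocabulary = {'apple': 7, 'ball': 1, 'car': 3, 'dog': 6, 'elf': 2}
most, least = most_and_least_common(vocabulary, k=2)
print(most)   # [('apple', 7), ('dog', 6)]
print(least)  # [('ball', 1), ('elf', 2)]
```

If the vocabulary has fewer than 2 * k words, the two lists will overlap. With thousands of words appearing exactly once, the "10 least frequent" are an arbitrary pick among those ties.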
