Sort a list and get the most frequent words

I am new to Python and am trying to sort a list of words and get the 3 most frequent ones. This is what I have so far:

from collections import Counter

reader = open("longtext.txt",'r')
data = reader.read()
reader.close()
words = data.split() # Into a list
unique = sorted(set(words)) # Remove duplicate words and sort
for word in unique:
    print '%s: %s' % (word, words.count(word)) # words.count counts how often the word occurs.

This is my output. How can I sort by frequency and list only the first, second and third most frequent words?

2: 2
3.: 1
3?: 1
New: 1
Python: 5
Read: 1
and: 1
between: 1
choosing: 1
or: 2
to: 1    

You can use collections.Counter's most_common method, like this:

from collections import Counter
with open("longtext.txt", "r") as reader:
    # Count words, not whole lines: split each line on whitespace.
    c = Counter(word for line in reader for word in line.split())
print c.most_common(3)

Quoting the example from the official documentation:

>>> Counter('abracadabra').most_common(3)
[('a', 5), ('r', 2), ('b', 2)]

If you want to print them as shown in the question, you can simply iterate over the most common elements and print them like this:

for word, count in c.most_common(3):
    print "{}: {}".format(word, count)
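For readers on Python 3, here is a minimal sketch of the same idea using an in-memory string instead of a file. The sample `text` is made up and only stands in for the contents of longtext.txt, and the names `counts` and `top_three` are just for illustration.

```python
from collections import Counter

# Hypothetical sample text standing in for the contents of longtext.txt.
text = "Python or Python and Python or and Python to Python or"

# Split on whitespace and count every word in a single pass.
counts = Counter(text.split())

# most_common(3) returns (word, count) pairs, highest count first.
top_three = counts.most_common(3)
for word, count in top_three:
    print("{}: {}".format(word, count))
# Prints:
# Python: 5
# or: 3
# and: 2
```

The sample was chosen so that the three highest counts are all different; when counts tie, the order of the tied words is not something to rely on.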

Note: the Counter approach is better than the sorting approach because counting runs in O(N) time for N words and most_common(3) only has to select the top three entries, whereas sorting all the counts takes O(N * log N) in the worst case.

most_common is the Pythonic way, but as an alternative you can use sorted:

>>> d = {'2': 2, '3.': 1, '3?': 1, 'New': 1, 'Python': 5, 'Read': 1, 'and': 1, 'between': 1, 'choosing': 1, 'or': 2, 'to': 1}
>>> print sorted(d.items(), key=lambda x: x[1])[-3:]
[('2', 2), ('or', 2), ('Python', 5)]
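One thing to watch with the sorted approach is that words with equal counts come out in whatever order the dict happens to hold them. A possible variant, written for Python 3 and assuming you want ties broken alphabetically, sorts on a `(-count, word)` key so the result is highest-first and deterministic. The name `top_three` is only illustrative.

```python
d = {'2': 2, '3.': 1, '3?': 1, 'New': 1, 'Python': 5, 'Read': 1,
     'and': 1, 'between': 1, 'choosing': 1, 'or': 2, 'to': 1}

# Negating the count sorts the highest count first;
# the word itself breaks ties in ascending string order.
top_three = sorted(d.items(), key=lambda item: (-item[1], item[0]))[:3]
print(top_three)
# [('Python', 5), ('2', 2), ('or', 2)]
```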

Or use heapq.nlargest. Note that nlargest() is most appropriate when you want a relatively small number of items compared with the size of the collection:

import heapq
print heapq.nlargest(3, d.items(), key=lambda x: x[1])
# [('Python', 5), ('2', 2), ('or', 2)]
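Putting the counting and the selection together, a small helper might look like the sketch below (Python 3). The function name `top_words` and the sample sentence are hypothetical, chosen only to illustrate the idea.

```python
import heapq
from collections import Counter
from operator import itemgetter


def top_words(text, n=3):
    """Return the n most frequent whitespace-separated words in text."""
    counts = Counter(text.split())
    # nlargest selects the n best (word, count) pairs by count
    # without producing a fully sorted list of every word.
    return heapq.nlargest(n, counts.items(), key=itemgetter(1))


# Made-up sample text; in the question this would come from longtext.txt.
sample = "to be or not to be that is the question to be to or"
for word, count in top_words(sample):
    print("{}: {}".format(word, count))
# Prints:
# to: 4
# be: 3
# or: 2
```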

Another approach, more Pythonic!

This is another approach that uses neither Counter nor the count method. I hope it opens up more ideas.

#reader = open("longtext.txt",'r')
#data = reader.read()
#reader.close()
data  = 'aa sfds fsd f sd aa dfdsa dfdsa dfdsa sd sd sds ds dsd sdds sds sd sd sd sd sds sd sds'
words = data.split()
word_dic = {}
for word in words:
    try:
        word_dic[word] = word_dic[word]+1
    except KeyError:
        word_dic[word] = 1
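# Sort (count, word) pairs in ascending order and keep the last three,
# i.e. the three highest counts.
print sorted([(value, key) for (key, value) in word_dic.items()])[-3:]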
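If the try/except bookkeeping feels heavy, the same manual counting can be sketched with collections.defaultdict, which supplies a starting value of 0 for unseen words. This is a swapped-in variant in Python 3 syntax, not the code above; the names `word_counts` and `top_three` are illustrative.

```python
from collections import defaultdict

data = 'aa sfds fsd f sd aa dfdsa dfdsa dfdsa sd sd sds ds dsd sdds sds sd sd sd sd sds sd sds'

# defaultdict(int) creates missing keys with the value 0,
# so no KeyError handling is needed.
word_counts = defaultdict(int)
for word in data.split():
    word_counts[word] += 1

# Sort by count, highest first, and keep the first three pairs.
top_three = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)[:3]
print(top_three)
# [('sd', 8), ('sds', 4), ('dfdsa', 3)]
```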
