I am trying to print the most frequent words in a text file, most frequent first and with ties in alphabetical order, using the program below.
For example, the word "that" is the most frequent word in the text file, so it should be printed first: "that #".
The solution needs to follow the same format as this program and produce the output shown below:
d = dict()

def counter_one():
    d = dict()
    word_file = open('gg.txt')
    for line in word_file:
        word = line.strip().lower()
        d = counter_two(word, d)
    return d

def counter_two(word, d):
    d = dict()
    word_file = open('gg.txt')
    for line in word_file:
        if word not in d:
            d[word] = 1
        else:
            d[word] + 1
    return d

def diction(d):
    for key, val in d.iteritems():
        print key, val

counter_one()
diction(d)
It should run something like this in the shell:
>>>
Words in text: ###
Frequent Words: ###
that 11
the 11
we 10
which 10
>>>
One easy way to get frequency counts is to use the Counter class in the standard-library collections module. You pass it a list of words and it counts them, mapping each word to its frequency.
from collections import Counter

frequencies = Counter()
with open('gg.txt') as f:
    for line in f:
        frequencies.update(line.lower().split())
I used the lower() method to avoid counting "the" and "The" separately.
Then you can output them in frequency order with frequencies.most_common(), or frequencies.most_common(n) if you only want the top n.
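As a small illustration of most_common(), here is a sketch that counts an in-memory sentence rather than gg.txt. The sample sentence and the variable names are made up for the example, and the code sticks to syntax that runs on both Python 2.7 and Python 3.

```python
from collections import Counter

# Hypothetical sample text standing in for the contents of gg.txt.
text = "The cat and the hat and the bat"

frequencies = Counter(text.lower().split())

# All (word, count) pairs, highest count first.
all_pairs = frequencies.most_common()

# Only the two most frequent words.
top_two = frequencies.most_common(2)
print(top_two)  # [('the', 3), ('and', 2)]
```

Words with equal counts (cat, hat and bat here) are not guaranteed to come back in alphabetical order, so sort explicitly if the order of ties matters.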
If you want to sort the resulting list by frequency (highest first) and then alphabetically for words with the same frequency, you can use the sorted builtin function with a key argument of lambda item: (-item[1], item[0]). Negating the count puts the largest counts first, while tied words stay in ascending alphabetical order. So, your final code to do this would be:
from collections import Counter

frequencies = Counter()
with open('gg.txt') as f:
    for line in f:
        frequencies.update(line.lower().split())

# Highest count first; ties broken alphabetically by word.
most_frequent = sorted(frequencies.most_common(4), key=lambda item: (-item[1], item[0]))
for word, count in most_frequent:
    print('%s %d' % (word, count))
Then the output will be
that 11
the 11
we 10
which 10
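To see how a key of (negated count, word) orders ties, here is a minimal sketch on a hand-written list of pairs. The list mirrors the counts in the output above, and the helper name by_count_then_word is made up for illustration.

```python
# Hand-written (word, count) pairs in a deliberately scrambled order.
pairs = [('which', 10), ('the', 11), ('we', 10), ('that', 11)]

def by_count_then_word(item):
    """Sort key: highest count first, then the word in ascending order."""
    word, count = item
    return (-count, word)

ordered = sorted(pairs, key=by_count_then_word)
for word, count in ordered:
    print('%s %d' % (word, count))
```

Because tuples compare element by element, the word is only consulted when two negated counts are equal.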
You can do this more simply using the Counter class from the collections module. First, count the words, then sort by the number of appearances of each word and by the word itself:
from collections import Counter

# Load the file and extract the words
with open("gettysburg_address.txt") as f:
    words = [w for line in f for w in line.split()]
print('Words in text: %d' % len(words))

# Use Counter to get the counts
counts = Counter(words)

# Sort the (word, count) tuples by the count, then the word itself,
# and output the k most frequent. Because of reverse=True, words with
# equal counts come out in reverse alphabetical order.
k = 4
print('Frequent words:')
for w, c in sorted(counts.most_common(k), key=lambda wc: (wc[1], wc[0]), reverse=True):
    print('%s %s' % (w, c))
Output:
Words in text: 278
Frequent words:
that 13
the 9
we 8
to 8
Why do you keep re-opening the file and creating new dictionaries? What does your code need to do?
create a new empty dictionary to store words {word: count}
open the file
work through each line (word) in the file
    if the word is already in the dictionary
        increment count by one
    if not
        add to dictionary with count 1
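The steps above might be sketched as follows. The function name count_words is made up for the example and, as in the question, the file is assumed to hold one word per line. The demo writes a tiny temporary file so the sketch does not depend on gg.txt.

```python
import os
import tempfile

def count_words(path):
    """Return a {word: count} dictionary for a file with one word per line."""
    counts = {}                      # new empty dictionary
    with open(path) as word_file:    # open the file once
        for line in word_file:       # work through each line (word)
            word = line.strip().lower()
            if not word:
                continue             # assumption: blank lines are skipped
            if word in counts:
                counts[word] += 1    # already seen: increment by one
            else:
                counts[word] = 1     # first time: add with count 1
    return counts

# Demo on a throwaway file with made-up contents.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as demo_file:
    demo_file.write('that\nthe\nThat\nwe\nthat\n')

demo_counts = count_words(demo_path)
os.remove(demo_path)
print(demo_counts)
```

The dictionary is created once and the file is opened once, which avoids the repeated re-opening in the original code.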
Then you can easily get the number of distinct words:
    len(dictionary)
and the n most common words with their counts:
    sorted(dictionary.items(), key=lambda x: x[1], reverse=True)[:n]
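For instance, with a small hand-made dictionary (the words and counts below are invented for illustration), the two expressions behave as follows. Note that len() counts the distinct words; summing the values gives the total number of words, which may be what a "Words in text" line is meant to report.

```python
# Invented counts standing in for the dictionary built from the file.
dictionary = {'that': 3, 'the': 2, 'we': 1, 'which': 1}
n = 2

distinct_words = len(dictionary)        # number of different words
total_words = sum(dictionary.values())  # number of words overall

# The n most common (word, count) pairs, highest count first.
top_n = sorted(dictionary.items(), key=lambda x: x[1], reverse=True)[:n]

print(distinct_words)
print(total_words)
print(top_n)
```

This key looks only at the count, so words with the same count appear in no particular alphabetical order.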