How to find the most frequent words in alphabetical order?

I am trying to find the most frequent words in a text file and print them in alphabetical order, using the program below.

For example, "that" is the most frequent word in the text file, so it should be printed first: "that #".

The solution needs to follow the same format as the program and the sample output below:

d = dict()

def counter_one():
    d = dict()
    word_file = open('gg.txt')
    for line in word_file:
        word = line.strip().lower()
        d = counter_two(word, d)
    return d

def counter_two(word, d):
    d = dict()
    word_file = open('gg.txt')
    for line in word_file:
        if word not in d:
            d[word] = 1
        else:
            d[word] + 1
    return d

def diction(d):
    for key, val in d.iteritems():
        print key, val

counter_one()
diction(d)

It should run something like this in the shell:

>>>
Words in text: ###
Frequent Words: ###
that 11
the 11
we 10
which 10
>>>

One easy way to get frequency counts is to use the Counter class in the standard-library collections module. You can pass it a list of words and it will count them all, mapping each word to its frequency.

from collections import Counter
frequencies = Counter()
with open('gg.txt') as f:
  for line in f:
    frequencies.update(line.lower().split())

I used the lower() function to avoid counting "the" and "The" separately.

Then you can output them in frequency order with frequencies.most_common(), or with frequencies.most_common(n) if you only want the top n.
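As a small illustration of most_common, here is a sketch that counts words from an in-memory string instead of gg.txt. The sample text and the variable names are made up for the example.

```python
from collections import Counter

# Hypothetical sample text standing in for the contents of gg.txt.
sample = "the cat and the dog and the bird"

frequencies = Counter(sample.lower().split())

# All (word, count) pairs, highest count first.
everything = frequencies.most_common()

# Only the two most frequent words.
top_two = frequencies.most_common(2)
print(top_two)  # [('the', 3), ('and', 2)]
```

Note that most_common orders by count only; it does not put words with equal counts into alphabetical order.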

If you want to sort the resulting list by frequency (highest first) and then alphabetically among words with the same frequency, you can use the sorted builtin function with a key argument of lambda item: (-item[1], item[0]). Negating the count sorts it in descending order while the word stays in ascending order. So, your final code to do this would be:

from collections import Counter
frequencies = Counter()
with open('gg.txt') as f:
  for line in f:
    frequencies.update(line.lower().split())
# Sort by count (descending), then by word (ascending).
most_frequent = sorted(frequencies.most_common(4), key=lambda item: (-item[1], item[0]))
for (word, count) in most_frequent:
  # The parenthesized form prints the same text on Python 2 and 3.
  print('%s %d' % (word, count))

Then the output will be:

that 11
the 11
we 10
which 10
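One caveat with calling most_common(4) before sorting: when several words tie at the cutoff, which of them makes the top four is decided before the alphabetical sort runs. A minimal sketch of one way around this, assuming it is acceptable to sort every word, is to sort all the (word, count) pairs first and slice afterwards. The function name top_words and the sample string are made up for illustration.

```python
from collections import Counter

def top_words(text, n):
    """Return the n most frequent words, ties broken alphabetically."""
    counts = Counter(text.lower().split())
    # Sort every (word, count) pair: count descending, then word ascending.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # Slice only after sorting so ties at the cutoff are resolved alphabetically.
    return ranked[:n]

# Hypothetical sample: "b" and "a" both appear twice, "c" once.
print(top_words("b b a a c", 1))  # [('a', 2)]
```

The trade-off is that the whole vocabulary gets sorted, which is usually fine for a single text file.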

You can do this more simply using the Counter class from collections. First, count the words, then sort by the number of appearances of each word and by the word itself:

from collections import Counter

# Load the file and extract the words
with open("gettysburg_address.txt") as f:
    words = [w for line in f for w in line.split()]
print('Words in text: %d' % len(words))

# Use Counter to get the counts
counts = Counter(words)

# Sort the (word, count) tuples by count (descending), then by the
# word itself (ascending), and output the k most frequent
k = 4
print('Frequent words:')
for w, c in sorted(counts.most_common(k), key=lambda wc: (-wc[1], wc[0])):
    print('%s %s' % (w, c))

Output:

Words in text: 278
Frequent words:
that 13
the 9
to 8
we 8

Why do you keep re-opening the file and creating new dictionaries? What does your code need to do?

create a new empty dictionary to store words {word: count}
open the file
work through each line (word) in the file
    if the word is already in the dictionary
        increment count by one
    if not
        add to dictionary with count 1
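The steps above could be sketched in Python roughly as follows. This assumes one word per line, as in the question's code; the name count_words, the sample list, and the choice to skip blank lines are assumptions made for the example.

```python
def count_words(lines):
    """Build a {word: count} dictionary from an iterable of lines."""
    counts = {}  # one dictionary, created once
    for line in lines:
        word = line.strip().lower()  # assumes one word per line
        if not word:
            continue  # skip blank lines (an added assumption)
        if word in counts:
            counts[word] += 1  # already seen: increment the count
        else:
            counts[word] = 1   # first occurrence
    return counts

# With a real file, opened only once:
# with open('gg.txt') as word_file:
#     d = count_words(word_file)
d = count_words(["that\n", "The\n", "that\n", "the\n", "we\n"])
print(d)
```

Taking the lines as an argument keeps the counting logic separate from the file handling, so the file is opened in one place only.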

Then you can easily get the number of distinct words

len(dictionary)

and the n most common words with their counts

sorted(dictionary.items(), key=lambda x: x[1], reverse=True)[:n]
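Note that len(dictionary) counts each word once; if the total number of words in the text is wanted instead, the counts can be summed. A short sketch, assuming dictionary is a {word: count} mapping; the function name summarize and the sample counts are hypothetical.

```python
def summarize(dictionary, n):
    """Return (total words, distinct words, top n pairs) for a {word: count} dict."""
    total = sum(dictionary.values())  # counts every occurrence
    distinct = len(dictionary)        # counts each word once
    top = sorted(dictionary.items(), key=lambda x: x[1], reverse=True)[:n]
    return total, distinct, top

# Hypothetical counts, not taken from gg.txt.
sample = {'that': 3, 'the': 2, 'we': 1}
print(summarize(sample, 2))  # (6, 3, [('that', 3), ('the', 2)])
```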
