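How to find the frequency of the ten most common words in a file?
I am writing a function in Python that takes the name of a text file (as a string) as input. The function should first determine how many times each word appears in the file. Later, I will make a bar chart showing the frequencies of the ten most common words in the file, with a second bar next to each one whose height is the frequency predicted by Zipf's law. I already have some of the charting code, but I need help finding the most common words in the text file.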
def zipf_graph(text_file):
    import string
    file = open(text_file, encoding='utf8')
    text = file.read()
    file.close()
    # The following removes punctuation and makes the words lowercase
    punc = string.punctuation + '’”—⎬⎪“⎫'
    new_text = text
    for char in punc:
        new_text = new_text.replace(char, '')
    new_text = new_text.lower()
    text_split = new_text.split()
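This is where I am stuck. I am trying to find the most common strings in the list, but I am not sure where to go from here. Below is what I have tried: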
words = text_split
most_common = max(words, key = words.count)
# print(most_common)
I would also like to add the following code, because it was suggested as a help:
# Sorting a list by frequency
# Assumes you have your elements as (word, frequency) tuples
# (Useful for the zipf function)
words = [('the', 1), ('and', 1), ('test',2)]
sorted(words, key = lambda x: x[1], reverse = True)
# "Sorting" a dictionary by frequency
# Assumes you have your elements as word:frequency
# (Useful for the zipf function)
words = dict()
words['the'] = 1
words['and'] = 1
words['test'] = 2
# This returns a list of just the most common words without their frequencies
most_common_words = sorted(words, key = words.get, reverse = True)
# print(most_common_words)
# We can go back to the dictionary to get the frequencies
for word in most_common_words:
    print(word, words[word])
zipf_graph('fortune.txt') #name of the file I chose to use
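The suggested snippets above assume that a word:frequency dictionary already exists. A minimal sketch of how one might build it from a list such as text_split and then keep only the top entries is shown here. The helper names count_words and top_n are hypothetical, and the word list is made up for illustration.

```python
def count_words(words):
    """Return a dict mapping each word to the number of times it appears."""
    counts = {}
    for word in words:
        # dict.get returns 0 the first time a word is seen
        counts[word] = counts.get(word, 0) + 1
    return counts


def top_n(counts, n=10):
    """Return the n (word, frequency) pairs with the highest frequency."""
    ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Made-up word list standing in for text_split
text_split = ["the", "cat", "and", "the", "dog", "and", "the", "bird"]
print(top_n(count_words(text_split), n=2))  # [('the', 3), ('and', 2)]
```

Unlike max(words, key=words.count), which rescans the whole list for every word and returns only a single word, this passes over the list once and keeps every count.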
I suggest you use Counter from collections.
from collections import Counter
text_split = ["a", "b", "c", "a", "c", "d", "a", "d", "b"]
word_and_freq = Counter(text_split)
top = word_and_freq.most_common(2)
print(top)
Conveniently, this returns the result in the format you want.
[('a', 3), ('b', 2)]
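Combined with the file reading and punctuation stripping from the question, the Counter approach might look like the sketch below. The function name top_words and its parameter n are assumptions made for illustration. It also swaps the question's replace loop for str.translate, which deletes all the listed characters in one pass.

```python
import string
from collections import Counter


def top_words(text_file, n=10):
    """Return the n most common words in text_file as (word, count) pairs."""
    with open(text_file, encoding="utf8") as file:
        text = file.read()

    # Same extra punctuation characters as in the question's code
    punc = string.punctuation + "’”—⎬⎪“⎫"
    # str.translate with a deletion table removes every listed character
    cleaned = text.translate(str.maketrans("", "", punc)).lower()

    return Counter(cleaned.split()).most_common(n)


# Example call, assuming fortune.txt exists in the working directory:
# print(top_words("fortune.txt"))
```

Words with equal counts come back in the order they were first encountered, so the tenth entry can depend on where the words appear in the file.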
You can use the nltk library:
import nltk
words = ['words', 'in', 'the', 'file']
fd = nltk.FreqDist(words)
fd.most_common(10)
This gives the values in the following format:
[('file', 1), ('words', 1), ('in', 1), ('the', 1)]
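For the second bar in the chart the question describes, a list of (word, count) pairs like the one above can be paired with Zipf's law predictions. This is only a sketch under one assumption: it uses the simplest form of the law, where the predicted count at rank r is the count of the top-ranked word divided by r. The function name zipf_predictions is hypothetical and the counts are made up.

```python
def zipf_predictions(top):
    """Pair each observed count with the count Zipf's law predicts for its rank.

    top is a list of (word, count) pairs sorted by descending count.
    Assumption: predicted count at rank r is (count at rank 1) / r.
    """
    if not top:
        return []
    first_count = top[0][1]
    return [
        (word, count, first_count / rank)
        for rank, (word, count) in enumerate(top, start=1)
    ]


# Made-up counts for illustration only
top = [("the", 120), ("of", 70), ("and", 45)]
for word, observed, predicted in zipf_predictions(top):
    print(f"{word}: observed={observed}, predicted={predicted:.1f}")
```

The observed and predicted values in each tuple can then be passed to whatever plotting code draws the paired bars.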