Using a Python dictionary to count the frequency of words, excluding a set of "stop words" that will be read from a second file
Read words from file into dictionary
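So for our assignment, my professor wants us to read a text file line by line, then word by word, and build a dictionary that counts how often each word appears. This is what I have so far: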
wordcount = {}
with open('/Users/user/Desktop/Text.txt', 'r', encoding='utf-8') as f:
    for line in f:
        for word in line.split():
            line = line.lower()
            word = word.strip(string.punctuation + string.digits)
            if word:
                wordcount[word] = line.count(word)
return wordcount
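What happens is that my dictionary tells me how many times each word occurs in one particular line, so I end up with almost nothing but 1s even for words that appear many times in the whole text. How can I make the dictionary count words across the entire text instead of just one line?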
The problem is that you reset the count every time. The fix is simple:
wordcount = {}
with open('/Users/user/Desktop/Text.txt', 'r', encoding='utf-8') as f:
    for line in f:
        for word in line.split():
            line = line.lower()
            word = word.strip(string.punctuation + string.digits)
            if word:
                if word in wordcount:
                    wordcount[word] += line.count(word)
                else:
                    wordcount[word] = line.count(word)
return wordcount
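One thing to watch with this fix: `line.count(word)` counts substring matches over the whole line. A word that appears twice in a line gets that count added on each of its two visits, and "the" also matches inside "there". A minimal sketch that adds one per token instead, run on in-memory sample lines rather than the file (the name `count_words` and the sample text are illustrative, not from the original post):

```python
import string

def count_words(lines):
    """Count each token once per occurrence across all lines."""
    wordcount = {}
    for line in lines:
        # Lowercase before splitting so "The" and "the" share one key
        for word in line.lower().split():
            # Trim punctuation and digits from both ends of the token
            word = word.strip(string.punctuation + string.digits)
            if word:
                # Add 1 per token; .get supplies 0 the first time a word is seen
                wordcount[word] = wordcount.get(word, 0) + 1
    return wordcount

sample = ["The cat saw the other cat.", "There is the cat!"]
counts = count_words(sample)
print(counts["the"], counts["cat"])  # prints: 3 3
```

Passing an open file object as `lines` should work the same way, since iterating over a file yields one line at a time.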
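The problem is in this line:
wordcount[word] = line.count(word)
Every time that line runs, whatever value wordcount[word] held is replaced by line.count(word), when what you want is to add to it. Try changing it to:
wordcount[word] = wordcount.get(word, 0) + line.count(word)
The .get(word, 0) default avoids a KeyError the first time a word is seen.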
This is how I would do it:
import string
wordcount = {}
# Translation table that deletes every punctuation character (Python 3)
remove_punct = str.maketrans("", "", string.punctuation)
with open('test.txt', 'r') as f:
    for line in f:
        line = line.lower()  # I suppose you want boy and Boy to be the same word
        for word in line.split():
            # what if your word has funky punctuation chars next to it?
            word = word.translate(remove_punct)
            # if it's already in the dict, increase the number
            try:
                wordcount[word] += 1
            # if it's not, this is the first time we are adding it
            except KeyError:
                wordcount[word] = 1
print(wordcount)
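The try/except pattern can also be written with `collections.defaultdict`, which supplies a starting value of 0 for keys it has not seen yet. A sketch, assuming the text is already available as a list of lines (the helper name `count_with_defaultdict` and the sample lines are hypothetical):

```python
import string
from collections import defaultdict

# Translation table that deletes every ASCII punctuation character
STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def count_with_defaultdict(lines):
    wordcount = defaultdict(int)  # missing keys start at 0
    for line in lines:
        for word in line.lower().split():
            word = word.translate(STRIP_PUNCT)
            if word:  # skip tokens that consisted only of punctuation
                wordcount[word] += 1
    return dict(wordcount)

result = count_with_defaultdict(["Boy meets boy --", "a boy's dog"])
print(result)
```

Note that `translate` removes punctuation inside a word as well as around it, so "boy's" is counted as "boys" here.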
Good luck!
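In case you want to see another approach: it does not go line by line and word by word exactly the way you asked, but you should be aware that the collections module can be very useful at times.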
from collections import Counter
# instantiate a Counter
c = Counter()
with open('myfile.txt', 'r') as f:
    for line in f:
        # Do all the cleaning you need here
        c.update(line.lower().split())
# Get all the statistics you want, for example:
c.most_common(10)
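The "cleaning" step left as a comment could, for instance, strip punctuation and drop the stop words that the title mentions. A rough sketch, assuming the stop words have already been loaded into a set (the title reads them from a second file; the set, the function name, and the sample lines here are made up for illustration):

```python
import string
from collections import Counter

def count_without_stop_words(lines, stop_words):
    """Count words across all lines, skipping anything in stop_words."""
    counter = Counter()
    for line in lines:
        # Lowercase, split on whitespace, trim punctuation from token edges
        words = (w.strip(string.punctuation) for w in line.lower().split())
        # Keep non-empty tokens that are not stop words
        counter.update(w for w in words if w and w not in stop_words)
    return counter

stop_words = {"the", "a", "is"}
lines = ["The dog is a good dog.", "A cat is the boss!"]
c = count_without_stop_words(lines, stop_words)
print(c.most_common(1))  # prints: [('dog', 2)]
```

Using a set for the stop words keeps each membership check cheap, which matters once the text gets long.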