Read words from file into dictionary
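For our assignment, my professor wants us to read a text file line by line, then word by word, and build a dictionary that counts how often each word occurs. This is what I have so far: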
    wordcount = {}
    with open('/Users/user/Desktop/Text.txt', 'r', encoding='utf-8') as f:
        for line in f:
            for word in line.split():
                line = line.lower()
                word = word.strip(string.punctuation + string.digits)
                if word:
                    wordcount[word] = line.count(word)
    return wordcount
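What happens is that my dictionary tells me how many times each word appears in one particular line, so I end up with almost nothing but 1s even though some words occur many times across the whole text. How can I make the dictionary count words over the entire text rather than just a single line?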
The problem is that you reset the count every time. The fix is quite simple:
    wordcount = {}
    with open('/Users/user/Desktop/Text.txt', 'r', encoding='utf-8') as f:
        for line in f:
            for word in line.split():
                line = line.lower()
                word = word.strip(string.punctuation + string.digits)
                if word:
                    if word in wordcount:
                        wordcount[word] += line.count(word)
                    else:
                        wordcount[word] = line.count(word)
    return wordcount
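One caveat with the snippet above: line.count(word) counts substring matches, and it is added once per occurrence, so a line such as "the the" would contribute 4 rather than 2. A minimal sketch that adds 1 per token instead is shown here. The function name count_words and the sample lines are assumptions made for illustration; they are not from the original post.

```python
import string

def count_words(lines):
    """Count how often each word occurs across all lines (hypothetical helper)."""
    wordcount = {}
    strip_chars = string.punctuation + string.digits
    for line in lines:
        # Lower-case before splitting so "The" and "the" share one key
        for word in line.lower().split():
            word = word.strip(strip_chars)
            if not word:
                continue  # token was only punctuation or digits
            if word in wordcount:
                wordcount[word] += 1  # seen before: add to the running total
            else:
                wordcount[word] = 1   # first sighting
    return wordcount

sample = ["The cat, the hat.", "the end"]
print(count_words(sample))  # {'the': 3, 'cat': 1, 'hat': 1, 'end': 1}
```

Because the function accepts any iterable of strings, an open file object can be passed in directly, for example count_words(f) inside the with block.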
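The problem is this line:
    wordcount[word] = line.count(word)
Every time that line runs, whatever value wordcount[word] already held is replaced by line.count(word), when what you want is to add to it. Try changing it to:
    wordcount[word] = wordcount.get(word, 0) + line.count(word)
Using wordcount.get(word, 0) instead of plain wordcount[word] on the right-hand side avoids a KeyError the first time a word is seen.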
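Adding to the stored value only works if there is already a value to add to. As a rough sketch of one way to handle that, collections.defaultdict(int) supplies 0 for any key that has not been seen yet. The helper name accumulate_counts is hypothetical, and the example skips punctuation stripping to keep the focus on the accumulation step.

```python
from collections import defaultdict

def accumulate_counts(lines):
    """Accumulate per-word totals over every line (hypothetical helper)."""
    wordcount = defaultdict(int)  # missing keys start at 0
    for line in lines:
        for word in line.lower().split():
            wordcount[word] += 1  # adds to the previous total instead of replacing it
    return dict(wordcount)

print(accumulate_counts(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```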
This is how I would do it:
    import string
    wordcount = {}
    # translation table that deletes every punctuation character (Python 3)
    remove_punct = str.maketrans('', '', string.punctuation)
    with open('test.txt', 'r') as f:
        for line in f:
            line = line.lower()  # I suppose you want boy and Boy to be the same word
            for word in line.split():
                # what if your word has funky punctuation chars next to it?
                word = word.translate(remove_punct)
                if not word:
                    continue  # the token was nothing but punctuation
                # if it's already in the dict, increase the number
                try:
                    wordcount[word] += 1
                # if it's not, this is the first time we are adding it
                except KeyError:
                    wordcount[word] = 1
    print(wordcount)
Good luck!
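In case you'd like to see another approach: it isn't quite the explicit line-by-line, word-by-word loop you asked for, but you should be aware that the collections module can be very handy at times.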
    from collections import Counter
    # instantiate an empty Counter
    c = Counter()
    with open('myfile.txt', 'r') as f:
        for line in f:
            # Do all the cleaning you need here
            c.update(line.lower().split())
    # Get all the statistics you want, for example:
    c.most_common(10)
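The "cleaning" step in the Counter version is left open. One possible way to fill it in, reusing the punctuation and digit stripping from the question, is sketched below. The helper name count_with_counter and the sample lines are assumptions made for illustration only.

```python
import string
from collections import Counter

def count_with_counter(lines):
    """Tally words with Counter after basic cleaning (hypothetical helper)."""
    counts = Counter()
    strip_chars = string.punctuation + string.digits
    for line in lines:
        cleaned = (word.strip(strip_chars) for word in line.lower().split())
        # Skip tokens that were only punctuation or digits
        counts.update(word for word in cleaned if word)
    return counts

counts = count_with_counter(["The cat, the hat.", "the end 42"])
print(counts.most_common(1))  # [('the', 3)]
```

Counter is a dict subclass, so counts['the'] works like an ordinary dictionary lookup, and a word that never appeared returns 0 instead of raising KeyError.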