Read words from file into dictionary

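For our homework, my professor wants us to read a text file line by line, then word by word, and build a dictionary that counts how often each word occurs. This is what I have so far: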

wordcount = {}
with open('/Users/user/Desktop/Text.txt', 'r', encoding='utf-8') as f:
    for line in f:
        for word in line.split():
            line = line.lower()
            word = word.strip(string.punctuation + string.digits)
            if word:
                wordcount[word] = line.count(word)
    return wordcount

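What happens is that my dictionary tells me how many times each word appears in one particular line, so I end up with almost nothing but 1s even though some words occur many times across the whole text. How can I make the dictionary count words over the whole text instead of just a single line?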

The problem is that you reset the count every time. The fix is quite simple:

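import string

def count_words(path='/Users/user/Desktop/Text.txt'):
    wordcount = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            # Lowercase before splitting so 'Boy' and 'boy' share one key
            for word in line.lower().split():
                word = word.strip(string.punctuation + string.digits)
                if word:
                    # Add 1 per occurrence: the loop already visits every word,
                    # and line.count(word) would also match substrings
                    if word in wordcount:
                        wordcount[word] += 1
                    else:
                        wordcount[word] = 1
    return wordcount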
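
The membership check above can also be folded away. The following is only a sketch, not part of the original answer: it assumes the lines are passed in as any iterable of strings (an open file object would work too), and the name `count_words` and the sample sentences are made up for illustration. `collections.defaultdict(int)` starts every missing key at 0, so `+= 1` works on the first occurrence as well.

```python
import string
from collections import defaultdict

def count_words(lines):
    """Count words across all lines (hypothetical helper for illustration)."""
    wordcount = defaultdict(int)  # missing keys start at 0
    for line in lines:
        for word in line.lower().split():
            word = word.strip(string.punctuation + string.digits)
            if word:
                wordcount[word] += 1
    return dict(wordcount)  # plain dict for printing and comparison

print(count_words(["One fish, two fish.", "Red fish!"]))
# {'one': 1, 'fish': 3, 'two': 1, 'red': 1}
```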

The problem is in this line:

wordcount[word] = line.count(word)

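Every time that line runs, whatever value wordcount[word] held is replaced by line.count(word), when what you want is to add to it. Try changing it to:

wordcount[word] = wordcount.get(word, 0) + 1

Using .get(word, 0) avoids a KeyError the first time a word is seen. Adding 1 instead of line.count(word) matters as well: the inner loop already visits every word in the line once, and line.count(word) also matches substrings of longer words.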
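
To see why accumulating `line.count(word)` inside the per-word loop can overcount, here is a small sketch that compares it with adding 1 per occurrence. The function names and the sample sentence are invented for this illustration, and the input is a list of strings rather than a file.

```python
import string

STRIP_CHARS = string.punctuation + string.digits

def count_with_line_count(lines):
    """Accumulate line.count(word) for every word visited (overcounts)."""
    counts = {}
    for line in lines:
        line = line.lower()
        for word in line.split():
            word = word.strip(STRIP_CHARS)
            if word:
                counts[word] = counts.get(word, 0) + line.count(word)
    return counts

def count_per_occurrence(lines):
    """Add exactly 1 each time a word is visited."""
    counts = {}
    for line in lines:
        for word in line.lower().split():
            word = word.strip(STRIP_CHARS)
            if word:
                counts[word] = counts.get(word, 0) + 1
    return counts

sample = ["The cat and the other cat."]
print(count_with_line_count(sample)["the"])  # 6
print(count_per_occurrence(sample)["the"])   # 2
```

In the first version "the" is visited twice, and each visit adds 3, because `str.count` also finds "the" inside "other".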

This is how I would do it:

import string

# Translation table that deletes every punctuation character (Python 3)
remove_punct = str.maketrans("", "", string.punctuation)

wordcount = {}
with open('test.txt', 'r') as f:
    for line in f:
        line = line.lower()  # I suppose you want boy and Boy to be the same word
        for word in line.split():
            # what if your word has funky punctuation chars next to it?
            word = word.translate(remove_punct)
            if not word:
                continue  # the token was nothing but punctuation
            # if it's already in the dict, increase the number
            try:
                wordcount[word] += 1
            # if it's not, this is the first time we are adding it
            except KeyError:
                wordcount[word] = 1

print(wordcount)

Good luck!

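Here is another approach, in case you want to see one. It does not follow the line-by-line, word-by-word structure you asked for exactly, but you should be aware that the collections module can be very useful at times.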

from collections import Counter

# Create an empty Counter
c = Counter()
with open('myfile.txt', 'r') as f:
    for line in f:
        # Do all the cleaning you need here
        c.update(line.lower().split())

# Get whatever statistics you want, for example the ten most common words:
print(c.most_common(10))
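
As a sketch of what the "cleaning" step might look like with `Counter`, the version below strips punctuation and digits from the ends of each token, the same rule the question uses. The helper name `count_words` and the sample lines are assumptions made for this example, and the lines come from a list instead of a file so the snippet runs on its own.

```python
import string
from collections import Counter

STRIP_CHARS = string.punctuation + string.digits

def count_words(lines):
    """Return a Counter of cleaned, lowercased words (illustrative helper)."""
    counter = Counter()
    for line in lines:
        cleaned = (w.strip(STRIP_CHARS) for w in line.lower().split())
        # Skip tokens that were nothing but punctuation or digits
        counter.update(w for w in cleaned if w)
    return counter

sample_lines = ["The boy saw the dog.", "The dog, in turn, saw the boy!"]
counts = count_words(sample_lines)
print(counts.most_common(1))  # [('the', 4)]
```

Passing an open file object as `lines` should work the same way, since iterating over a file yields one line at a time.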

