
Read words from file into dictionary

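So for our assignment, my professor wants us to read a text file line by line, then word by word, and build a dictionary that counts how often each word occurs. This is what I have right now: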

wordcount = {}
with open('/Users/user/Desktop/Text.txt', 'r', encoding='utf-8') as f:
    for line in f:
        for word in line.split():
            line = line.lower()
            word = word.strip(string.punctuation + string.digits)
            if word:
                wordcount[word] = line.count(word)
    return wordcount

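What happens is that my dictionary tells me how many times each word occurs in one particular line, so I end up with almost nothing but 1s even though some words occur many times across the whole text. How can I get the dictionary to count words over the entire text instead of just one line?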

The problem is that you overwrite the count every time you see the word. The fix is simple:

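import string

# count_words is a placeholder name; the original post omitted the enclosing def
def count_words(path):
    wordcount = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            # Lowercase before splitting so "Boy" and "boy" share one key
            for word in line.lower().split():
                word = word.strip(string.punctuation + string.digits)
                if word:
                    if word in wordcount:
                        # Each pass of the loop is one occurrence, so add 1
                        wordcount[word] += 1
                    else:
                        wordcount[word] = 1
    return wordcount

wordcount = count_words('/Users/user/Desktop/Text.txt')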

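The problem is in this line:

wordcount[word] = line.count(word)

Every time that line runs, whatever value wordcount[word] held is replaced by line.count(word), when what you want is to add to it. The loop already visits every occurrence of a word, so adding 1 per visit is enough (adding line.count(word) each time would count a repeated word more than once), and .get with a default of 0 avoids a KeyError the first time a word is seen. Try changing it to:

wordcount[word] = wordcount.get(word, 0) + 1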
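
To make the difference between overwriting and accumulating concrete, here is a minimal sketch that runs on an in-memory list of lines rather than a file. The sample text and the function names count_overwrite and count_accumulate are made up for illustration. The accumulating version adds 1 per occurrence instead of line.count(word), because str.count also matches substrings and would be applied once for every repeat of the word.

```python
import string

STRIP_CHARS = string.punctuation + string.digits

def count_overwrite(lines):
    """Mimics the code in the question: each assignment replaces the old count."""
    wordcount = {}
    for line in lines:
        line = line.lower()
        for word in line.split():
            word = word.strip(STRIP_CHARS)
            if word:
                wordcount[word] = line.count(word)
    return wordcount

def count_accumulate(lines):
    """Adds one per occurrence, so counts carry over from line to line."""
    wordcount = {}
    for line in lines:
        for word in line.lower().split():
            word = word.strip(STRIP_CHARS)
            if word:
                wordcount[word] = wordcount.get(word, 0) + 1
    return wordcount

# Hypothetical sample text: "the" and "sat" each appear once per line
sample = ["The cat sat.", "The dog sat, too."]
print(count_overwrite(sample))   # per-line counts only
print(count_accumulate(sample))  # totals over all lines
```

With this sample, the first function reports 1 for "the" while the second reports 2, which is the symptom described in the question.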

This is what I would do:

import string

wordcount = {}
with open('test.txt', 'r') as f:
    for line in f:
        line = line.lower()  # I suppose you want boy and Boy to be the same word
        for word in line.split():
            # What if your word has funky punctuation chars next to it?
            # Python 3: str.maketrans with a third argument builds a deletion table
            word = word.translate(str.maketrans("", "", string.punctuation))
            if not word:
                continue  # the token was nothing but punctuation
            # If it's already in the dict, increase the number
            try:
                wordcount[word] += 1
            # If it's not, this is the first time we are adding it
            except KeyError:
                wordcount[word] = 1

print(wordcount)
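
One thing worth knowing when choosing between the two cleaning approaches in this thread: strip only removes punctuation from the ends of a word, while translate removes it everywhere. The sketch below, written for Python 3, shows the difference on a few made-up tokens; the helper names strip_edges and remove_all are hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character
REMOVE_PUNCT = str.maketrans("", "", string.punctuation)

def strip_edges(word):
    """Remove punctuation only from the start and end of the word."""
    return word.strip(string.punctuation)

def remove_all(word):
    """Remove punctuation anywhere in the word, including the middle."""
    return word.translate(REMOVE_PUNCT)

# Made-up tokens: quoted word, contraction, hyphenated word
for raw in ['"hello,"', "don't", "well-known"]:
    print(repr(raw), repr(strip_edges(raw)), repr(remove_all(raw)))
```

Which behaviour you want depends on the assignment: with translate, "don't" and "dont" end up under the same key, whereas strip keeps the apostrophe.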

Good luck!

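In case you want to see another way of doing this: it is not quite the line-by-line, word-by-word approach you asked for, but you should be aware of the collections module, which can be very useful at times.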

from collections import Counter

# Instantiate a Counter
c = Counter()
with open('myfile.txt', 'r') as f:
    for line in f:
        # Do all the cleaning you need here
        c.update(line.lower().split())

# Get whatever statistics you want, for example the ten most common words:
print(c.most_common(10))
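
As a rough sketch of what the "cleaning" step could look like with Counter, here is a version that strips punctuation and digits from each word, as the code in the question does, before updating the counter. It works on an in-memory list of lines so it can be run without a file; the function name count_words and the sample text are assumptions for illustration.

```python
import string
from collections import Counter

STRIP_CHARS = string.punctuation + string.digits

def count_words(lines):
    """Count cleaned, lowercased words across all lines."""
    counts = Counter()
    for line in lines:
        cleaned = (w.strip(STRIP_CHARS) for w in line.lower().split())
        # Skip tokens that were nothing but punctuation or digits
        counts.update(w for w in cleaned if w)
    return counts

# Hypothetical sample text
sample = ["The cat sat.", "The dog sat, too."]
counts = count_words(sample)
print(counts.most_common(2))
```

Since a file object is also an iterable of lines, the same function should accept an open file handle in place of the list.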

