Read words from file into dictionary

In our assignment, my professor would like us to read a text file line by line, then word by word, and build a dictionary counting how often each word appears. Here's what I have for now:

wordcount = {}
with open('/Users/user/Desktop/Text.txt', 'r', encoding='utf-8') as f:
    for line in f:
        for word in line.split():
            line = line.lower()
            word = word.strip(string.punctuation + string.digits)
            if word:
                wordcount[word] = line.count(word)
    return wordcount

What happens is that my dictionary tells me how many times each word appears in a particular line, leaving me with mostly 1s even though some words show up many times in the entire text. How can I get my dictionary to count words across the entire text, not just a single line?

The problem is that you overwrite the count on every line instead of adding to it. The fix is quite simple:

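import string

# wrapped in a function (the name is illustrative) so that `return` is valid
def count_words(path='/Users/user/Desktop/Text.txt'):
    wordcount = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            # lowercase before splitting so the words themselves are lowercased
            for word in line.lower().split():
                word = word.strip(string.punctuation + string.digits)
                if word:
                    # add 1 per occurrence; line.count(word) would also match
                    # substrings and count repeated words more than once
                    if word in wordcount:
                        wordcount[word] += 1
                    else:
                        wordcount[word] = 1
    return wordcount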

The problem is in this line:

wordcount[word] = line.count(word)

Every time that line executes, whatever value wordcount[word] held is replaced by line.count(word), when you want the new count to be added to it. Because the inner loop already visits each word once, add 1 per visit; dict.get supplies 0 the first time a word is seen, so no KeyError is raised. Try changing it to:

wordcount[word] = wordcount.get(word, 0) + 1
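
A caveat when accumulating with line.count(word) inside the word loop: str.count matches substrings, and the loop already visits every occurrence, so repeated words can be counted more than once. A minimal sketch comparing it with adding 1 per word; the sample line and both function names are made up for illustration:

```python
def count_with_line_count(line):
    """Accumulate line.count(word) for every word in the line."""
    counts = {}
    lowered = line.lower()
    for word in lowered.split():
        counts[word] = counts.get(word, 0) + lowered.count(word)
    return counts


def count_by_one(line):
    """Accumulate 1 for every word in the line."""
    counts = {}
    for word in line.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


sample = "the cat sat there and the dog sat"  # hypothetical sample line
inflated = count_with_line_count(sample)
exact = count_by_one(sample)
print(inflated["the"], exact["the"])  # 6 2 ("there" also contains "the")
print(inflated["sat"], exact["sat"])  # 4 2
```

Here "the" occurs twice as a word, but str.count finds it three times (once inside "there"), and that total is added on each of the two visits.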

This is how I would do it:

import string

wordcount = {}
with open('test.txt', 'r') as f:
    for line in f:
        line = line.lower()  # I suppose you want boy and Boy to be the same word
        for word in line.split():
            # what if your word has funky punctuation chars next to it?
            word = word.translate(str.maketrans("", "", string.punctuation))
            if not word:
                continue  # the token was nothing but punctuation
            # if it's already in the dict, increase the number
            try:
                wordcount[word] += 1
            # if it's not, this is the first time we are adding it
            except KeyError:
                wordcount[word] = 1

print(wordcount)
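
The try/except bookkeeping can also be written with collections.defaultdict, which supplies 0 for a word the first time it is seen. A minimal sketch, assuming Python 3; the function name count_words and the in-memory sample text are illustrative and not part of the answer above:

```python
import io
import string
from collections import defaultdict


def count_words(lines):
    """Count words across all lines of an iterable (e.g. an open file)."""
    wordcount = defaultdict(int)  # missing keys start at 0
    strip_punct = str.maketrans("", "", string.punctuation)
    for line in lines:
        for word in line.lower().split():
            word = word.translate(strip_punct)
            if word:  # skip tokens that were only punctuation
                wordcount[word] += 1
    return dict(wordcount)


# io.StringIO stands in for a real file so the sketch runs anywhere
sample = io.StringIO("The boy saw a dog.\nThe dog saw the Boy!\n-- end --\n")
result = count_words(sample)
print(result["the"], result["boy"], result["dog"])  # 3 2 2
```

Because the function takes any iterable of lines, the same call works with a real file: open it in a with block and pass the file object in.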

Good luck!

In case you want to see another way to do this: it's not exactly line by line and word by word as you requested, but you should be aware of the collections module, which can be very useful.

from collections import Counter

# instantiate an empty counter
c = Counter()
with open('myfile.txt', 'r') as f:
    for line in f:
        # Do all the cleaning you need here
        c.update(line.lower().split())

# Get whatever statistics you want, for example the 10 most common words:
print(c.most_common(10))
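
One way to fill in the cleaning step is to strip punctuation and digits from each token before handing the tokens to Counter.update. A sketch under that assumption; clean_tokens and the sample lines are hypothetical names and data, not part of the answer above:

```python
import string
from collections import Counter


def clean_tokens(line):
    """Yield lowercased tokens with surrounding punctuation and digits removed."""
    for token in line.lower().split():
        token = token.strip(string.punctuation + string.digits)
        if token:  # drop tokens that were only punctuation or digits
            yield token


c = Counter()
# a list of strings stands in for the open file object
for line in ["Red fish, blue fish.", "One fish; two fish!", "Blue 42"]:
    c.update(clean_tokens(line))

top = c.most_common(2)
print(top)  # [('fish', 4), ('blue', 2)]
```

Counter.update accepts any iterable of hashable items, so a generator of cleaned tokens can be passed in directly without building a list first.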


 