
How to count words in text and append in dictionary?

I'm trying to build a dictionary of word frequencies for a text, but for some reason extra characters are printed (I'm not sure whether the problem is my text or my code), and it doesn't print the lines or words that contain the invalid symbols! This is the code I have:

 def parse_documentation(filename):
    filename=open(filename, "r") 
    lines = filename.read(); 
    invalidsymbols=["`","~","!", "@","#","$"]
    for line in lines: 
        for x in invalidsymbols:
            if x in line: 
                print(line) 
                print(x) 
                print(line.replace(x, "")) 
                freq={}
            for word in line:
                count=counter(word)
        freq[word]=count
    return freq

Your code has several flaws. I will not solve all of them, but I will point you in the right direction.

Firstly, read() reads the whole file as a single string, so looping over the result yields one character at a time. I don't think that's your intention here. Use readlines() instead to get all lines in the file as a list.

def parse_documentation(filename):
    with open(filename, "r") as infile:
        lines = infile.readlines()  # returns a list of all lines in file
    invalidsymbols = ["`", "~", "!", "@", "#", "$"]
    freq = {}  # declare this OUTSIDE of your loop.
    for line in lines:
        for letter in line:  # iterates over the original string, so rebinding line below is safe
            if letter in invalidsymbols:
                print(letter)
                line = line.replace(letter, "")
        print(line)  # this should print the line without invalid symbols.

        words = line.split()  # Now get the words.

        for word in words:  # loop over the words, not over the characters of line
            # ... Do your counter stuff here ...
            pass

    return freq
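
To make the first point concrete, here is a small sketch of the difference between read() and readlines(). It uses io.StringIO as a stand-in for an opened file, and the sample text is made up for illustration.

```python
import io

# Made-up sample content; io.StringIO stands in for a real file object.
sample = "one two\nthree\n"

# read() returns one string, so iterating over it yields single characters.
text = io.StringIO(sample).read()
first_items = list(text)[:3]

# readlines() returns a list with one string per line (newlines included).
lines = io.StringIO(sample).readlines()

print(first_items)  # ['o', 'n', 'e']
print(lines)        # ['one two\n', 'three\n']
```

This is why the original loop printed single characters: `for line in lines` was walking over a string, not over a list of lines.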

Second, I'm highly suspicious of the workings of your counter method. If your intention is to count how often each word occurs, you could adopt this strategy:

  1. Check if word is in freq.
  2. If it is not in freq, add it and map it to 1. Otherwise, increment the number that the word was previously mapped to.
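
The two steps above could be sketched roughly as follows. The function name count_words and the sample sentence are only illustrative and are not part of the original code.

```python
def count_words(words):
    """Return a dict mapping each word to how many times it appears."""
    freq = {}
    for word in words:
        if word not in freq:
            freq[word] = 1   # first time we see this word
        else:
            freq[word] += 1  # seen before: increment the stored count
    return freq

# Illustrative usage with a made-up sentence.
print(count_words("this is a text text is that".split()))
# {'this': 1, 'is': 2, 'a': 1, 'text': 2, 'that': 1}
```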

This should set you on the right track.

Check this, it might be what you want. BTW, your code is not correct Python code. There are many issues there.

from collections import Counter

def parse_documentation(filename):
    with open(filename,"r") as fin:
        lines = fin.read()
    #for sym in ["`","~","!","@","#","$"]: lines = lines.replace(sym,'')
    # thanks to @gnibbler's comment; Python 2 spelled this lines.translate(None, "`~!@#$")
    lines = lines.translate(str.maketrans("", "", "`~!@#$"))
    freq = Counter(lines.split())
    return freq

Text file:

this is a text. text is that. @this #that
$this #!that is those

Results:

Counter({'this': 3, 'is': 3, 'that': 2, 'a': 1, 'that.': 1, 'text': 1, 'text.': 1, 'those': 1})

You might need line.split(' '); otherwise the for loop will iterate over individual letters.

....
for word in line.split(' '):
    count=counter(word)
...
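
As a side note on the suggestion above, split(' ') and split() with no argument behave differently on repeated spaces and trailing newlines. The sample line below is made up to show this.

```python
# Made-up line with a double space and a trailing newline.
line = "this  is a text.\n"

by_space = line.split(" ")   # splits on every single space character only
by_whitespace = line.split() # splits on any run of whitespace, incl. newlines

print(by_space)       # ['this', '', 'is', 'a', 'text.\n']
print(by_whitespace)  # ['this', 'is', 'a', 'text.']
```

For word counting, the no-argument form is usually the more convenient of the two, since it produces no empty strings and drops the newline.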


 