Counting how many times a word appears in a text using Python and Counter

Question

I have a dictionary and a set of files containing a lot of text (IMDB reviews). I am trying to update the dictionary with the number of times each word has been seen. I can get it to count how many letters are in the text files, but I need the dictionary to hold word counts instead. This is my current code:

import glob
import codecs
from collections import Counter

word_counts = Counter() # I require this

for label in [POS_LABEL, NEG_LABEL]:
    for directory in [TRAIN_DIR, TEST_DIR]:
        for fn in glob.glob(directory + "/" + label + "/*txt"):
            doc = codecs.open(fn, 'r', 'utf8') # Open the file with UTF-8 encoding
            k = doc.read().split()
            print(k)

print(word_counts)

Here is the result of this:

Counter({' ': 11507297, 'e': 6036511, 't': 4565527, 'a': 3979283, 'o': 3754514, 'i': 3654084, 's': 3342252, 'n': 3321375, 'r': 3003149, 'h': 2701649, 'l': 2194831, 'd': 1715235, 'c': 1351529, 'u': 1345564, 'm': 1317735, 'f': 1085018, 'y': 1031725, 'g': 1016124, 'w': 936183, 'b': 929635, 'p': 824208, '.': 650520, 'v': 617921, ',': 544818, 'k': 414662, 'I': 264480, "'": 263760, 'T': 220030, '/': 215720, '>': 202250, '<': 202094, '-': 132038, '"': 131858, 'A': 129582, 'S': 119546, 'B': 89001, 'x': 84005, 'M': 83700, 'H': 79548, 'C': 77492, 'D': 76272, 'j': 74962, ')': 71268, 'W': 70424, '(': 69585, 'E': 62149, 'R': 59387, 'L': 55859, 'O': 54998, 'N': 51707, 'P': 49662, '!': 49164, 'F': 47474, 'G': 47367, 'J': 41145, 'z': 40445, '0': 37357, 'q': 37098, '1': 35792, '?': 32338, 'V': 29518, 'K': 28837, 'Y': 21969, ':': 19800, '9': 19392, 'U': 17488, '2': 15978, '*': 13916, ';': 13375, '3': 11002, '5': 10457, '8': 8874, '4': 8342, '7': 8277, '&': 7714, '6': 6209, 'Z': 4490, 'é': 3337, 'Q': 2842, '\x96': 2529, 'X': 1957, '`': 1861, '$': 1617, '\x85': 1479, '_': 997, '%': 867, '+': 642, '#': 640, '=': 623, '\x97': 596, '´': 434, ']': 254, '’': 254, '[': 239, '~': 230, 'á': 208, '{': 192, '}': 192, '@': 181, 'è': 169, 'ö': 160, '–': 149, 'ó': 126, '\x91': 121, '£': 117, 'ü': 106, '\t': 106, 'í': 100, '^': 95, 'ä': 91, 'ç': 82, 'à': 80, 'ñ': 78, 'ô': 66, '¨': 64, 'ï': 58, '“': 57, '”': 55, '»': 55, '«': 53, 'ã': 48, 'â': 45, '|': 45, '\xa0': 44, '¡': 43, '½': 39, 'å': 36, 'ê': 36, '\\': 35, 'ë': 32, '\x84': 31, '·': 29, 'ú': 23, 'ý': 22, 'ø': 19, '\x8e': 18, '\x9e': 18, '‘': 18, '\x95': 17, '…': 16, '¦': 14, '§': 13, 'É': 10, 'ß': 10, 'î': 9, '\x80': 8, 'ð': 8, 'Æ': 8, 'Õ': 7, '\uf0b7': 7, 'Á': 6, 'ì': 6, 'æ': 6, 'Ü': 6, 'û': 6, 'ù': 6, 'ò': 5, '\xad': 5, 'Ö': 5, '、': 5, '\x08': 4, '°': 4, '®': 4, 'ō': 4, '¾': 4, 'Ã': 3, '¿': 3, 'À': 3, 'Å': 3, 'Ó': 3, '\x8d': 3, '¤': 2, 'Ê': 2, '₤': 2, 'Ä': 2, 'È': 2, 'Þ': 2, '，': 2, '¢': 2, 'º': 2, '▼': 2, '★': 2, '³': 1, '\x9a': 1, 'Ø': 1, 
'Ï': 1, 'Â': 1, 'Ç': 1, 'Ð': 1, 'ı': 1, 'ğ': 1, '″': 1, '©': 1, 'ª': 1, '\x10': 1, 'Ż': 1, 'י': 1, 'ג': 1, 'א': 1, 'ל': 1, 'כ': 1, 'ר': 1, 'מ': 1, 'ו': 1, 'ן': 1, 'õ': 1})

I forgot to mention that I also tried something like this:

k = doc.read().split()
word_counts.update(k)

inside the third for loop.

I was told that I am on the right track if the following holds:

word_counts["movie"] == 61492

UPDATE:

for label in [POS_LABEL, NEG_LABEL]:
    for directory in [TRAIN_DIR, TEST_DIR]:
        for fn in glob.glob(directory + "/" + label + "/*txt"):
            doc = codecs.open(fn, 'r', 'utf8') # Open the file with UTF-8 encoding
            k = doc.read().split()
            word_counts.update(k)


print(word_counts["movie"])

This is where I am now. It prints 60762 for "movie", so I am a bit confused about why I am short of 61492.

Answer 1

from collections import Counter

c = Counter()
print('Initial :', c)
c.update('abcdaab')  # A string is counted character by character
print('Sequence:', c)
c.update({'a': 1, 'd': 5})  # Add the counts from this dictionary to the Counter
c.update('zz')  # Count each character of this string
print('Updated Value    :', c)

# Output of the code above:
# Initial : Counter()
# Sequence: Counter({'a': 3, 'b': 2, 'c': 1, 'd': 1})
# Updated Value    : Counter({'d': 6, 'a': 4, 'b': 2, 'z': 2, 'c': 1})

# You can use either method: the Counter above or the plain dictionary below.
def word_count(words):
    counts = dict()
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts

print(word_count(["foo", "bar", "foo", "foo", "bar"]))  # Prints {'foo': 3, 'bar': 2} on Python 3.7+
# A plain dict keeps insertion order on Python 3.7+; on older versions, use collections.OrderedDict to maintain the order

Is this what you are looking for? Pass in a list and get the count of each unique word back as a dictionary. Or do you specifically want to use Counter? The function above serves the same purpose.
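To connect this to the output in the question, here is a minimal sketch of why the Counter ended up holding letters. It assumes the text was passed to update() as one string rather than as a list of words; the sample sentence is made up for illustration.

```python
from collections import Counter

# Made-up sample text standing in for one review.
text = "the movie was good and the movie was long"

# Passing a string: Counter iterates over it character by character.
char_counts = Counter()
char_counts.update(text)

# Passing a list of words: each list element is counted as one item.
word_counts = Counter()
word_counts.update(text.split())

print(char_counts[" "])      # number of spaces in the text
print(word_counts["movie"])  # number of times the word 'movie' appears
```

So calling split() before update() is the step that switches the Counter from letters to words.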

Answer 2

I think this is what you are looking for. Tell me if it helps.

from collections import Counter

count = Counter(["car","van","van","car","van"])
print(dict(count)) # for each word and its occurrence 
print(count['van']) # for specific word and its occurrence
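Two other Counter behaviours may be useful when checking a result such as word_counts["movie"]. This is a small sketch reusing the same sample list; the word "bus" is just an example of a word that never appears.

```python
from collections import Counter

count = Counter(["car", "van", "van", "car", "van"])

# Looking up a word that was never seen returns 0 instead of raising KeyError.
print(count["bus"])

# most_common(n) returns the n most frequent words with their counts.
print(count.most_common(1))
```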

Answer 3

I think you're nearly there. I would also strip punctuation characters from each side of a word and convert the word to lower case, so that 'Movie' and 'movie' are treated as the same word.

Try this:

import codecs
import glob
import string
from collections import Counter

word_counts = Counter()

for label in [POS_LABEL, NEG_LABEL]:
    for directory in [TRAIN_DIR, TEST_DIR]:
        for fn in glob.glob(directory + "/" + label + "/*txt"):
            doc = codecs.open(fn, 'r', 'utf8') # Open the file with UTF-8 encoding
            # Strip punctuation from both ends of each word and lower-case it
            k = [word.strip(string.punctuation).lower() for word in doc.read().split()]
            word_counts.update(k)
            doc.close()  # remember to close the file
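The normalisation described above can be tried on a small in-memory sample. This is only a sketch under assumptions: the two reviews are made up, tokenize is a hypothetical helper, and the removal of `<br />` tags is a guess based on the large '<', '>' and '/' counts in the question's output. It is not claimed to reproduce the expected 61492.

```python
import re
import string
from collections import Counter

# Hypothetical sample reviews standing in for the IMDB files.
reviews = [
    'Great movie!<br /><br />The "Movie" was long.',
    "A bad movie, but a fun movie.",
]

def tokenize(text):
    """Lower-case the text, drop <br /> tags, and strip surrounding punctuation."""
    text = re.sub(r"<br\s*/?>", " ", text.lower())
    words = (word.strip(string.punctuation) for word in text.split())
    return [word for word in words if word]  # skip tokens that were only punctuation

word_counts = Counter()
for review in reviews:
    word_counts.update(tokenize(review))

print(word_counts["movie"])
```

Without the tag removal, a token such as 'movie!<br' would survive strip(), because strip() only removes punctuation at the two ends of a token. Differences like this in how words are split and cleaned are one plausible reason two counts of "movie" can disagree.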

Answer 1: 0, 2020-02-07 05:53:43
Answer 2: 0, 2020-02-07 06:29:06
Answer 3: 0, accepted, 2020-02-07 06:33:01
