
Counting how many times a word appears in a text using Python and Counter

I have a dictionary and a file with a lot of text (IMDB reviews). I am trying to update the dictionary with the number of times each word has been seen. I can get it to count how many letters are in the text file, but I need the dictionary to hold a count of the words instead. This is my current code:

import glob
import codecs
from collections import Counter

word_counts = Counter() # I require this

for label in [POS_LABEL, NEG_LABEL]:
    for directory in [TRAIN_DIR, TEST_DIR]:
        for fn in glob.glob(directory + "/" + label + "/*txt"):
            doc = codecs.open(fn, 'r', 'utf8') # Open the file with UTF-8 encoding
            k = doc.read().split()
            print(k)

print(word_counts)

Here is the result of this:

Counter({' ': 11507297, 'e': 6036511, 't': 4565527, 'a': 3979283, 'o': 3754514, 'i': 3654084, 's': 3342252, 'n': 3321375, 'r': 3003149, 'h': 2701649, 'l': 2194831, 'd': 1715235, 'c': 1351529, 'u': 1345564, 'm': 1317735, 'f': 1085018, 'y': 1031725, 'g': 1016124, 'w': 936183, 'b': 929635, 'p': 824208, '.': 650520, 'v': 617921, ',': 544818, 'k': 414662, 'I': 264480, "'": 263760, 'T': 220030, '/': 215720, '>': 202250, '<': 202094, '-': 132038, '"': 131858, 'A': 129582, 'S': 119546, 'B': 89001, 'x': 84005, 'M': 83700, 'H': 79548, 'C': 77492, 'D': 76272, 'j': 74962, ')': 71268, 'W': 70424, '(': 69585, 'E': 62149, 'R': 59387, 'L': 55859, 'O': 54998, 'N': 51707, 'P': 49662, '!': 49164, 'F': 47474, 'G': 47367, 'J': 41145, 'z': 40445, '0': 37357, 'q': 37098, '1': 35792, '?': 32338, 'V': 29518, 'K': 28837, 'Y': 21969, ':': 19800, '9': 19392, 'U': 17488, '2': 15978, '*': 13916, ';': 13375, '3': 11002, '5': 10457, '8': 8874, '4': 8342, '7': 8277, '&': 7714, '6': 6209, 'Z': 4490, 'é': 3337, 'Q': 2842, '\x96': 2529, 'X': 1957, '`': 1861, '$': 1617, '\x85': 1479, '_': 997, '%': 867, '+': 642, '#': 640, '=': 623, '\x97': 596, '´': 434, ']': 254, '’': 254, '[': 239, '~': 230, 'á': 208, '{': 192, '}': 192, '@': 181, 'è': 169, 'ö': 160, '–': 149, 'ó': 126, '\x91': 121, '£': 117, 'ü': 106, '\t': 106, 'í': 100, '^': 95, 'ä': 91, 'ç': 82, 'à': 80, 'ñ': 78, 'ô': 66, '¨': 64, 'ï': 58, '“': 57, '”': 55, '»': 55, '«': 53, 'ã': 48, 'â': 45, '|': 45, '\xa0': 44, '¡': 43, '½': 39, 'å': 36, 'ê': 36, '\\': 35, 'ë': 32, '\x84': 31, '·': 29, 'ú': 23, 'ý': 22, 'ø': 19, '\x8e': 18, '\x9e': 18, '‘': 18, '\x95': 17, '…': 16, '¦': 14, '§': 13, 'É': 10, 'ß': 10, 'î': 9, '\x80': 8, 'ð': 8, 'Æ': 8, 'Õ': 7, '\uf0b7': 7, 'Á': 6, 'ì': 6, 'æ': 6, 'Ü': 6, 'û': 6, 'ù': 6, 'ò': 5, '\xad': 5, 'Ö': 5, '、': 5, '\x08': 4, '°': 4, '®': 4, 'ō': 4, '¾': 4, 'Ã': 3, '¿': 3, 'À': 3, 'Å': 3, 'Ó': 3, '\x8d': 3, '¤': 2, 'Ê': 2, '₤': 2, 'Ä': 2, 'È': 2, 'Þ': 2, ',': 2, '¢': 2, 'º': 2, '▼': 2, '★': 2, '³': 1, '\x9a': 1, 'Ø': 1, 
'Ï': 1, 'Â': 1, 'Ç': 1, 'Ð': 1, 'ı': 1, 'ğ': 1, '″': 1, '©': 1, 'ª': 1, '\x10': 1, 'Ż': 1, 'י': 1, 'ג': 1, 'א': 1, 'ל': 1, 'כ': 1, 'ר': 1, 'מ': 1, 'ו': 1, 'ן': 1, 'õ': 1})

I forgot to mention that I also tried something like this:

k = doc.read().split()
word_counts.update(k)

inside the third for loop.

I was told I was heading in the right direction:

if word_counts["movie"] == 61492

UPDATE:

for label in [POS_LABEL, NEG_LABEL]:
    for directory in [TRAIN_DIR, TEST_DIR]:
        for fn in glob.glob(directory + "/" + label + "/*txt"):
            doc = codecs.open(fn, 'r', 'utf8') # Open the file with UTF-8 encoding
            k = doc.read().split()
            word_counts.update(k)


print(word_counts["movie"])

This is where I am now. It prints 60762 for "movie", so I am a bit confused about why I am short of 61492.

from collections import Counter
c = Counter()
print('Initial :', c)
c.update('abcdaab')  # A string is counted character by character
print('Sequence:', c)
c.update({'a': 1, 'd': 5})  # Add the counts from this dictionary to the Counter
c.update('zz')  # Add the characters of this string to the Counter
print('Updated Value    :', c)

# You can use either method: the Counter above or the plain dict below
'''
Output of above Code:
Initial : Counter()
Sequence: Counter({'a': 3, 'b': 2, 'c': 1, 'd': 1})
Updated Value    : Counter({'d': 6, 'a': 4, 'b': 2, 'z': 2, 'c': 1})
'''
def word_count(words):
    counts = dict()
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts

print(word_count(["foo", "bar", "foo", "foo", "bar"]))  # Prints {'foo': 3, 'bar': 2} on Python 3.7+
# Plain dicts keep insertion order on Python 3.7+; on older versions use collections.OrderedDict

Is this what you are looking for? Pass in the list and get back a dictionary with a count for each unique word. Or do you specifically want to use Counter? It serves the same purpose.
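The letter counts shown in the question are what Counter.update produces when it is handed a string rather than a list of words, because iterating a string yields single characters. Here is a minimal sketch of the difference, using a made-up sample sentence (the text and variable names are illustrative, not taken from the dataset):

```python
from collections import Counter

text = "the movie was a good movie"  # made-up sample, not from the IMDB data

# Passing the string itself counts characters, because iterating a str yields characters.
char_counts = Counter()
char_counts.update(text)

# Splitting first yields a list of words, so whole words are counted.
word_counts = Counter()
word_counts.update(text.split())

print(char_counts["o"])      # how often the letter 'o' occurs
print(word_counts["movie"])  # how often the word 'movie' occurs
```

So as long as the value passed to update is the list returned by split(), the keys should be words rather than letters.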

I think this is what you are looking for. Tell me if it helps.

from collections import Counter

count = Counter(["car","van","van","car","van"])
print(dict(count)) # for each word and its occurrence 
print(count['van']) # for specific word and its occurrence
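If it helps to sanity-check the totals, Counter also has a most_common method, and it returns 0 for words it has never seen. This is a small sketch using the same toy list as above; "bus" is just an example of a missing word:

```python
from collections import Counter

count = Counter(["car", "van", "van", "car", "van"])

# most_common(n) returns the n highest counts as (word, count) pairs, largest first.
top = count.most_common(1)
print(top)

# A Counter returns 0 for a missing key instead of raising KeyError.
print(count["bus"])
```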

I think you're nearly there. I would also strip punctuation characters from each side of a word and convert the word to lower case, so that 'Movie' and 'movie' are treated as the same word.

Try this:

import codecs
import glob
import string
from collections import Counter

word_counts = Counter()

for label in [POS_LABEL, NEG_LABEL]:
    for directory in [TRAIN_DIR, TEST_DIR]:
        for fn in glob.glob(directory + "/" + label + "/*txt"):
            # Open the file with UTF-8 encoding; the with block closes it afterwards
            with codecs.open(fn, 'r', 'utf8') as doc:
                # Strip punctuation from both ends of each word and lowercase it
                k = [word.strip(string.punctuation).lower() for word in doc.read().split()]
            word_counts.update(k)
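To see why this normalization can change the total for a word like "movie", here is a small self-contained sketch on made-up review strings. The function name count_words and the sample reviews are hypothetical, and the thread does not say how the expected figure of 61492 was computed, so treat this as one plausible explanation rather than the confirmed cause:

```python
import string
from collections import Counter


def count_words(texts, normalize=True):
    """Count whitespace-separated tokens across an iterable of strings."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        if normalize:
            # Strip punctuation from both ends and lowercase each token.
            tokens = [t.strip(string.punctuation).lower() for t in tokens]
            # Tokens made only of punctuation become empty strings; drop them.
            tokens = [t for t in tokens if t]
        counts.update(tokens)
    return counts


# Made-up sample reviews, not taken from the IMDB data.
reviews = ["Movie of the year!", "A great movie.", "I loved this movie - really."]

raw = count_words(reviews, normalize=False)
clean = count_words(reviews, normalize=True)
print(raw["movie"], clean["movie"])
```

Without normalization, "Movie" and "movie." are stored under their own keys, so the count for "movie" comes out lower than the number of times the word actually appears.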


 