
Count the number of occurrences of each word in a text - Python

I know that I can find a word in a text/array with this:

if word in text:
    print('success')

What I want to do is read each word in a text and keep counting how many times each word is found (it is a simple counter task). But the thing is, I do not really know how to keep track of words that have already been read. In the end: how do I count the number of occurrences of each word?

I have thought of saving the words in an array (or even a multidimensional array, to save both the word and the number of times it appears, or in two parallel arrays), adding 1 every time a word already in that array appears again.

So then, when I read a word, could I check it with something similar to this:

if word not in wordsInText:
    print('success')
sentence = 'a quick brown fox jumped a another fox'

words = sentence.split(' ')

Solution 1:

result = {i:words.count(i) for i in set(words)}

Solution 2:

result = {}
for word in words:
    result[word] = result.get(word, 0) + 1

Solution 3:

from collections import Counter    
result = dict(Counter(words))
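All three solutions build the same word-to-count mapping; a quick sanity check, using the sample sentence from above:

```python
from collections import Counter

words = 'a quick brown fox jumped a another fox'.split(' ')

# Solution 1: count() per unique word
r1 = {i: words.count(i) for i in set(words)}

# Solution 2: manual accumulation with dict.get
r2 = {}
for word in words:
    r2[word] = r2.get(word, 0) + 1

# Solution 3: collections.Counter
r3 = dict(Counter(words))

assert r1 == r2 == r3
print(r1['fox'], r1['a'])  # 2 2
```

Note that solution 1 rescans the whole list once per unique word, so it is O(n·k); solutions 2 and 3 make a single pass.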

Now that we've established what you're trying to achieve, I can give you an answer. The first thing you need to do is convert the text into a list of words. While the split method might look like a good solution, it will create a problem in the actual counting when sentences end with a word followed by a full stop, a comma, or any other character. A good solution for this problem is NLTK. Assume that the text you have is stored in a variable called text. The code you are looking for would look something like this:

from itertools import chain
from collections import Counter
from nltk.tokenize import sent_tokenize, word_tokenize

text = "This is an example text. Let us use two sentences, so that it is more logical."
wordlist = list(chain(*[word_tokenize(s) for s in sent_tokenize(text)]))
print(Counter(wordlist))
# Counter({'.': 2, 'is': 2, 'us': 1, 'more': 1, ',': 1, 'sentences': 1, 'so': 1, 'This': 1, 'an': 1, 'two': 1, 'it': 1, 'example': 1, 'text': 1, 'logical': 1, 'Let': 1, 'that': 1, 'use': 1})

What I understand is that you want to keep track of words already read so that you can detect when you encounter a new word. Is that right? The easiest solution for that is to use a set, as it automatically removes duplicates. For instance:

known_words = set()
for word in text:
    if word not in known_words:
        print('found new word:', word)
    known_words.add(word)

On the other hand, if you need the exact number of occurrences of each word (this is called a "histogram" in maths), you have to replace the set with a dictionary:

histo = {}
for word in text:
    histo[word] = histo.get(word, 0) + 1
print(histo)

Note: in both solutions, I assume that text contains an iterable structure of words. As noted in other comments, str.split() is not totally safe for this.

I would use one of these methods:

1) If the word doesn't contain spaces, but the text does, use

for piece in text.split(" "):
    ...

Then your word should occur at most once in each piece, and be counted correctly. This fails if, for example, you want to count "Baden" twice in "Baden-Baden".
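Completing that loop might look like this (the word and text here are only illustrative):

```python
word = "fox"
text = "the quick brown fox saw another fox"

count = 0
for piece in text.split(" "):
    # Each piece contains no spaces, so a direct comparison suffices.
    if piece == word:
        count += 1

print(count)  # 2
```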

2) Use the string method find to get not only whether the word is there, but where it is. Count it, and then continue searching from beyond that point. text.find(word) returns either a position or -1.
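A minimal sketch of approach 2, wrapped in a hypothetical helper for clarity; it resumes each search just past the previous match, so the "Baden-Baden" case from approach 1 is handled:

```python
def count_occurrences(text, word):
    """Count every (non-overlapping) occurrence of word in text
    using str.find, resuming the search past each match."""
    count = 0
    pos = text.find(word)
    while pos != -1:
        count += 1
        pos = text.find(word, pos + len(word))
    return count

print(count_occurrences("Baden-Baden", "Baden"))  # 2
```

Be aware that this counts substrings, not whole words: count_occurrences("catalog", "cat") would return 1.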

Several options can be used, but I suggest you do the following:

  • Replace special characters in your text in order to normalize it.
  • Split the cleaned sentence.
  • Use collections.Counter

And the code will look like...

from collections import Counter

my_text = "Lorem ipsum; dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut. labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

special_characters = ',.;'
for char in special_characters:
    my_text = my_text.replace(char, ' ')

print(Counter(my_text.split()))

I believe the safer approach would be to use the NLTK answer, but sometimes, understanding what you are doing feels great.

There is no need to tokenize into sentences first. The answer from Alexander Ejbekov can be simplified as:

from collections import Counter
from nltk.tokenize import word_tokenize

text = "This is an example text. Let us use two sentences, so that it is more logical."
wordlist = word_tokenize(text) 
print(Counter(wordlist))
# Counter({'is': 2, '.': 2, 'This': 1, 'an': 1, 'example': 1, 'text': 1, 'Let': 1, 'us': 1, 'use': 1, 'two': 1, 'sentences': 1, ',': 1, 'so': 1, 'that': 1, 'it': 1, 'more': 1, 'logical': 1})
