How to correctly count the occurrences of a given word in a string without counting it when it is a substring of a different word in Python?

Question

I want to count the occurrences of a given word in an article. I tried using the split method to cut the article into n pieces and derive the count from the number of pieces, like this:

def get_occur(str, word):
    lst = str.split(word)
    return len(lst) - 1

But the problem is that the word is also counted whenever it is a substring of another word. For example, I only want to count the occurrences of "sad" in the sentence "I am very sad and she is a saddist". The result should be one, but because "sad" is part of "saddist", the count comes out as two. If I search for " sad " instead, I miss words at the start and end of sentences. Also, I am dealing with a huge number of articles, so ideally I would not have to compare each word. How can I address this? Much appreciated.

Answer 1

As @schwobaseggl mentioned in the comments, the previous version of this answer would miss a word that comes right before a comma, and there may be other such cases, so I have updated the answer.

from nltk.tokenize import word_tokenize
text = word_tokenize(text)

This gives you a list of words. Now use the code below:

count = 0
for word in text:
    if word.lower() == 'sad':  # .lower() makes the comparison case-insensitive
        count += 1

Answer 2

You can use regular expressions:

import re

def count(text, pattern):
    return len(re.findall(rf"\b{pattern}\b", text, flags=re.IGNORECASE))

\b marks a word boundary, and the re.IGNORECASE flag makes the matching case-insensitive:

>>> count("Sadly, the SAD man is sad.", "sad")
2

If you want to count only exact-case occurrences (for example only lower-case "sad"), just omit the flag.
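If the word may contain regex metacharacters, or the same word is counted across many articles, a variant along these lines might help. This is only a sketch: make_counter and count_in are hypothetical names, and it assumes the word starts and ends with a letter, digit or underscore, because \b does not behave as a word edge next to punctuation such as the "+" in "c++".

```python
import re

def make_counter(word, ignore_case=True):
    """Build a function that counts whole-word occurrences of word."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape keeps characters such as "." or "?" in the word literal.
    compiled = re.compile(rf"\b{re.escape(word)}\b", flags)

    def count_in(text):
        # finditer avoids building a list of every match in a long article.
        return sum(1 for _ in compiled.finditer(text))

    return count_in

count_sad = make_counter("sad")
print(count_sad("I am very sad and she is a saddist"))  # 1
print(count_sad("Sadly, the SAD man is sad."))          # 2
```

Compiling the pattern once and reusing the returned function keeps the per-article work down to a single scan of the text.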

Answer 3

string = "I am very sad and she is sadder"
substring = " sad "

count = string.count(substring)
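As the question already points out, padding the word with spaces only finds occurrences that have a space on both sides. A small sketch of that limitation, using a hypothetical helper count_padded and a made-up second sentence:

```python
def count_padded(text, word):
    """Count occurrences of word that are surrounded by single spaces."""
    return text.count(f" {word} ")

# Works when the word sits in the middle of the sentence.
print(count_padded("I am very sad and she is sadder", "sad"))  # 1

# Misses the word at the very start and at the very end of the text.
print(count_padded("sad is how I feel, sad", "sad"))  # 0
```

Occurrences next to punctuation, such as "sad." or "sad,", are missed in the same way.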


Answer 1: score 0, 2021-07-23 06:16:35

Answer 2: score 0, accepted, 2021-07-23 06:45:22

Answer 3: score -1, 2021-07-23 06:16:36
