
How to correctly count the occurrences of a given word in a string without counting the word that is a substring of a different word in Python?

I want to count the occurrences of a given word in an article. I tried using the split method to cut the article into n pieces and take the length of the result, like this:

def get_occur(str, word):
    lst = str.split(word)
    return len(lst) - 1

The problem is that the word is also counted whenever it is a substring of another word. For example, I only want to count the occurrences of "sad" in the sentence "I am very sad and she is a saddist". The count should be one, but because "sad" is part of "saddist", that occurrence gets counted as well. If I search for " sad " instead, I miss words at the start and end of sentences. Also, I am dealing with a huge number of articles, so ideally I would not have to compare each word. How can I address this? Much appreciated.

As @schwobaseggl pointed out in the comments, the previous version missed a word that comes right before a comma, and there may be other such cases, so I have updated the answer.

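from nltk.tokenize import word_tokenize

text = "I am very sad and she is a saddist"  # example sentence from the question
text = word_tokenize(text)  # list of word and punctuation tokens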

This gives you a list of tokens (words and punctuation marks). Now count the matching tokens with the code below:

count = 0
for word in text:
    if word.lower() == 'sad':  # .lower() makes the comparison case-insensitive
        count += 1
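If installing NLTK is not an option, the same token-comparison idea can be sketched with the standard library alone. The helper name `count_word_tokens` and the `\w+` tokenizer are assumptions made for illustration; `\w+` is a much cruder tokenizer than `word_tokenize` (for instance, it splits "don't" into two tokens).

```python
import re
from collections import Counter


def count_word_tokens(text, word):
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    # Lowercase first, then split into runs of letters, digits and underscores.
    tokens = re.findall(r"\w+", text.lower())
    # Counter tallies every token in a single pass over the text.
    return Counter(tokens)[word.lower()]


print(count_word_tokens("I am very sad and she is a saddist", "sad"))  # 1
```

Building the `Counter` once per article may also be convenient when several different words have to be looked up in the same text.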

You can use regular expressions:

import re

def count(text, pattern):
    return len(re.findall(rf"\b{pattern}\b", text, flags=re.IGNORECASE))

\b marks word boundaries, and the re.IGNORECASE flag makes the matching case-insensitive:

>>> count("Sadly, the SAD man is sad.", "sad")
2

If you want the match to be case-sensitive (for example, to count only lower-case occurrences of "sad"), just omit the flag.
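Two refinements may help when the word comes from user input or when many articles are scanned. The sketch below escapes the word with `re.escape`, so that characters such as `.` or `+` are matched literally, and compiles the pattern once so it can be reused across texts. The names `make_word_counter`, `count_sad` and `articles` are hypothetical and not part of the answer above.

```python
import re


def make_word_counter(word, ignore_case=True):
    """Return a function that counts whole-word occurrences of `word`."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape makes regex metacharacters in `word` match literally.
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags)

    def count(text):
        # finditer avoids building a list of every match.
        return sum(1 for _ in pattern.finditer(text))

    return count


count_sad = make_word_counter("sad")
articles = ["Sadly, the SAD man is sad.", "I am very sad and she is a saddist"]
print([count_sad(article) for article in articles])  # [2, 1]
```

Note that `\b` only behaves as a word boundary here if the word itself starts and ends with a letter, digit or underscore.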

string = "I am very sad and she is sadder"
substring = " sad "

count = string.count(substring)
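As the question points out, searching for " sad " misses the word at the start or end of the text and next to punctuation. In addition, `str.count` counts non-overlapping matches, so `" sad sad ".count(" sad ")` returns 1 because the two matches would share the middle space. One rough workaround, sketched here under the assumption that only ASCII punctuation matters, is to replace punctuation with spaces, split on whitespace, and count list items instead. The helper name `count_by_split` is hypothetical.

```python
import string


def count_by_split(text, word):
    """Count case-sensitive, whitespace-delimited occurrences of `word`."""
    # Map every ASCII punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    # split() with no argument also drops leading and trailing whitespace.
    words = text.translate(table).split()
    return words.count(word)


print(count_by_split("I am very sad and she is sadder", "sad"))  # 1
print(count_by_split("sad sad, sad.", "sad"))  # 3
```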
