
How to find the count/occurrence of one string (can be multi-word) in another string (sentence) in Python

I have to count the occurrences of a string (which can be one or more words) in another string (a sentence), and the match should not be case-sensitive.

For instance -

a = "Hi my name is Alex and hi to you as well. How high is the building? The highest floor is 18th. Highlights .... She said hi as well. Do you know highlights of the match ... hi."

b = "hi" #word/sentence to find count of

I tried -

a.lower().count(b) 

which returns

>> 8 

while the required answer should be

>> 4.

For multi-word strings, this method seems to work, but I am not sure about the limiting cases. How can I fix this?
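
For example, I would expect a phrase such as "is a" (just a guess at a limiting case) to still match inside "is Alex":

a.lower().count("is a")  # -> 1, matched inside "is Alex" even though "is a" never appears as a standalone phrase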

You can use re.findall to search for the substring with leading and trailing word boundaries:

import re

print(len(re.findall(r'\b{}\b'.format(b), a, re.I))) # -> 4
#                      ^   ^
#                      |___|_ word boundaries  ^
#                                              |_ case insensitive match
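
This also works when b is a multi-word phrase. If the search term might contain characters that are special in regular expressions, escaping it with re.escape before building the pattern is safer; a minimal sketch, using a hypothetical phrase rather than the b from the question:

import re

phrase = "hi to you"  # hypothetical multi-word search term
pattern = r'\b{}\b'.format(re.escape(phrase))  # escape any regex metacharacters in the phrase
print(len(re.findall(pattern, a, re.I)))  # -> 1 for the example sentence above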

The function works just fine: the sequence "hi" appears 8 times in the string. Since you want it only as words, you'll need to figure out how you can differentiate the word "hi" from the incidental appearance in other words, such as "chipper".

One common way is to use the re package (regular expressions), but that may be more learning than you want to do right now.

A better way at the moment would be to split the string into words before you check each:

word_list = a.lower().split()
b_count = word_list.count(b)

Note that this splits only on whitespace, so punctuation stays attached to the tokens: the final "hi." in the example won't be counted, and "hi" in "hi-performance" still won't be found. You'd need to strip punctuation or split on other separators as well; one way is sketched below.
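
A minimal sketch of one workaround, using the standard string module to strip leading and trailing punctuation from each token before counting:

import string

word_list = [w.strip(string.punctuation) for w in a.lower().split()]
b_count = word_list.count(b)  # -> 4 for the example above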

"Spliting" a sentence into words is not trivial.

There is a package in Python for that: nltk.

First, install the package using pip or your system's package manager.

Then run ipython and call nltk.download() to download the "punkt" data: type d, then punkt, then q to quit.

Then use

import nltk

tokens = nltk.word_tokenize(a)
len(list(filter(lambda x: x.lower() == b, tokens)))

It returns 4.
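
As a self-contained sketch (assuming nltk is installed; nltk.download('punkt') is the programmatic equivalent of the interactive downloader above):

import nltk

nltk.download('punkt')  # one-time download of the tokenizer data

tokens = nltk.word_tokenize(a)
print(sum(token.lower() == b for token in tokens))  # -> 4 for the example sentence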

Use str.split() and filter out punctuation with regex:

import re
a = "Hi my name is Alex and hi to you as well. How high is the building? The highest floor is 18th. Highlights .... She said hi as well. Do you know highlights of the match ... hi."
b = "hi"
final_count = sum(re.sub(r"\W+", '', i.lower()) == b for i in a.split())
print(final_count)

Output:

4
