I have a large list of strings, and I want to check whether any of them occurs in a longer string. The list contains single-word strings as well as multi-word strings. To do so I have written the following code:
example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
example_text = "The patient has kneepain as wel as a headache"
emptylist = []
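for i in example_text.split():  # iterate over words, not single characters
    # substring test: 'pain' is found inside 'kneepain'
    res = [ele for ele in example_list if ele in i]
    emptylist.append(res)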
However, the problem here is that 'pain' is also added to emptylist, which it should not be: I only want an item from example_list to be added if it exactly matches a whole word or phrase in the text. I also tried using sets:
word_set = set(example_list)
phrase_set = set(example_text.split())
word_set.intersection(phrase_set)
This, however, chops up 'morning sickness' into 'morning' and 'sickness'. Does anyone know the correct way to tackle this problem?
Using PyParsing:
import pyparsing as pp
example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
example_text = "The patient has kneepain as wel as a headache morning sickness"
list_of_matches = []
for word in example_list:
    # Keyword only matches when the term is not part of a longer word
    rule = pp.OneOrMore(pp.Keyword(word))
    for t, s, e in rule.scanString(example_text):
        if t:
            list_of_matches.append(t[0])
print(list_of_matches)
Which yields:
['headache', 'sickness', 'morning sickness']
The other answers here already give good examples.
I made the example text a little more challenging by having 'pain' occur more than once as a separate word, and I also wanted to record where each match starts.
I worked with the following sentence and ended up with the code below it.
"The patient has not only kneepain but headache and arm pain, stomach pain and sickness"
import re
from collections import defaultdict
example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
example_text = "The patient has not only kneepain but headache and arm pain, stomach pain and sickness"
TruthFalseDict = defaultdict(list)
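for i in example_list:
    # re.escape guards against regex metacharacters in the search terms;
    # finditer always returns an iterator, so no truthiness check is needed
    for j in re.finditer(r'\b%s\b' % re.escape(i), example_text):
        TruthFalseDict[i].append(j.start())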
print(dict(TruthFalseDict))
The above gives me the following output.
{'pain': [55, 69], 'headache': [38], 'sickness': [78]}
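If the list is large, looping over every term and scanning the text each time can get slow. A possible variant, only a sketch and not what the answer above does, is to compile all terms into one alternation and scan the text once. The function name find_term_positions is made up for illustration. Note the behaviour differs: terms are tried longest first, so a location is reported once, and 'morning sickness' wins over the 'sickness' inside it.

```python
import re
from collections import defaultdict

def find_term_positions(terms, text):
    """Map each matched term to the start offsets of its whole-word matches."""
    # Longest terms first, so 'morning sickness' is preferred over 'sickness'
    ordered = sorted(terms, key=len, reverse=True)
    # One pattern for all terms; re.escape protects regex metacharacters
    pattern = re.compile(r'\b(?:%s)\b' % '|'.join(re.escape(t) for t in ordered))
    positions = defaultdict(list)
    for match in pattern.finditer(text):
        positions[match.group(0)].append(match.start())
    return dict(positions)

example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
example_text = "The patient has not only kneepain but headache and arm pain, stomach pain and sickness"
print(find_term_positions(example_list, example_text))
```

For the sentence above this finds the same offsets as the loop version, just keyed in the order the matches appear in the text.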
You should be able to use a regex with word boundaries:
>>> import re
>>> [word for word in example_list if re.search(r'\b{}\b'.format(word), example_text)]
['headache']
This will not match 'pain' in 'kneepain', since that occurrence does not begin at a word boundary. It will, however, correctly match terms that contain whitespace, such as 'morning sickness'.
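As a small, hedged extension of the same idea: if the terms might contain characters that are special in a regex (a dot, parentheses and so on), it is safer to pass them through re.escape first. The helper name exact_matches below is only illustrative. It also assumes every term starts and ends with a letter, digit or underscore, because \b only behaves as expected next to such characters.

```python
import re

def exact_matches(terms, text):
    """Return the terms that occur in text as whole words or phrases."""
    # \b on both sides rejects hits inside longer words such as 'kneepain'
    return [t for t in terms
            if re.search(r'\b' + re.escape(t) + r'\b', text)]

example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
example_text = "The patient has kneepain as wel as a headache"

print(exact_matches(example_list, example_text))
print(exact_matches(example_list, example_text + " morning sickness"))
```

The first call should give only 'headache'. The second also picks up 'sickness' and 'morning sickness', since both occur as whole words or phrases once the text is extended.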