
Find all sentences containing specific words

I have a string consisting of several sentences and want to find every sentence that contains at least one of the given keywords, i.e. keyword1 or keyword2:

import re

s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "

pattern = re.compile(r"([A-Z][^\.!?].*(keyword1)|(keyword2).*[\.!?])\s")
for match in pattern.findall(s):
    print(match)

Output:

('This is a sentence which contains keyword1', 'keyword1', '')
('keyword2 is inside this sentence. ', '', 'keyword2')

Expected Output:

('This is a sentence which contains keyword1', 'keyword1', '')
('And keyword2 is inside this sentence. ', '', 'keyword2')

As you can see, the second match doesn't contain the whole sentence in the first group. What am I missing here?

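Alternation (|) has the lowest precedence in a regex, so in your pattern it splits everything inside group 1 into two separate alternatives: [A-Z][^\.!?].*(keyword1) on the left and (keyword2).*[\.!?] on the right. That is why the second match starts at keyword2. Wrap the keyword alternation in a non-capturing group (?:...) so that it covers only the keywords, and use a negated character class [^.!?]* instead of .* so that a match cannot run past the end of a sentence.

re.findall then returns the capture group values as a tuple: group 1 holds the whole sentence, and groups 2 and 3 hold whichever keyword matched. The group that did not take part in the match is reported as an empty string.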

([A-Z][^.!?]*(?:(keyword1)|(keyword2))[^.!?]*[.!?])\s

Explanation

  • ( Capture group 1
    • [A-Z][^.!?]* Match an uppercase char A-Z, then zero or more chars other than . ! ?
    • (?:(keyword1)|(keyword2)) Match either keyword, each captured in its own group (2 or 3)
    • [^.!?]*[.!?] Match zero or more chars other than . ! ? and then one of . ! ?
  • ) Close group 1
  • \s Match a whitespace char
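
To see what the non-capturing group changes, the two patterns can be run side by side. This is only a sketch; the variable names original and fixed are chosen here for illustration and do not come from the question.

```python
import re

s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "

# Pattern from the question: the | splits all of group 1 into two alternatives,
# so the right-hand alternative can only start at "keyword2".
original = re.compile(r"([A-Z][^\.!?].*(keyword1)|(keyword2).*[\.!?])\s")

# Pattern from this answer: (?:...) confines the | to the two keywords.
fixed = re.compile(r"([A-Z][^.!?]*(?:(keyword1)|(keyword2))[^.!?]*[.!?])\s")

# Collect group 1 (the sentence) of every match.
original_sentences = [m.group(1) for m in original.finditer(s)]
fixed_sentences = [m.group(1) for m in fixed.finditer(s)]

print(original_sentences[-1].startswith("keyword2"))  # True
print(fixed_sentences[-1].startswith("And"))  # True
```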


Example

import re

s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "

pattern = re.compile(r"([A-Z][^.!?]*(?:(keyword1)|(keyword2))[^.!?]*[.!?])\s")
for match in pattern.findall(s):
    print(match)

Output

('This is a sentence which contains keyword1.', 'keyword1', '')
('And keyword2 is inside this sentence.', '', 'keyword2')
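
If only the sentences are needed, and not the keyword that matched, the capture groups can be dropped so that re.findall returns plain strings. The sketch below is one possible variant, not part of the answer above: the helper name find_sentences is hypothetical, re.escape is used on the assumption that the keywords should be matched literally, and the lookahead (?=\s|$) is swapped in for the trailing \s so that a sentence at the very end of the string also matches.

```python
import re


def find_sentences(text, keywords):
    """Return every sentence in text that contains at least one keyword."""
    # Escape each keyword so characters such as "." or "+" are matched literally.
    alternation = "|".join(re.escape(k) for k in keywords)
    # No capture groups, so findall returns the full match as a string.
    pattern = re.compile(
        r"[A-Z][^.!?]*(?:" + alternation + r")[^.!?]*[.!?](?=\s|$)"
    )
    return pattern.findall(text)


s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "
for sentence in find_sentences(s, ["keyword1", "keyword2"]):
    print(sentence)
```

Like the pattern above, this assumes that every sentence starts with an uppercase ASCII letter and ends with one of . ! ?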

You can try the following regular expression:

[.?!]*\s*(.*(keyword1)[^.?!]*[.?!]|.*(keyword2)[^.?!]*[.?!])

Code:

import re

s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "

pattern = re.compile(r"[.?!]*\s*(.*(keyword1)[^.?!]*[.?!]|.*(keyword2)[^.?!]*[.?!])")
for match in pattern.findall(s):
    print(match)

Output:

('This is a sentence which contains keyword1.', 'keyword1', '')
('And keyword2 is inside this sentence.', '', 'keyword2')
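
One thing to be aware of with this pattern: .* is not limited to a single sentence, so when a sentence without a keyword comes before one that has a keyword, both can end up in group 1. The sketch below illustrates this with a made-up input string (text is not from the question) and compares it with a variant, named bounded here purely for illustration, in which each .* is replaced by [^.?!]*.

```python
import re

# Made-up input: the first sentence contains no keyword.
text = "Nothing to see here. And keyword2 is inside this sentence. "

pattern = re.compile(r"[.?!]*\s*(.*(keyword1)[^.?!]*[.?!]|.*(keyword2)[^.?!]*[.?!])")

# The leading .* runs across the first full stop, so group 1 spans both sentences.
print(pattern.findall(text)[0][0])
# → Nothing to see here. And keyword2 is inside this sentence.

# Variant: [^.?!]* cannot cross a sentence terminator.
bounded = re.compile(
    r"[.?!]*\s*([^.?!]*(keyword1)[^.?!]*[.?!]|[^.?!]*(keyword2)[^.?!]*[.?!])"
)
print(bounded.findall(text)[0][0])
# → And keyword2 is inside this sentence.
```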
