
Find all sentences containing specific words

I have a string consisting of sentences and want to find all sentences that contain at least one specific keyword, i.e. keyword1 or keyword2:

import re

s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "

pattern = re.compile(r"([A-Z][^\.!?].*(keyword1)|(keyword2).*[\.!?])\s")
for match in pattern.findall(s):
    print(match)

Output:

('This is a sentence which contains keyword1', 'keyword1', '')
('keyword2 is inside this sentence. ', '', 'keyword2')

Expected output:

('This is a sentence which contains keyword1', 'keyword1', '')
('And keyword2 is inside this sentence. ', '', 'keyword2')

As you can see, the second match doesn't contain the whole sentence in the first group. What am I missing here?

You can use a negated character class so that the match does not cross ., ! or ?, and put the keywords in the same group to prevent the empty string in the result.

re.findall then returns the capture group values: group 1 holds the whole sentence, and groups 2, 3, etc. hold whichever keyword matched.

([A-Z][^.!?]*(?:(keyword1)|(keyword2))[^.!?]*[.!?])\s

Explanation

  • ( Capture group 1
    • [A-Z][^.!?]* Match an uppercase char A-Z, then optionally any chars except ., ! or ?
    • (?:(keyword1)|(keyword2)) Capture one of the keywords in its own group
    • [^.!?]*[.!?] Match any chars except ., ! or ?, then match one of ., ! or ?
  • ) Close group 1
  • \s Match a whitespace char
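The breakdown above can also be written directly into the pattern with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is only a sketch of the same regex in annotated form; the sample string is the one from the question.

```python
import re

s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "

# Same pattern as above, laid out with re.VERBOSE so each part carries a comment.
pattern = re.compile(
    r"""
    (                            # group 1: the whole sentence
      [A-Z][^.!?]*               # an uppercase letter, then any chars except . ! ?
      (?:(keyword1)|(keyword2))  # group 2 or group 3: whichever keyword matched
      [^.!?]*[.!?]               # rest of the sentence plus its closing punctuation
    )
    \s                           # one whitespace char after the sentence
    """,
    re.VERBOSE,
)

matches = pattern.findall(s)
for match in matches:
    print(match)
```

A group that did not take part in a match shows up as an empty string in the tuples returned by findall.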

See a regex demo and a Python demo.

Example

import re

s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "

pattern = re.compile(r"([A-Z][^.!?]*(?:(keyword1)|(keyword2))[^.!?]*[.!?])\s")
for match in pattern.findall(s):
    print(match)

Output

('This is a sentence which contains keyword1.', 'keyword1', '')
('And keyword2 is inside this sentence.', '', 'keyword2')
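The pattern above keeps the two keywords in separate groups, so each tuple still contains one empty string. If you only need to know which keyword matched, a single alternation group avoids that. A minimal sketch, assuming the same sample string (the names results, sentence and keyword are illustrative only):

```python
import re

s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "

# One alternation group for both keywords, so findall yields
# (sentence, keyword) pairs with no empty placeholder strings.
pattern = re.compile(r"([A-Z][^.!?]*(keyword1|keyword2)[^.!?]*[.!?])\s")

results = pattern.findall(s)
for sentence, keyword in results:
    print(keyword, "->", sentence)
```

Because [^.!?]* is greedy, a sentence containing both keywords would report the last one in the keyword slot.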

You can try the following regular expression:

[.?!]*\s*(.*(keyword1)[^.?!]*[.?!]|.*(keyword2)[^.?!]*[.?!])

Code:

import re

s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "

pattern = re.compile(r"[.?!]*\s*(.*(keyword1)[^.?!]*[.?!]|.*(keyword2)[^.?!]*[.?!])")
for match in pattern.findall(s):
    print(match)

Output:

('This is a sentence which contains keyword1.', 'keyword1', '')
('And keyword2 is inside this sentence.', '', 'keyword2')
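One thing worth checking before relying on this pattern: the leading .* is not limited to a single sentence, so when an earlier sentence has no keyword it can be pulled into the match. The sketch below compares it with the negated-character-class pattern from the other answer on a made-up input string (the string and variable names are illustrative, not from the question):

```python
import re

# Hypothetical input: the first sentence contains no keyword.
s = "No match here. But keyword1 is here. "

# Pattern from this answer: ".*" may run across sentence boundaries.
dot_star = re.compile(r"[.?!]*\s*(.*(keyword1)[^.?!]*[.?!]|.*(keyword2)[^.?!]*[.?!])")

# Pattern from the other answer: "[^.!?]*" stays inside one sentence.
negated = re.compile(r"([A-Z][^.!?]*(?:(keyword1)|(keyword2))[^.!?]*[.!?])\s")

dot_star_sentences = [m.group(1) for m in dot_star.finditer(s)]
negated_sentences = [m.group(1) for m in negated.finditer(s)]

print(dot_star_sentences)  # ['No match here. But keyword1 is here.']
print(negated_sentences)   # ['But keyword1 is here.']
```

If every sentence in your input is guaranteed to contain a keyword, both patterns give the same result, as in the example above.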
