How to match alternatives with Python regex
Given string 1:
'''TOM likes to go swimming MARY loves to go to the playground JANE likes going shopping'''
I want to capture the text between only two names: either Tom and Mary, or Tom and Jane. If Mary appears before Jane, I would like to capture the text between Tom and Mary. However, if Jane appears first, I would like to capture the text between Tom and Jane.
I have written the following code:
import re

# `string` holds the text of string 1 above
text = re.compile(r'''(
TOM\s*
([\w\W]+)\s*
JANE|MARY
)''', re.VERBOSE)
text_out = text.search(string).group(1)
However, this code gives me the text between Tom and Jane, even though Mary appears first. I understand that this is because the pipe reads from left to right and therefore matches Jane first. Is there a way to code this so that it depends on who appears first in the text?
For example, in string 2:
'''TOM likes to go swimming JANE likes going shopping MARY loves to go to the playground '''
I would like to capture the text between Tom and Jane.
You need to fix your alternation: it must be enclosed in a non-capturing group, (?:JANE|MARY), and you need a lazy quantifier on [\w\W] (which I would replace with a dot plus the re.DOTALL modifier, so that the dot also matches line breaks):
(?s)TOM\s*(.+?)\s*(?:JANE|MARY)
See the regex demo.
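As a minimal sketch of how the fixed pattern could be used from Python (the helper name text_after_tom and the variable names are assumptions for illustration, not part of the original answer):

```python
import re

# Lazy quantifier plus a grouped alternation; (?s) lets the dot match newlines.
pattern = re.compile(r"(?s)TOM\s*(.+?)\s*(?:JANE|MARY)")

string1 = "TOM likes to go swimming MARY loves to go to the playground JANE likes going shopping"
string2 = "TOM likes to go swimming JANE likes going shopping MARY loves to go to the playground"


def text_after_tom(text):
    """Hypothetical helper: text between TOM and whichever name comes first, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


print(text_after_tom(string1))           # likes to go swimming
print(pattern.search(string1).group(0))  # TOM likes to go swimming MARY
print(pattern.search(string2).group(0))  # TOM likes to go swimming JANE
```

Group 0 is printed here only to show which name ended the match; group 1 holds the captured text. Passing re.DOTALL as a flag to re.compile should behave the same as the inline (?s).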
Without the (?:...|...) group, your regex matched either TOM, then any 1+ chars, as many as possible (that is, the regex grabbed the whole string and then backtracked to match the last occurrence of the subsequent subpattern, JANE), and then JANE, or just a MARY substring. Now, the fixed regex matches:
(?s) - DOTALL inline modifier
TOM - a literal char sequence
\s* - 0+ whitespaces
(.+?) - Group 1 (capturing): any 1+ chars, as few as possible, up to the first occurrence of the subsequent subpatterns
\s* - 0+ whitespaces
(?:JANE|MARY) - either the JANE or the MARY substring.
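To see why both changes matter, the following sketch compares the question's pattern (flattened onto one line, which is what re.VERBOSE reduces it to) with grouped greedy and grouped lazy variants. The variable names are illustrative assumptions.

```python
import re

string1 = "TOM likes to go swimming MARY loves to go to the playground JANE likes going shopping"

# The question's pattern: the two alternatives are
# "TOM\s*([\w\W]+)\s*JANE" and a bare "MARY".
original = re.compile(r"(TOM\s*([\w\W]+)\s*JANE|MARY)")
greedy_match = original.search(string1)
print(greedy_match.group(1))  # runs from TOM through the last JANE

# The ungrouped alternation accepts a lone MARY; the grouped one does not.
print(bool(re.fullmatch(r"TOM\s*JANE|MARY", "MARY")))      # True
print(bool(re.fullmatch(r"TOM\s*(?:JANE|MARY)", "MARY")))  # False

# Grouping alone is not enough: a greedy quantifier still runs to the last name.
greedy_grouped = re.search(r"(?s)TOM\s*(.+)\s*(?:JANE|MARY)", string1)
lazy_grouped = re.search(r"(?s)TOM\s*(.+?)\s*(?:JANE|MARY)", string1)
print("MARY" in greedy_grouped.group(1))  # True
print(lazy_grouped.group(1))              # likes to go swimming
```

The grouping fixes what the alternation applies to, and the lazy quantifier makes the match stop at the first name rather than the last.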