python regex lookbehind lookahead

I posted a question a few days ago about how to catch the words in a text preceding a certain regex match.

With the proposed solutions I played around in regex101, trying to get the words that FOLLOW the match.

This is the code:

import re

content = """Lorem ipsum dolor sit amet (12,16) , consectetur 23 adipiscing elit. Curabitur (45) euismod scelerisque consectetur. Vivamus aliquam velit (46,48,49) at augue faucibus, id eleifend purus (34) egestas. Aliquam vitae mauris cursus, facilisis enim (23) condimentum, vestibulum enim. """

print(content)
# up to 5 preceding words, the number in parentheses, then up to 5 following words
pattern = re.compile(r"((?:\w+ ?){1,5}(?=\(\d))(\([\d]+\))(?: )(?:\w+ ?){1,5}")
matches = pattern.findall(content)
print('the matches are:')
print(matches)

The regex works and catches the numbers between parentheses.

This is the explanation of the regex:

((?:\w+ ?){1,5}(?=\(\d))(\([\d]+\))(?: )(?:\w+ ?){1,5}
________________________***********+++++++++++++++++++

____ = this is the "look behind" part. It looks for 1 to 5 words before the match, up to an opening (

**** = the actual regex ===> a number between parentheses

++++ = this is the part I intend to use to catch the words AFTER the match.

I tried it in regex101 and the result looked fine there:

(screenshot of the regex101 result)

But the result of the code is the following:

[('Curabitur ', '(45)'), ('id eleifend purus ', '(34)'), ('facilisis enim ', '(23)')]

As you can see, the list contains tuples with the preceding words first and then the match itself, BUT NOT THE FOLLOWING WORDS.

Where is the catch?

My expected result would be:

matches=[('Curabitur ', '(45)', '**euismod scelerisque consectetur**'), ('id eleifend purus ', '(34)', '**egestas**'), ('facilisis enim ', '(23)', '**condimentum**')]

Your regex needs a 3rd capturing group as well for the following words to be returned by findall:

>>> print(re.findall(r"((?:\w+ ?){1,5}(?=\(\d))(\(\d+\))(?: )((?:\w+ ?){1,5})", content))
[('Curabitur ', '(45)', 'euismod scelerisque consectetur'), ('id eleifend purus ', '(34)', 'egestas'), ('facilisis enim ', '(23)', 'condimentum')]

Note ((?:\w+ ?){1,5}) as the 3rd capture group.
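To see why the extra parentheses matter, here is a minimal sketch on a made-up string (the `text` value and variable names are only for illustration). When a pattern contains capturing groups, `re.findall` returns only those groups, so anything matched by a non-capturing `(?:...)` group is consumed but left out of the tuples.

```python
import re

# Hypothetical sample text, not taken from the question
text = "alpha (1) beta"

# Trailing group is non-capturing: it must match, but findall does not return it
two_groups = re.findall(r"(\w+) (\(\d+\)) (?:\w+)", text)

# Trailing group is capturing: it becomes the third element of each tuple
three_groups = re.findall(r"(\w+) (\(\d+\)) (\w+)", text)

print(two_groups)    # [('alpha', '(1)')]
print(three_groups)  # [('alpha', '(1)', 'beta')]
```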

Also note that [\d]+ is equivalent to \d+ .
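If positional tuples become hard to read, one possible variation is to give the three groups names and iterate with `re.finditer`. This is only a sketch: the group names `before`, `number` and `after` are my own choice, and the `.strip()` calls are an optional extra to drop the trailing space that the first group captures.

```python
import re

content = ("Lorem ipsum dolor sit amet (12,16) , consectetur 23 adipiscing elit. "
           "Curabitur (45) euismod scelerisque consectetur. Vivamus aliquam velit "
           "(46,48,49) at augue faucibus, id eleifend purus (34) egestas. Aliquam "
           "vitae mauris cursus, facilisis enim (23) condimentum, vestibulum enim. ")

# Same regex as in the answer, with \d+ instead of [\d]+ and named groups
# (the names are illustrative, not part of the original answer)
pattern = re.compile(
    r"(?P<before>(?:\w+ ?){1,5}(?=\(\d))"  # up to 5 words, then lookahead for "(digit"
    r"(?P<number>\(\d+\))"                 # a single number in parentheses
    r" "                                   # the separating space, not captured
    r"(?P<after>(?:\w+ ?){1,5})"           # up to 5 words after the number
)

results = [
    (m.group("before").strip(), m.group("number"), m.group("after").strip())
    for m in pattern.finditer(content)
]
print(results)
```

Each match object also exposes `m.groupdict()`, which can be handy if a dictionary per match is more convenient than a tuple.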
