Python Regex: Capture overlapping parts

Given a string s = "<foo>abcaaa<bar>a<foo>cbacba<foo>c", I'm trying to write a regular expression that extracts each angle-bracket tag together with the text immediately before and after it, like this:

<foo>abcaaa
abcaaa<bar>a
a<foo>cbacba
cbacba<foo>c

So the expected output should look like this:

["<foo>abcaaa", "abcaaa<bar>a", "a<foo>cbacba", "cbacba<foo>c"]

I found the question "How to find overlapping matches with a regexp?", which brought me a little closer to the desired result, but my regex still doesn't work.

regex = r"(?=([a-c]*)\<(\w+)\>([a-c]*))"

Any ideas how to solve this problem?

You need to set the left- and right-hand boundaries to < or > chars, or to the start/end of the string.
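To see why the boundaries matter, here is a minimal sketch that runs the pattern from the question unchanged. Without boundary checks the lookahead also succeeds in the middle of a text run, so suffixes of the intended pieces show up as extra hits. The names `attempt` and `hits` are only illustrative.

```python
import re

text = "<foo>abcaaa<bar>a<foo>cbacba<foo>c"

# The pattern from the question: a lookahead with three groups and no boundaries.
attempt = r"(?=([a-c]*)\<(\w+)\>([a-c]*))"
hits = re.findall(attempt, text)

# With several groups, findall returns one tuple per match position.
for left, tag, right in hits:
    print(repr(left), repr(tag), repr(right))

# More than the four wanted pieces come back, e.g. ('bcaaa', 'bar', 'a'),
# which starts one character into the "abcaaa" run.
print(len(hits))
```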

Use

import re
text = "<foo>abcaaa<bar>a<foo>cbacba<foo>c"
print( re.findall(r'(?=(?<![^<>])([a-c]*<\w+>[a-c]*)(?![^<>]))', text) )
# => ['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']

See the Python demo online and the regex demo.

Pattern details

  • (?= - start of a positive lookahead, which enables overlapping matches
    • (?<![^<>]) - the preceding character must be < or >, or the position is the start of the string
    • ([a-c]*<\w+>[a-c]*) - Group 1 (the value extracted): 0+ a, b or c chars, then <, 1+ word chars, > and again 0+ a, b or c chars
    • (?![^<>]) - end of string, or < or > must follow immediately
  • ) - end of the lookahead.
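The breakdown above can also be written into the pattern itself. The following is a sketch using re.VERBOSE, which lets whitespace and # comments appear inside the pattern; the name OVERLAP is just a label chosen here, and the regex is otherwise the same as in the answer.

```python
import re

# Same pattern as in the answer, spread out with re.VERBOSE so each part is annotated.
OVERLAP = re.compile(r"""
    (?=                      # lookahead: nothing is consumed, so matches may overlap
        (?<![^<>])           # previous char is < or >, or we are at the start of the string
        ([a-c]*<\w+>[a-c]*)  # group 1: left text, the tag, right text
        (?![^<>])            # next char is < or >, or we are at the end of the string
    )
""", re.VERBOSE)

text = "<foo>abcaaa<bar>a<foo>cbacba<foo>c"
result = OVERLAP.findall(text)
print(result)
# => ['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']
```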

You may use this regex code in Python:

>>> import re
>>> s = '<foo>abcaaa<bar>a<foo>cbacba<foo>c'
>>> reg = r'([^<>]*<[^>]*>)(?=([^<>]*))'
>>> print([''.join(i) for i in re.findall(reg, s)])
['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']

RegEx Demo

RegEx Details:

  • ([^<>]*<[^>]*>) : Capture group #1 matches 0 or more characters that are neither < nor >, followed by a <...> string.
  • (?=([^<>]*)) : Lookahead asserting that 0 or more non-<> characters follow the current position. Capture group #2 sits inside this lookahead, so the trailing text is captured without being consumed.
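A short sketch of why the ''.join(...) step is needed here: because the pattern has two groups, re.findall returns a tuple per match, with the consumed part (text plus tag) in group #1 and the peeked-at trailing text in group #2. The variable names `pairs` and `joined` are illustrative only.

```python
import re

s = '<foo>abcaaa<bar>a<foo>cbacba<foo>c'
reg = r'([^<>]*<[^>]*>)(?=([^<>]*))'

# One (consumed, peeked) tuple per match; the peeked text is reused
# as the leading text of the next match because it was never consumed.
pairs = re.findall(reg, s)
print(pairs)
# [('<foo>', 'abcaaa'), ('abcaaa<bar>', 'a'), ('a<foo>', 'cbacba'), ('cbacba<foo>', 'c')]

# Gluing the two groups back together gives the wanted pieces.
joined = [left + right for left, right in pairs]
print(joined)
```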

You can match overlapping content with standard regex syntax by placing capturing groups inside lookaround assertions. A lookaround can match part of the string without consuming it, so the same substring stays available for further matches. In this specific example, we match either the beginning of the string or a > as the anchor for a lookahead assertion that captures our actual targets:

(?:\A|>)(?=([a-c]*<\w+>[a-c]*))

See the regex demo.

In Python we then rely on re.findall() returning only the captured groups, rather than the whole match, when the expression contains capturing groups:

import re

text = '<foo>abcaaa<bar>a<foo>cbacba<foo>c'
expr = r'(?:\A|>)(?=([a-c]*<\w+>[a-c]*))'
captures = re.findall(expr, text)
print(captures)

Output:

['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']
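To make the overlap visible, the same expression can be run through re.finditer and the span of group 1 printed for each match. This is only a sketch; `spans` is a name chosen here. The overall match is at most one character long (empty at the start of the string, otherwise a single >), while the captured group reaches well past it.

```python
import re

text = '<foo>abcaaa<bar>a<foo>cbacba<foo>c'
expr = re.compile(r'(?:\A|>)(?=([a-c]*<\w+>[a-c]*))')

# m.span(1) is where group 1 was captured, not where the short overall match sits.
spans = [(m.span(1), m.group(1)) for m in expr.finditer(text)]
for span, value in spans:
    print(span, value)
# (0, 11) <foo>abcaaa
# (5, 17) abcaaa<bar>a
# (16, 28) a<foo>cbacba
# (22, 34) cbacba<foo>c
```

Each span starts before the previous one ends, which is exactly the overlap that a plain consuming pattern could not produce.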
