
How to find all matches with a regex where part of the match overlaps

I have a long.txt file. I want to find all the matching results with a regex.

For example:

import re

test_str = 'ali. veli. ahmet.'
src = re.finditer(r'(\w+\.\s){1,2}', test_str, re.MULTILINE)
print(*src)

This code returns:

<re.Match object; span=(0, 11), match='ali. veli. '>

I need:

['ali. veli', 'veli. ahmet.']

How can I do that with a regex?

The (\w+\.\s){1,2} pattern contains a repeated capturing group, and Python re does not store every capture it finds; it only keeps the last one in the group's memory buffer. In any case, you do not need the repeated capturing group, because you want to extract multiple occurrences of the pattern from a string, and re.finditer or re.findall will do that for you.
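A minimal sketch of that behaviour, reusing the string from the question (the variable names are only illustrative):

```python
import re

test_str = 'ali. veli. ahmet.'

# The quantified group matches twice, but group(1) keeps only the last repetition.
m = re.search(r'(\w+\.\s){1,2}', test_str)
whole_match = m.group(0)
last_capture = m.group(1)
print(repr(whole_match))   # 'ali. veli. '
print(repr(last_capture))  # 'veli. '

# finditer resumes scanning after the end of the previous match,
# so its matches never overlap and 'veli. ' cannot start a second one.
spans = [mm.span() for mm in re.finditer(r'(\w+\.\s){1,2}', test_str)]
print(spans)  # [(0, 11)]
```

This is why the original code yields a single match: 'veli. ' is already consumed by the first match, and 'ahmet.' has no whitespace after it.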

Also, the re.MULTILINE flag is not necessary here, since there are no ^ or $ anchors in the pattern.
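As a quick check, the sketch below compares the question's pattern with and without the flag, then shows a case where the flag does matter. The two-line sample string is an assumption made up for the illustration:

```python
import re

pattern = r'(\w+\.\s){1,2}'
test_str = 'ali. veli. ahmet.'

# No ^ or $ in the pattern, so the flag changes nothing.
with_flag = [m.group(0) for m in re.finditer(pattern, test_str, re.MULTILINE)]
without_flag = [m.group(0) for m in re.finditer(pattern, test_str)]
print(with_flag == without_flag)  # True

# With an anchor, re.MULTILINE lets ^ match at the start of every line.
two_lines = 'ali\nveli'  # hypothetical sample text
first_line_only = re.findall(r'^\w+', two_lines)
every_line = re.findall(r'^\w+', two_lines, re.MULTILINE)
print(first_line_only)  # ['ali']
print(every_line)       # ['ali', 'veli']
```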

You may get the expected results by wrapping the pattern in a lookahead, which consumes no text and therefore lets the captured substrings overlap:

import re
test_str = 'ali. veli. ahmet.'
src = re.findall(r'(?=\b(\w+\.\s+\w+))', test_str)
print(src)
# => ['ali. veli', 'veli. ahmet']

See the Python demo.
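If the positions of the overlapping substrings are needed as well, re.finditer can be used with the same pattern. The following is only a sketch: the overall match of a lookahead is empty, so the useful offsets come from group 1.

```python
import re

test_str = 'ali. veli. ahmet.'
pattern = re.compile(r'(?=\b(\w+\.\s+\w+))')

overall_spans = []
group_spans = []
for m in pattern.finditer(test_str):
    overall_spans.append(m.span())   # zero-width: the lookahead consumes nothing
    group_spans.append(m.span(1))    # where the captured text actually sits
    print(m.span(), m.span(1), m.group(1))
# (0, 0) (0, 9) ali. veli
# (5, 5) (5, 16) veli. ahmet
```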

The pattern means:

  • (?= - start of a positive lookahead
    • \b - a word boundary (crucial here: capturing must only start at word boundaries)
    • (\w+\.\s+\w+) - capturing group 1: one or more word chars, a ., one or more whitespace chars, and one or more word chars
  • ) - end of the lookahead.
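To see why the word boundary matters, here is a small comparison of the pattern with and without \b on the same string (the variable names are illustrative):

```python
import re

test_str = 'ali. veli. ahmet.'

# With \b, the lookahead can only succeed at the start of a word.
with_boundary = re.findall(r'(?=\b(\w+\.\s+\w+))', test_str)
print(with_boundary)
# ['ali. veli', 'veli. ahmet']

# Without \b, it also succeeds in the middle of each word,
# so every suffix of the first word is reported too.
without_boundary = re.findall(r'(?=(\w+\.\s+\w+))', test_str)
print(without_boundary)
# ['ali. veli', 'li. veli', 'i. veli',
#  'veli. ahmet', 'eli. ahmet', 'li. ahmet', 'i. ahmet']
```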
