How to find all matches with a regex where part of the match overlaps
I have a long.txt file. I want to find all the matches of a regex in it.
For example:
import re
test_str = 'ali. veli. ahmet.'
src = re.finditer(r'(\w+\.\s){1,2}', test_str, re.MULTILINE)
print(*src)
This code returns:
<re.Match object; span=(0, 11), match='ali. veli. '>
I need:
['ali. veli', 'veli. ahmet.']
How can I do that with a regex?
The (\w+\.\s){1,2} pattern contains a repeated capturing group, and Python re does not store all the captures it finds; it only saves the last one into the group memory buffer. At any rate, you do not need the repeated capturing group because you need to extract multiple occurrences of the pattern from a string, and re.finditer or re.findall will do that for you.
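To make the point about the repeated capturing group concrete, here is a minimal sketch using the question's own string and pattern. The variable names (pattern, whole_match, last_capture, findall_result) are only illustrative.

```python
import re

test_str = 'ali. veli. ahmet.'
pattern = r'(\w+\.\s){1,2}'  # the pattern from the question

m = re.search(pattern, test_str)
whole_match = m.group(0)   # everything the pattern consumed
last_capture = m.group(1)  # only the final repetition of the group is kept

print(repr(whole_match))   # 'ali. veli. '
print(repr(last_capture))  # 'veli. '

# With exactly one group in the pattern, re.findall returns that group,
# so it exposes the same loss of the earlier repetition.
findall_result = re.findall(pattern, test_str)
print(findall_result)      # ['veli. ']
```

The first repetition ('ali. ') is part of the overall match but is not retrievable from group 1.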
Also, the re.MULTILINE flag is not necessary here since there are no ^ or $ anchors in the pattern.
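A small sketch of why the flag makes no difference for this pattern: re.MULTILINE only changes where ^ and $ may match. The two-line sample string below is an assumption added purely for contrast.

```python
import re

test_str = 'ali. veli. ahmet.'
pattern = r'(\w+\.\s){1,2}'

# No ^ or $ in the pattern, so the flag changes nothing.
spans_with_flag = [m.span() for m in re.finditer(pattern, test_str, re.MULTILINE)]
spans_without_flag = [m.span() for m in re.finditer(pattern, test_str)]
print(spans_with_flag == spans_without_flag)  # True

# Contrast: an anchored pattern on a hypothetical two-line string.
two_lines = 'ali.\nveli.'
anchored_default = re.findall(r'^\w+', two_lines)                 # ^ matches at string start only
anchored_multiline = re.findall(r'^\w+', two_lines, re.MULTILINE)  # ^ matches at each line start
print(anchored_default)    # ['ali']
print(anchored_multiline)  # ['ali', 'veli']
```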
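You can get the expected results by wrapping the capturing group in a lookahead. A lookahead does not consume text, so the next match attempt can start inside the previous capture, which is what allows overlapping results: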
import re
test_str = 'ali. veli. ahmet.'
src = re.findall(r'(?=\b(\w+\.\s+\w+))', test_str)
print(src)
# => ['ali. veli', 'veli. ahmet']
See the Python demo.
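If the positions of the overlapping matches are also needed, the same pattern can be used with re.finditer. This is only a sketch; the name results is illustrative. Note that the overall match is empty (the lookahead is zero-width), so the text and its span come from group 1.

```python
import re

test_str = 'ali. veli. ahmet.'
pattern = r'(?=\b(\w+\.\s+\w+))'

# Collect (captured text, span of group 1) for every match position.
results = [(m.group(1), m.span(1)) for m in re.finditer(pattern, test_str)]

for text, span in results:
    print(text, span)
# ali. veli (0, 9)
# veli. ahmet (5, 16)
```

The two spans overlap on positions 5 to 8 ('veli'), which an ordinary consuming pattern could not return.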
The pattern means:
(?= - start of a positive lookahead
\b - a word boundary (crucial here: capturing must only start at word boundaries)
(\w+\.\s+\w+) - Capturing group 1: 1+ word chars, a . character, 1+ whitespaces and 1+ word chars
) - end of the lookahead.
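To see why the word boundary matters, compare the pattern with and without \b on the same string. This is a sketch with illustrative variable names; without \b, the lookahead also succeeds in the middle of each word.

```python
import re

test_str = 'ali. veli. ahmet.'

with_boundary = re.findall(r'(?=\b(\w+\.\s+\w+))', test_str)
without_boundary = re.findall(r'(?=(\w+\.\s+\w+))', test_str)

print(with_boundary)
# ['ali. veli', 'veli. ahmet']
print(without_boundary)
# ['ali. veli', 'li. veli', 'i. veli',
#  'veli. ahmet', 'eli. ahmet', 'li. ahmet', 'i. ahmet']
```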