Why does lazy regex capture extra words?
I am using the following lazy regex to find the word before and after "=". I am not sure why it is capturing extra words:
r'\s+(.*?)\s+\=\s+(.*?)\s+'
The text is in this format:
my name = jil
part = #2
So I want to capture name = jil.
Am I doing something wrong here, or can I do it in a different manner?
Note: before and after "=" we can have special characters.
You're looking for:
(\S+)\s*\=\s*(\S+)
\S matches non-whitespace, and will allow for ./\#@&, etc. in the capture group.
\w matches only word characters, so with \w+ in place of \S+ this matches the last word before an equals sign and the first word after it. If you change the \s+ to \s*, it matches with or without whitespace around the =.
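As a quick check of the suggested pattern, here is a minimal Python sketch run against the two sample lines from the question. The variable names and the trailing newline in the sample text are assumptions made for illustration; the \w+ variant is included only to show why \S+ is the better fit when special characters such as # can appear.

```python
import re

# Sample text from the question (the trailing newline is an assumption).
text = "my name = jil\npart = #2\n"

# \S+ grabs the run of non-whitespace characters on each side of the "=".
pattern = re.compile(r"(\S+)\s*=\s*(\S+)")
pairs = pattern.findall(text)
print(pairs)  # [('name', 'jil'), ('part', '#2')]

# \w+ only accepts word characters, so the "#2" value is never captured.
word_pairs = re.findall(r"(\w+)\s*=\s*(\w+)", text)
print(word_pairs)  # [('name', 'jil')]
```

Because \s* allows zero whitespace, the same pattern should also handle input such as name=jil.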
Your pattern doesn't work because the regex is parsed left to right: when it finds any amount of whitespace (\s+), it begins sucking in all characters (.*?) until it finds a " =". So it will match the whole line before the " =", starting after the first whitespace character.
The lazy evaluation doesn't go back to find the smallest set it can; it just goes until it reaches the first complete match and stops:
dog dog dog dog = cat cat cat cat
A lazy capture of \s+(.*?)\s+= gives us: dog dog dog, because that's the first match it got: starting from the " " after the first dog and ending at the first " =" it finds.
The second group does what you expect, because it doesn't have the extra requirement that it ends on a space followed by an equals sign.
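The first-group behaviour on the dog/cat line can be reproduced with a short Python sketch. The variable names are illustrative only; the pattern and the input line are the ones quoted in this answer.

```python
import re

line = "dog dog dog dog = cat cat cat cat"

# The search begins at the leftmost position where \s+ can match, which is
# the space after the first "dog". From there the lazy group grows one
# character at a time until "\s+=" can follow it.
m = re.search(r"\s+(.*?)\s+=", line)
first_group = m.group(1)
match_start = m.start()

print(first_group)  # dog dog dog
print(match_start)  # 3
```

The start index is what matters here: laziness only shortens the match from the right, so it never moves the starting point closer to the "=".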
After the =, the lazy quantifier will limit it to only the first word, as that is the first point at which it gets a match. A greedy version would continue sucking in characters and find the longest string which ends in \s+.
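To compare the lazy and greedy forms of the second group, here is a small Python sketch on the same dog/cat line. The two sub-patterns are cut down from the question's regex for illustration and are not taken verbatim from the original post.

```python
import re

line = "dog dog dog dog = cat cat cat cat"

# Lazy: stops at the first point where a following \s+ can match.
lazy = re.search(r"=\s+(.*?)\s+", line).group(1)

# Greedy: runs to the end of the line, then backtracks to the last place
# where \s+ can still match, so only the final word is left out.
greedy = re.search(r"=\s+(.*)\s+", line).group(1)

print(lazy)    # cat
print(greedy)  # cat cat cat
```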
tl;dr: lazy evaluation won't go back to find the smallest match; it will grab the first match when parsing from left to right.
d+?og will match ddddddog in its entirety, as it needed to gobble all the other d's to match the first d with the og, and it's too lazy to go back and see if it really needed to eat all those extra characters.
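The d+?og case is easy to verify in Python. This is a minimal sketch with illustrative variable names; it checks only that the match spans the whole input even though the quantifier is lazy.

```python
import re

subject = "ddddddog"

# The match is anchored at the leftmost "d", so the lazy d+? must keep
# extending until "og" can follow. It never retries from a later start.
m = re.search(r"d+?og", subject)
matched = m.group(0)
span = m.span()

print(matched)  # ddddddog
print(span == (0, len(subject)))  # True
```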