Python regex: positive lookahead
import re
str='filename=1817616353&realname=Arguments%20for%20&%20against%20protection%20.pdf&code2=pds'
ptn='(?<=realname=).+(?=&)'
re.search(ptn,str).group()
Well, when I run this code I'm expecting to get
'Arguments%20for%20'
as the match, but instead it gives me
'Arguments%20for%20&%20against%20protection%20.pdf'
I thought the match should end at the first occurrence of '&', which is right after the 'for%20' part, so I have no idea why it goes all the way to 'pdf'. What am I doing wrong?
Your assumption that the match would stop at the first occurrence of & is fundamentally wrong. .+ means match as many as possible of any character (except newline). This causes whatever follows it in the pattern to be matched at the last possible position.
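To make the greedy behaviour concrete, here is a minimal sketch using the input from the question. The variable names (`s`, `greedy`, `start`, `last_amp`) are chosen for illustration only; `s` is used instead of `str` to avoid shadowing the built-in.

```python
import re

# Same input as in the question.
s = 'filename=1817616353&realname=Arguments%20for%20&%20against%20protection%20.pdf&code2=pds'

greedy = re.search(r'(?<=realname=).+(?=&)', s).group()

# .+ first consumes everything to the end of the string, then backtracks
# until the lookahead (?=&) succeeds, which happens at the LAST '&'.
start = s.find('realname=') + len('realname=')
last_amp = s.rfind('&')

print(greedy)                       # Arguments%20for%20&%20against%20protection%20.pdf
print(greedy == s[start:last_amp])  # True
```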
A common fix for "I want as few as possible" is to use a non-greedy quantifier, .+?, which means match as few as possible, but it could still end up matching things you don't want.
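One way the non-greedy quantifier can still over-match is sketched below. The input string and the `(?=&code2)` lookahead are hypothetical, made up for illustration: when the rest of the pattern cannot match at the first &, the engine simply keeps extending .+? across it.

```python
import re

# Hypothetical input: the wanted value is 'a', followed by other parameters.
s = 'realname=a&other=b&code2=pds'

# .+? tries the shortest match first, but the lookahead (?=&code2) fails
# at the first '&', so the match is extended past it.
lazy = re.search(r'(?<=realname=).+?(?=&code2)', s)

print(lazy.group())  # a&other=b
```

"As few as possible" only means as few as the rest of the pattern allows, so .+? by itself does not guarantee the match stops at the first &.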
If you really mean "match up to the first possible &", then the expression you should repeat before it is "anything except &".
ptn=r'(?<=realname=)[^&]+(?=&)'
(Notice also the use of an r'...' string. It doesn't make any difference here, but forgetting it is another common newbie error: you want backslashes in your regex and don't understand why Python is losing them.)
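A short sketch of how a missing r prefix "loses" a backslash. The sample text and the \b word-boundary pattern are illustrative assumptions and are not part of the original question.

```python
import re

text = 'a foo b'

# In a normal string literal, '\b' is a single backspace character (\x08),
# so the regex engine never sees the word-boundary escape.
print(len('\b'), len(r'\b'))                # 1 2
print(re.search('\bfoo\b', text))           # None
print(re.search(r'\bfoo\b', text).group())  # foo
```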
This is basically a restatement of the other answer on this page, but hopefully easier for a beginner to digest.
Use a negated character class instead of .+:
In [5]: ptn='(?<=realname=)[^&]+(?=&)'
In [6]: re.search(ptn,str).group()
Out[6]: 'Arguments%20for%20'
Although you can use a non-greedy quantifier by adding ? after .+, using a negated character class will give you better performance in this case:
In [7]: ptn='(?<=realname=).+?(?=&)'
In [9]: %timeit re.search(ptn,str).group()
1000000 loops, best of 3: 1.46 us per loop
In [10]: ptn='(?<=realname=)[^&]+(?=&)'
In [11]: %timeit re.search(ptn,str).group()
1000000 loops, best of 3: 1.18 us per loop
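Outside IPython, a comparable measurement can be sketched with the standard timeit module. The names `lazy` and `negated` and the loop count are arbitrary choices for illustration; absolute timings depend on the machine and Python version, so none are shown here.

```python
import re
import timeit

s = 'filename=1817616353&realname=Arguments%20for%20&%20against%20protection%20.pdf&code2=pds'

lazy = re.compile(r'(?<=realname=).+?(?=&)')
negated = re.compile(r'(?<=realname=)[^&]+(?=&)')

# Both patterns return the same match on this input.
assert lazy.search(s).group() == negated.search(s).group() == 'Arguments%20for%20'

for name, pat in (('lazy', lazy), ('negated', negated)):
    seconds = timeit.timeit(lambda: pat.search(s).group(), number=100_000)
    print(f'{name}: {seconds:.3f} s for 100,000 searches')
```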
For more info, read the following post regarding the difference between non-greedy quantifiers and negated character classes: Which would be better non-greedy regex or negated character class?