
Python regex: positive lookahead

import re

str='filename=1817616353&realname=Arguments%20for%20&%20against%20protection%20.pdf&code2=pds'
ptn='(?<=realname=).+(?=&)'
re.search(ptn,str).group()

Well, when I run this code I'm expecting to get

'Arguments%20for%20'

as the match, but instead it gives me

'Arguments%20for%20&%20against%20protection%20.pdf'

I thought the match should end at the first occurrence of '&', which is right after the 'for%20' part, so I have no idea why it goes all the way to 'pdf'. What am I doing wrong?

Your assumption that the match would stop at the first occurrence of & is fundamentally wrong.

.+ means match as many as possible of any character (except newline). This causes anything after it to be matched at the last possible position, so the lookahead (?=&) is satisfied at the last & in the string rather than the first.
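A minimal sketch of the greedy behaviour described above, using the string from the question (the names `query` and `greedy` are chosen here for illustration only):

```python
import re

query = 'filename=1817616353&realname=Arguments%20for%20&%20against%20protection%20.pdf&code2=pds'

# Greedy .+ first consumes everything up to the end of the string, then
# backtracks only as far as needed for the lookahead (?=&) to succeed,
# so the match ends just before the LAST '&'.
greedy = re.search(r'(?<=realname=).+(?=&)', query).group()
print(greedy)  # Arguments%20for%20&%20against%20protection%20.pdf
```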

A common fix for "I want as few as possible" is to use a non-greedy quantifier, .+?, which means match as few as possible, but it could still end up matching things you don't want.
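To illustrate both halves of that sentence, here is a small sketch on the question's string. The second pattern, with `&code2` in the lookahead, is a made-up variation that is not in the question; it is only there to show how a non-greedy quantifier can still run past an &.

```python
import re

query = 'filename=1817616353&realname=Arguments%20for%20&%20against%20protection%20.pdf&code2=pds'

# Non-greedy .+? grows one character at a time and stops at the first
# position where the rest of the pattern (here, the lookahead) matches.
lazy = re.search(r'(?<=realname=).+?(?=&)', query).group()
print(lazy)  # Arguments%20for%20

# If the rest of the pattern cannot match at the first '&', .+? simply
# keeps expanding and runs straight past it.
overrun = re.search(r'(?<=realname=).+?(?=&code2)', query).group()
print(overrun)  # Arguments%20for%20&%20against%20protection%20.pdf
```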

If you really mean "match the first possible &", then the expression you should repeat before it is "anything except &".

ptn=r'(?<=realname=)[^&]+(?=&)'

(Notice also the use of an r'...' string. It doesn't make any difference here, but it's another common newbie error: you want backslashes in your regex and don't understand why Python is losing them.)
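As a quick sketch of the raw-string point (the `\b` pattern and the sample text below are illustrative and not taken from the question):

```python
import re

# In a normal string literal, '\b' is the backspace character (\x08),
# so the regex engine never sees a word-boundary assertion.
plain = re.search('\bfor', 'Arguments for')
print(plain)  # None

# In a raw string the backslash survives, and \b means "word boundary".
raw = re.search(r'\bfor', 'Arguments for')
print(raw.group())  # for

# The two literals are not even the same length.
print(len('\b'), len(r'\b'))  # 1 2
```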

This is basically a restatement of the other answer on this page, but hopefully easier for a beginner to digest.

Use a negated character class instead of .+ :

In [5]: ptn='(?<=realname=)[^&]+(?=&)'

In [6]: re.search(ptn,str).group()
Out[6]: 'Arguments%20for%20'
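If the same extraction is needed for more than one parameter, the negated character class can be wrapped in a small helper. This is only a sketch: `extract_param` is a hypothetical name, and it uses a capture group instead of the lookarounds above so that the parameter name can be passed in.

```python
import re

def extract_param(query, name):
    """Return the raw value of `name` up to the next '&', or None if absent."""
    # re.escape guards against regex metacharacters in the parameter name.
    # (?:^|&) anchors the name at a parameter boundary; [^&]* stops at the first '&'.
    match = re.search(r'(?:^|&)' + re.escape(name) + r'=([^&]*)', query)
    return match.group(1) if match else None

query = 'filename=1817616353&realname=Arguments%20for%20&%20against%20protection%20.pdf&code2=pds'
print(extract_param(query, 'realname'))  # Arguments%20for%20
print(extract_param(query, 'code2'))     # pds
print(extract_param(query, 'missing'))   # None
```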

You can also use a non-greedy quantifier by adding ? after .+ , but using a negated character class will give you better performance in this case:

In [7]: ptn='(?<=realname=).+?(?=&)'

In [9]: %timeit re.search(ptn,str).group()
1000000 loops, best of 3: 1.46 us per loop

In [10]: ptn='(?<=realname=)[^&]+(?=&)'

In [11]: %timeit re.search(ptn,str).group()
1000000 loops, best of 3: 1.18 us per loop
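Outside IPython, a roughly equivalent comparison can be sketched with the standard `timeit` module. The patterns are precompiled here, which differs slightly from the session above, and the absolute numbers depend on the machine and Python version, so no timings are shown.

```python
import re
import timeit

query = 'filename=1817616353&realname=Arguments%20for%20&%20against%20protection%20.pdf&code2=pds'

lazy = re.compile(r'(?<=realname=).+?(?=&)')
negated = re.compile(r'(?<=realname=)[^&]+(?=&)')

# Both patterns must agree on the result before their speed is compared.
assert lazy.search(query).group() == negated.search(query).group()

for label, pattern in (('lazy', lazy), ('negated', negated)):
    # timeit runs the callable `number` times and returns total seconds.
    seconds = timeit.timeit(lambda: pattern.search(query).group(), number=100_000)
    print(f'{label}: {seconds:.4f} s for 100,000 searches')
```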

For more info, read the following post regarding the difference between non-greedy quantifiers and negated character classes: Which would be better non-greedy regex or negated character class?


 