Python regular expressions: positive lookahead

Question

import re

str='filename=1817616353&realname=Arguments%20for%20&%20against%20protection%20.pdf&code2=pds'
ptn='(?<=realname=).+(?=&)'
re.search(ptn,str).group()

Well, when I run this code I'm expecting to get

'Arguments%20for%20'

as the match, but instead it gives me

'Arguments%20for%20&%20against%20protection%20.pdf'

I thought the match should end at the first occurrence of '&', which is right after the 'for%20' part, so I have no idea why it's going all the way to 'pdf'. What am I doing wrong?

Answer 1

Your assumption that the first occurrence of & would match is fundamentally wrong.

.+ means match as many as possible of any character (except newline). This is why whatever follows it in the pattern, here the lookahead (?=&), ends up matching at the last possible position rather than the first.
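To make the greedy behaviour concrete, here is a small sketch using the string from the question (renamed to `s` so it does not shadow the built-in `str`). It shows where the match ends and what is left over after it.

```python
import re

s = 'filename=1817616353&realname=Arguments%20for%20&%20against%20protection%20.pdf&code2=pds'

# Greedy .+ first consumes the rest of the line, then gives back one
# character at a time until the lookahead (?=&) succeeds. The first
# success while backing up is the LAST '&' in the string.
greedy = re.search(r'(?<=realname=).+(?=&)', s)

print(greedy.group())    # Arguments%20for%20&%20against%20protection%20.pdf
print(s[greedy.end():])  # &code2=pds
```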

A common fix for "I want as few as possible" is to use a non-greedy (lazy) quantifier, .+?, which means match as few as possible, but it could still end up matching things you don't want.
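One way the lazy quantifier can still surprise you, sketched with a made-up input (`s2` is a hypothetical string, not from the question) in which the realname value is empty. Because .+? must consume at least one character, it swallows the '&' delimiter itself and runs on to the next one.

```python
import re

# Hypothetical input: realname has an empty value.
s2 = 'filename=1&realname=&code2=pds&x=1'

# .+? needs at least one character, so it takes '&' and keeps
# extending until the next position that is followed by '&'.
lazy = re.search(r'(?<=realname=).+?(?=&)', s2)
print(lazy.group())  # &code2=pds

# [^&]+ cannot consume '&' at all, so there is simply no match.
negated = re.search(r'(?<=realname=)[^&]+(?=&)', s2)
print(negated)       # None
```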

If you really mean "match up to the first possible &", then the expression you should repeat before it is "anything except &".

ptn=r'(?<=realname=)[^&]+(?=&)'

(Notice also the use of an r'...' string. It doesn't make any difference here, but it's another common newbie error: you want backslashes in your regex and don't understand why Python is losing them.)
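A short illustration of the backslash problem mentioned above. The \b word-boundary pattern and the sample text are chosen only for this example; they are not part of the question.

```python
import re

text = 'Arguments for protection'

# In a normal string literal Python converts '\b' into a backspace
# character before the regex engine ever sees it.
plain = '\bfor\b'
# In a raw string the backslash survives, so re sees \b (word boundary).
raw = r'\bfor\b'

print(len(plain), len(raw))           # 5 7
print(re.search(plain, text))         # None
print(re.search(raw, text).group())   # for
```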

This is basically a restatement of the other answer on this page, but hopefully easier for a beginner to digest.

Answer 2

Use a negated character class instead of .+ :

In [5]: ptn='(?<=realname=)[^&]+(?=&)'

In [6]: re.search(ptn,str).group()
Out[6]: 'Arguments%20for%20'

You can also use a non-greedy quantifier by adding ? after .+ , but a negated character class gives better performance in this case:

In [7]: ptn='(?<=realname=).+?(?=&)'

In [9]: %timeit re.search(ptn,str).group()
1000000 loops, best of 3: 1.46 us per loop

In [10]: ptn='(?<=realname=)[^&]+(?=&)'

In [11]: %timeit re.search(ptn,str).group()
1000000 loops, best of 3: 1.18 us per loop
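If you are not in IPython, a rough equivalent of the %timeit comparison can be sketched with the standard timeit module. The variable names and the iteration count here are arbitrary choices, and the absolute numbers will vary by machine and Python version, so treat any difference as indicative only.

```python
import re
import timeit

s = 'filename=1817616353&realname=Arguments%20for%20&%20against%20protection%20.pdf&code2=pds'

lazy = re.compile(r'(?<=realname=).+?(?=&)')
negated = re.compile(r'(?<=realname=)[^&]+(?=&)')

# Both patterns should agree on this input before we compare speed.
assert lazy.search(s).group() == negated.search(s).group()

runs = 100_000
for name, pat in (('lazy', lazy), ('negated', negated)):
    seconds = timeit.timeit(lambda: pat.search(s).group(), number=runs)
    print(f'{name}: {seconds / runs * 1e6:.2f} us per call')
```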

For more information on the difference between non-greedy quantifiers and negated character classes, see the post "Which would be better non-greedy regex or negated character class?"

Answer 1 · 1 · 2017-11-15 12:35:15

Answer 2 · 0 · accepted · 2017-11-14 12:32:13
