Why does this Python re only capture one digit?
I'm trying to use Python's re module to capture specific digits, such as '03', from strings like ' video [720P] [DHR] _sp03.mp4 '.
What confuses me is this: when I use '.*\D+(\d+).*mp4', it captures both digits, 03, but when I use '.*\D*(\d+).*mp4', it captures only the last digit, 3.
I know Python's regex quantifiers are greedy by default, which means they try to match as much text as possible. Given that, I thought * and + after the \D should behave the same way. So where am I wrong? What leads to this difference? Can anyone help explain it?
BTW: I used an online regex tester for Python: https://regex101.com/#python
What makes the difference is not so much the \D+ as the first .*
In a regex, .* is greedy: it tries to match as many characters as it can, and it gives characters back only when the rest of the pattern would otherwise fail to match.
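As a minimal sketch of that greedy behaviour (the short input 'sp03' and the extra capture group around .* are illustrative assumptions, not part of the original pattern), the following shows .* taking everything except the single digit that \d+ needs:

```python
import re

# (.*) first grabs the whole string, then backtracks one character at a
# time until (\d+) can match at least one digit.
m = re.match(r'(.*)(\d+)', 'sp03')
greedy_groups = m.groups()
print(greedy_groups)  # ('sp0', '3')
```

The 0 ends up inside the greedy group because \d+ is satisfied with a single digit.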
So when you write
.*\D*(\d+).*mp4
the .* will match as much as it can while still letting the rest of the pattern match. Broken down step by step, it looks like this:
video [720P] [DHR] _sp03.mp4
|
.*
video [720P] [DHR] _sp03.mp4
 |
 .*
.....
video [720P] [DHR] _sp03.mp4
                      |
                      .* That is, the 0 is also matched by the .*
video [720P] [DHR] _sp03.mp4
                       |
                       \D* Since the quantifier is zero or more, it matches nothing here and does not consume the 3
video [720P] [DHR] _sp03.mp4
                       |
                       (\d+)
video [720P] [DHR] _sp03.mp4
                        |
                        .*
video [720P] [DHR] _sp03.mp4
                         |
                         mp4
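The walkthrough above can be checked with a short sketch. The extra capture groups around .* and \D* are added here only to make each part visible; they are not in the original pattern, and the input is the file name without surrounding spaces, as in the diagram.

```python
import re

s = 'video [720P] [DHR] _sp03.mp4'

# Same pattern as in the question, with every part wrapped in a group
# so that the text and span of each part can be inspected.
m = re.match(r'(.*)(\D*)(\d+)(.*)mp4', s)
spans = [(m.group(i), m.span(i)) for i in range(1, 5)]
for text, span in spans:
    print(repr(text), span)
# 'video [720P] [DHR] _sp0' (0, 23)
# '' (23, 23)
# '3' (23, 24)
# '.' (24, 25)
```

The empty second group confirms that \D* matched nothing, which is why only the 3 is left for (\d+).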
Now when we use \D+, the matching changes a bit, because the regex engine is forced to match at least one non-digit (\D+) before the digits ((\d+)). This consumes the p, which is the last non-digit before the digits. That is, .* matches as much as it can up to, but not including, the p, so that \D+ can match at least one non-digit (the p), and \d+ then matches the 03 part.
video [720P] [DHR] _sp03.mp4
|
.*
video [720P] [DHR] _sp03.mp4
 |
 .*
.....
video [720P] [DHR] _sp03.mp4
                     |
                     \D+ The last non-digit before the digits. Forced to match at least once.
video [720P] [DHR] _sp03.mp4
                      |
                      (\d+)
video [720P] [DHR] _sp03.mp4
                       |
                       (\d+)
video [720P] [DHR] _sp03.mp4
                        |
                        .*
video [720P] [DHR] _sp03.mp4
                         |
                         mp4
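The "forced to match at least once" point can be seen in isolation with a small sketch (the short inputs '03' and 'p03' are illustrative assumptions): at a position where the next character is a digit, \D* succeeds with an empty match, while \D+ cannot match at all.

```python
import re

# \D* may match zero characters, so it succeeds with an empty match.
zero_or_more = re.match(r'\D*', '03')
# \D+ needs at least one non-digit, so it fails when a digit comes first.
one_or_more = re.match(r'\D+', '03')
# With a non-digit in front, \D+ matches it.
with_prefix = re.match(r'\D+', 'p03')

print(repr(zero_or_more.group()))  # ''
print(one_or_more)                 # None
print(with_prefix.group())         # p
```

Because \D+ cannot match where the 0 sits, the greedy .* has to stop one character earlier and leave the p for it.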
The problem is with \D*. The '+' means one or more and the '*' means zero or more.
Because the pattern starts with '.*', which is greedy, it consumes everything up to ' video [720P] [DHR] _sp0' (\D* is satisfied with matching nothing), whereas in the '\D+' case it stops at ' video [720P] [DHR] _s', leaving 'p' for \D+.
>>> import re
>>> a = " video [720P] [DHR] _sp03.mp4 "
>>> p1 = re.compile(r'.*\D+(\d+).*mp4')
>>> p2 = re.compile(r'.*\D*(\d+).*mp4')
>>> re.findall(p1,a)
['03']
>>> re.findall(p2,a)
['3']
>>> a
' video [720P] [DHR] _sp03.mp4 '
>>> p3 = re.compile(r'(.*)(\D*)(\d+)(.*)mp4')
>>> re.findall(p3,a)
[(' video [720P] [DHR] _sp0', '', '3', '.')]
>>> p4 = re.compile(r'(.*)(\D+)(\d+)(.*)mp4')
>>> re.findall(p4,a)
[(' video [720P] [DHR] _s', 'p', '03', '.')]
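If the goal is simply to get the number, one possible way to sidestep the greedy prefix is to anchor on the digits themselves. This is only a sketch and not something the answers above propose; it assumes the wanted digits always sit directly before '.mp4'.

```python
import re

a = " video [720P] [DHR] _sp03.mp4 "

# Assumption: the digits of interest immediately precede ".mp4".
# With no leading .* there is nothing greedy to swallow the first digit.
m = re.search(r'(\d+)\.mp4', a)
digits = m.group(1) if m else None
print(digits)  # 03
```

Here re.search skips over 720, because those digits are not followed by '.mp4', and settles on 03.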