Why does this Python re only capture one digit?

I'm trying to use Python's re module to capture specific digits from strings, for example the 03 in ' video [720P] [DHR] _sp03.mp4 '.

What confuses me is this:

When I use '.*\D+(\d+).*mp4', it captures both digits, 03, but when I use '.*\D*(\d+).*mp4', it captures only the last digit, 3.

I know Python's regex quantifiers are greedy by default, which means they try to match as much text as possible. Given that, I expected * and + after the \D to behave the same way. So where am I wrong? What leads to this difference? Can anyone help explain it?

BTW: I used the online regex tester for Python: https://regex101.com/#python

What makes the difference is not the \D+ but the first .*

In a regex, .* is greedy and tries to match as many characters as it can.

So when you write

.*\D*(\d+).*mp4

The .* will match as much as it can. If we break the match down step by step, it looks like this:

video [720P] [DHR] _sp03.mp4
|
.*

video [720P] [DHR] _sp03.mp4
 |
 .*
.....

video [720P] [DHR] _sp03.mp4
                      |
                      .* (the 0 is also consumed by the .*)

video [720P] [DHR] _sp03.mp4
                      |
                      \D* Since the quantifier allows zero or more, it matches the empty string here, leaving the 3 for (\d+)

video [720P] [DHR] _sp03.mp4
                       |
                      (\d+)

video [720P] [DHR] _sp03.mp4
                        |
                        .*

video [720P] [DHR] _sp03.mp4
                          |
                         mp4
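The trace above can be checked with a small sketch. It wraps each piece of the pattern in its own group (the extra groups are added here only for inspection and are not part of the original pattern) and prints the span each piece ends up with. The variable names are illustrative.

```python
import re

text = " video [720P] [DHR] _sp03.mp4 "

# Same pattern as above, with every piece wrapped in a group
# so that the text each piece consumed can be inspected.
pattern = re.compile(r"(.*)(\D*)(\d+)(.*)mp4")
m = pattern.search(text)

# Print the (start, end) span and the matched text of each group.
for index in range(1, 5):
    print(index, m.span(index), repr(m.group(index)))
```

The first group ends right after the 0, the \D* group is empty, and only the 3 is left for (\d+).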

Now when we use \D+, the matching changes a bit, because the regex engine is forced to match at least one non-digit (\D+) before the digits ((\d+)). This consumes the p, which is the last non-digit before the digits.

That is,

.* matches as much as it can up to, but not including, the p, so that \D+ can match at least one non-digit (the p) and \d+ then matches the 03 part.

video [720P] [DHR] _sp03.mp4
|
.*

video [720P] [DHR] _sp03.mp4
 |
 .*
.....

video [720P] [DHR] _sp03.mp4
                     |
                     \D+ The last non-digit before the digits (p). Forced to match at least once.

video [720P] [DHR] _sp03.mp4
                      |
                      (\d+) 

video [720P] [DHR] _sp03.mp4
                       |
                      (\d+)

video [720P] [DHR] _sp03.mp4
                        |
                        .*

video [720P] [DHR] _sp03.mp4
                          |
                         mp4
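One way to convince yourself that the greedy leading .* is what drives the result is to compare it with a lazy .*? on the same input. This is only an illustrative sketch, not a recommended fix: with the lazy version the engine stops at the first run of digits it can use, so the capture changes again.

```python
import re

text = " video [720P] [DHR] _sp03.mp4 "

# Greedy leading .* : it grabs as much as possible, so (\d+) gets only the last digit.
greedy = re.search(r".*\D*(\d+).*mp4", text)

# Lazy leading .*? : it grabs as little as possible, so (\d+) gets the first digit run.
lazy = re.search(r".*?\D*(\d+).*mp4", text)

print(greedy.group(1))  # 3
print(lazy.group(1))    # 720
```

In both cases the \D* is unchanged; only the behaviour of the leading quantifier differs.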

The problem is with \D*. The '+' means one or more, and '*' means zero or more.
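A minimal sketch of that difference in isolation, using re.fullmatch on an empty string and on a short non-digit string (the variable names are only for illustration):

```python
import re

# \D* accepts zero non-digits, so it matches the empty string.
star_empty = re.fullmatch(r"\D*", "") is not None

# \D+ needs at least one non-digit, so the empty string does not match.
plus_empty = re.fullmatch(r"\D+", "") is not None

# Both quantifiers accept one or more non-digits.
star_text = re.fullmatch(r"\D*", "sp") is not None
plus_text = re.fullmatch(r"\D+", "sp") is not None

print(star_empty, plus_empty)  # True False
print(star_text, plus_text)    # True True
```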

Because the pattern starts with '.*', which is greedy, it consumes everything up to ' video [720P] [DHR] _sp0', whereas in the '\D+' case it stops at ' video [720P] [DHR] _s', leaving the 'p' for \D+.

>>> import re
>>> a = " video [720P] [DHR] _sp03.mp4 "
>>> # Raw strings keep \D and \d from being treated as string escapes.
>>> p1 = re.compile(r'.*\D+(\d+).*mp4')
>>> p2 = re.compile(r'.*\D*(\d+).*mp4')
>>> re.findall(p1,a)
['03']
>>> re.findall(p2,a)
['3']
>>> a
' video [720P] [DHR] _sp03.mp4 '
>>> # Group every piece to see what each part of the pattern consumed.
>>> p3 = re.compile(r'(.*)(\D*)(\d+)(.*)mp4')
>>> re.findall(p3,a)
[(' video [720P] [DHR] _sp0', '', '3', '.')]
>>> p4 = re.compile(r'(.*)(\D+)(\d+)(.*)mp4')
>>> re.findall(p4,a)
[(' video [720P] [DHR] _s', 'p', '03', '.')]
