Why can't I match the last part of my regular expression in Python?

I want to match a sentence with an optional ending 'other (\\w+)'. For example, the regular expression should match both of the following sentences and, when the ending is present, extract the word 'things':

  • The apple and other things.
  • The apple is big.

I wrote the regular expression below, but the result is (None,). If I remove the final ?, I get the right answer. Why?

>>> re.search('\w+(?: other (\\w+))?', 'A and other things').groups()
(None,)
>>> re.search('\w+(?: other (\\w+))', 'A and other things').groups()
('things',)

If you use:

re.search(r'\w+(?: other (\w+))?', 'A and other things').group()

You will see what is happening. Since everything after the first \w+ is optional, your search matches just the first word, 'A'.
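To make this concrete, here is a minimal sketch that inspects the match object from the question's pattern. The variable names are only illustrative; the calls are the standard re module's group(), span() and groups().

```python
import re

# Pattern from the question, written as a raw string.
pattern = r'\w+(?: other (\w+))?'
m = re.search(pattern, 'A and other things')

matched_text = m.group()   # the whole match
matched_span = m.span()    # (start, end) offsets of the whole match
captured = m.groups()      # one entry per capturing group

print(matched_text)   # A
print(matched_span)   # (0, 1)
print(captured)       # (None,)
```

The match ends after the first character, so the optional group never had a chance to capture 'things'.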

As per the official documentation:

.groups()

Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern.

In your search call the only capturing group takes no part in the match, so its slot in the tuple is None:

>>> re.search(r'\w+(?: other (\w+))?', 'A and other things').groups()
(None,)
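As a small illustration of why the tuple still has one element, the sketch below checks how many groups the pattern defines and whether any of them took part in the match. It relies only on the documented Pattern.groups and Match.lastindex attributes and the default argument of Match.groups(); the variable names are made up for this example.

```python
import re

compiled = re.compile(r'\w+(?: other (\w+))?')
m = compiled.search('A and other things')

group_count = compiled.groups          # number of capturing groups in the pattern
last_index = m.lastindex               # None when no group took part in the match
with_default = m.groups(default='')    # substitute '' for groups that did not match

print(group_count)    # 1
print(last_index)     # None
print(with_default)   # ('',)
```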

To solve your problem you can use this alternation-based regex:

r'\w+(?: other (\w+)|$)'

Examples:

>>> re.search(r'\w+(?: other (\w+)|$)', 'A and other things').group()
'and other things'
>>> re.search(r'\w+(?: other (\w+)|$)', 'The apple is big').group()
'big'
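To pull out the captured word rather than the whole match, one option is to read group 1. The following is a sketch only: PATTERN and extract_other are hypothetical names introduced here, not part of the original answer.

```python
import re

# Alternation-based pattern from the answer above.
PATTERN = re.compile(r'\w+(?: other (\w+)|$)')

def extract_other(sentence):
    """Return the word after ' other ', or None if there is none.

    Hypothetical helper for illustration.
    """
    m = PATTERN.search(sentence)
    return m.group(1) if m else None

print(extract_other('A and other things'))           # things
print(extract_other('The apple and other things.'))  # things
print(extract_other('The apple is big'))             # None
```

Note that a sentence such as 'The apple is big.' (with a trailing period and no 'other') produces no match at all, because $ cannot match directly after a word there; the helper returns None in that case as well.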

A regular expression search returns the leftmost match. Greedy quantifiers try to make that match longer where they can, but most importantly, once the engine finds the first position at which the pattern succeeds, it stops looking further.

In the first regular expression, the leftmost point where \w+ matches is 'A'. The optional portion doesn't match there, but because it is optional the pattern as a whole still succeeds, so the search is done.

In the second regular expression, the parenthesized expression is mandatory, so 'A' is not a match. Therefore, the engine continues looking. The first \w+ matches 'and', then the second \w+ matches 'things'.
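The difference between the two patterns can be observed directly by printing where each match starts. This is an illustrative sketch; the variable names are assumptions, and re.finditer is used only to show which later matches the optional pattern would also find if the search kept going.

```python
import re

text = 'A and other things'

optional = re.search(r'\w+(?: other (\w+))?', text)
mandatory = re.search(r'\w+(?: other (\w+))', text)

print(optional.start(), repr(optional.group()))    # 0 'A'
print(mandatory.start(), repr(mandatory.group()))  # 2 'and other things'

# search() stops at the first match; finditer() keeps scanning after it.
all_matches = [(m.group(), m.group(1))
               for m in re.finditer(r'\w+(?: other (\w+))?', text)]
print(all_matches)  # [('A', None), ('and other things', 'things')]
```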


Note that for regular expressions in Python, especially those containing backslashes, it's a good idea to write them as raw strings, r'like this'.
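A short sketch of why raw strings matter: in an ordinary string literal, '\b' is the backspace character, while in a raw string it reaches the regex engine as the word-boundary escape. The sample text and variable names below are chosen only for illustration.

```python
import re

text = 'A and other things'

# An escaped backslash in a normal literal equals the raw-string form.
same_pattern = ('\\w+' == r'\w+')

# '\b' is a single backspace character; r'\b' is backslash followed by 'b'.
plain_len = len('\b')
raw_len = len(r'\b')

# Without the r prefix the pattern looks for literal backspaces and finds none.
no_match = re.search('\bother\b', text)
word_match = re.search(r'\bother\b', text)

print(same_pattern)         # True
print(plain_len, raw_len)   # 1 2
print(no_match)             # None
print(word_match.group())   # other
```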
