Python regular expression returning extra capture group for last character matched

I am trying to create a regular expression that breaks a string into three groups: (1) any one of a specific list of words at the beginning of the string; (2) any one of a specific list of words at the end of the string; (3) all of the letters and whitespace between these two matches.

As an example, I will use the following two strings:

'There was a cat in the house yesterday'
'Did you see a cat in the house today'

I would like each string to be broken up into capture groups so that m.groups() on the match object returns the following for each string, respectively:

('There', ' was a cat in the house ', 'yesterday')
('Did', ' you see a cat in the house ', 'today')

Originally, I came up with the following regex:

r = re.compile('^(There|Did) ( |[A-Za-z])+ (today|yesterday)$')

However, this returns:

('There', 'e', 'yesterday')
('Did', 'e', 'today')

So it's only giving me the last character matched in the middle group. I learned that this doesn't work because a repeated capture group only returns the last iteration that matched. So I put parentheses around the middle capture group as follows:

r = re.compile('^(There|Did) (( |[A-Za-z])+) (today|yesterday)$')

But now, although it does at least capture the middle group, it also returns an extra "e" character in m.groups(), i.e.:

('There', 'was a cat in the house', 'e', 'yesterday')

Although I feel like this has something to do with backtracking, I can't figure out why it is happening. Could someone please explain why I am getting this result, and how I can get the desired results?

You can simplify your current regex, and get the correct behavior, by replacing your middle capture group with the . (dot) operator, which matches any character (except a newline by default), followed by the * (asterisk) operator to match it repeatedly:

import re

s1 = 'There was a cat in the house yesterday'
s2 = 'Did you see a cat in the house today'

x = re.compile("(There|Did)(.*)(today|yesterday)")
g1 = x.search(s1).groups()
g2 = x.search(s2).groups()

print(g1)
print(g2)

This produces the following output:

('There', ' was a cat in the house ', 'yesterday')
('Did', ' you see a cat in the house ', 'today')
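One thing to keep in mind: the pattern above has no ^ or $ anchors, so search() will also accept strings where the keywords are not at the very beginning or end. A minimal sketch of the difference, assuming you do want the anchoring described in the question (the names loose and strict and the sample sentence are made up for illustration):

```python
import re

# Unanchored pattern from the answer above, and the same pattern with anchors.
loose = re.compile(r'(There|Did)(.*)(today|yesterday)')
strict = re.compile(r'^(There|Did)(.*)(today|yesterday)$')

# Hypothetical input where the keywords are NOT at the start/end of the string.
s = 'Oh, There was a cat yesterday, I think'

print(loose.search(s).groups())  # still finds a match inside the string
print(strict.search(s))          # None: the anchors reject this input

# The anchored version still handles the original sentence.
print(strict.search('There was a cat in the house yesterday').groups())
```

If the first and last words must really sit at the edges of the string, the anchored form is probably the safer choice.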

A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations, or use a non-capturing group instead if you're not interested in the data.

Source: https://regex101.com/
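To see what that note means for the regexes in the question, here is a small sketch that runs the three variants side by side. The variable names are only illustrative; the (?:...) non-capturing syntax is standard in Python's re module.

```python
import re

s = 'There was a cat in the house yesterday'

# Repeated capturing group: group 2 keeps only its last iteration ('e').
last_only = re.match(r'^(There|Did) ( |[A-Za-z])+ (today|yesterday)$', s)
print(last_only.groups())

# Outer group added: it captures the whole run, but the inner group
# is still a capturing group, so its last iteration shows up as an extra 'e'.
nested = re.match(r'^(There|Did) (( |[A-Za-z])+) (today|yesterday)$', s)
print(nested.groups())

# Inner group made non-capturing: only the three wanted groups remain.
non_capturing = re.match(r'^(There|Did) ((?: |[A-Za-z])+) (today|yesterday)$', s)
print(non_capturing.groups())
```

So the extra "e" is not a backtracking artifact: groups are numbered by their opening parenthesis, and the inner parentheses simply count as a fourth group.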

And here is the regex working as expected:

^(There|Did) ([ A-Za-z]+) (today|yesterday)$
 r = re.compile('^(There|Did) (( |[A-Za-z])+) (today|yesterday)$')
                               ^ ^        ^

The marked characters (the inner parentheses and the |) are unnecessary. Take them out and put the space inside the character class of your middle group:

r = re.compile('^(There|Did) ([A-Za-z ]+) (today|yesterday)$')
                                     ^ space

Example:

>>> r = re.compile('^(There|Did) ([A-Za-z ]+) (today|yesterday)$')
>>> r.search('There was a cat in the house yesterday').groups()
('There', 'was a cat in the house', 'yesterday')

Also, take out the literal spaces between your capture groups if you want the surrounding spaces to be part of your middle (2nd) group.
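A minimal sketch of that last suggestion, assuming the goal is the exact output from the question (with the leading and trailing spaces inside the middle group); the name pattern is just for illustration:

```python
import re

# No literal spaces between the groups; the space lives in the character class,
# so the spaces around the middle text are captured as part of group 2.
pattern = re.compile(r'^(There|Did)([A-Za-z ]+)(today|yesterday)$')

for s in ('There was a cat in the house yesterday',
          'Did you see a cat in the house today'):
    print(pattern.match(s).groups())
```

The greedy [A-Za-z ]+ first swallows the rest of the string and then backtracks just far enough for the final group to match at the end.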
