re.findall not returning full match?

I have a file that contains a bunch of strings like "size=XXX;". I am trying Python's re module for the first time and am a bit mystified by the following behavior: if I use a pipe for 'or' in a regular expression, only that part of the match is returned. For example:

>>> myfile = open('testfile.txt', 'r').read()
>>> re.findall('size=50;', myfile)
['size=50;', 'size=50;', 'size=50;', 'size=50;']

>>> re.findall('size=51;', myfile)
['size=51;', 'size=51;', 'size=51;']

>>> re.findall('size=(50|51);', myfile)
['51', '51', '51', '50', '50', '50', '50']

>>> re.findall(r'size=(50|51);', myfile)
['51', '51', '51', '50', '50', '50', '50']

The "size=" part of the match is gone (yet it is certainly used in the search, otherwise there would be more results). What am I doing wrong?

The problem is that if the regex passed to re.findall contains capturing groups (i.e. portions of the regex enclosed in parentheses), then it is the groups that are returned, rather than the whole matched string.
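As a rough illustration of that rule, the sketch below runs findall with zero, one, and two capturing groups on a short sample string. The sample string and variable names are made up for this example.

```python
import re

# Illustrative sample text (not the asker's file).
text = 'size=50;size=51;'

# No groups: each item is the whole match.
no_groups = re.findall(r'size=\d+;', text)

# One capturing group: each item is only that group's text.
one_group = re.findall(r'size=(\d+);', text)

# Two capturing groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'(size)=(\d+);', text)

print(no_groups)   # ['size=50;', 'size=51;']
print(one_group)   # ['50', '51']
print(two_groups)  # [('size', '50'), ('size', '51')]
```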

One way to solve this issue is to use non-capturing groups (prefixed with ?:).

>>> import re
>>> s = 'size=50;size=51;'
>>> re.findall('size=(?:50|51);', s)
['size=50;', 'size=51;']

If the regex passed to re.findall does not capture anything, it returns the whole of the matched string.

Although using a character class might be the simplest option in this particular case, non-capturing groups provide a more general solution.
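To make the "more general" point concrete, here is a small sketch. A character class only works when the alternatives differ by a single character, whereas a non-capturing group can hold arbitrary alternatives. The value 128 and the sample string are assumptions chosen for illustration.

```python
import re

# Illustrative sample text; 128 is a made-up value that a character class cannot express.
text = 'size=50;size=128;size=51;'

# Character class: fine for 50 vs 51, which differ only in the last digit.
with_class = re.findall(r'size=5[01];', text)

# Non-capturing group: handles alternatives of any shape.
with_group = re.findall(r'size=(?:50|128);', text)

print(with_class)  # ['size=50;', 'size=51;']
print(with_group)  # ['size=50;', 'size=128;']
```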

When a regular expression contains parentheses, they capture their contents into groups, which changes the behaviour of findall() so that it returns only those groups. Here's the relevant section from the docs:

(...)

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].

To avoid this behaviour, you can use a non-capturing group:

>>> print(re.findall(r'size=(?:50|51);', myfile))
['size=51;', 'size=51;', 'size=51;', 'size=50;', 'size=50;', 'size=50;', 'size=50;']

Again, from the docs:

(?:...)

A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

In some cases a non-capturing group is not appropriate, for example with a regex that detects repeated words and therefore needs a backreference (example from the Python docs):

r'(\b\w+)\s+\1'

In this situation, to get the whole match, one can wrap the entire pattern in an outer group and take the first element of each result:

[groups[0] for groups in re.findall(r'((\b\w+)\s+\2)', text)]

Note that \1 has changed to \2, because the added outer group is now group 1.
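A runnable sketch of the repeated-word case, using a made-up sample sentence. It shows the outer-group approach and, as an alternative, finditer with the original \1 pattern left unchanged.

```python
import re

# Illustrative sentence containing two doubled words.
text = 'Paris in the the spring, and and so on'

# Outer group added, so the backreference becomes \2.
tuples = re.findall(r'((\b\w+)\s+\2)', text)
whole_from_findall = [groups[0] for groups in tuples]

# Alternative: keep the original pattern and read the full match from each Match object.
whole_from_finditer = [m.group(0) for m in re.finditer(r'(\b\w+)\s+\1', text)]

print(tuples)               # [('the the', 'the'), ('and and', 'and')]
print(whole_from_findall)   # ['the the', 'and and']
print(whole_from_finditer)  # ['the the', 'and and']
```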

'size=(50|51);' means you are looking for size=50 or size=51 but capturing only the 50 or 51 part (note the parentheses), so findall does not return the size= prefix.

If you want the size= returned, you can do:

re.findall('(size=50|size=51);',myfile)
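One thing to watch with this form: the semicolon sits outside the group, so it is matched but not returned. The sketch below, on a made-up sample string, shows that, along with a variant that moves the semicolon inside the outer group.

```python
import re

# Illustrative sample text (not the asker's file).
text = 'size=50;size=51;size=52;'

# The ';' is required for a match but lies outside the group, so it is dropped.
without_semicolon = re.findall(r'(size=50|size=51);', text)

# Putting the whole pattern in the group keeps the ';' in the result.
with_semicolon = re.findall(r'(size=(?:50|51);)', text)

print(without_semicolon)  # ['size=50', 'size=51']
print(with_semicolon)     # ['size=50;', 'size=51;']
```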

I think what you want is [] instead of (). [] denotes a set of characters, while () denotes a group. Try something like this:

print(re.findall('size=5[01];', myfile))

As others mentioned, the "problem" with re.findall is that it returns a list of strings or a list of tuples of strings, depending on the use of capture groups. If you don't want to change the capture groups you're using (that is, you don't want to switch to character classes [] or non-capturing groups (?:)), you can use finditer instead of findall. This gives an iterator of Match objects instead of just strings, so you can fetch the full match even when using capture groups:

import re

s = 'size=50;size=51;'
for m in re.finditer('size=(50|51);', s):
    print(m.group())

This prints:

size=50;
size=51;

And if you need a list, as with findall, you can use a list comprehension:

>>> [m.group() for m in re.finditer('size=(50|51);', s)]
['size=50;', 'size=51;']
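Because each Match object carries both the full match and the groups, the same loop can return both at once. A minimal sketch, reusing the sample string from above (the name pairs is just illustrative):

```python
import re

s = 'size=50;size=51;'

# group(0) is the full match; group(1) is the first capturing group.
pairs = [(m.group(0), m.group(1)) for m in re.finditer(r'size=(50|51);', s)]

print(pairs)  # [('size=50;', '50'), ('size=51;', '51')]
```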

Here is a clean solution: https://www.ocpsoft.org/tutorials/regular-expressions/or-in-regex/ (in case the website dies, here is the example; try it on regex101.com):

regex: ^I like (dogs|penguins), but not (lions|tigers).$

Try with:

I like dogs, but not lions.
I like dogs, but not tigers.
I like penguins, but not lions.
I like penguins, but not tigers.

Match 1
Full match 2-29 I like dogs, but not lions.
Group 1. 9-13 dogs
Group 2. 23-28 lions
...

But with regex: ^I like (?:dogs|penguins), but not (?:lions|tigers).$

Match 1
Full match 2-29 I like dogs, but not lions.
Match 2
Full match 30-58 I like dogs, but not tigers.
...
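The same comparison can be reproduced in Python with re.findall. This is only a sketch: it assumes the four sentences are separated by newlines, uses re.MULTILINE so that ^ and $ match at each line, and escapes the final dot (the original pattern leaves it unescaped, which matches any character).

```python
import re

# Sample sentences from the example above; variable names are illustrative.
text = (
    'I like dogs, but not lions.\n'
    'I like dogs, but not tigers.\n'
    'I like penguins, but not lions.\n'
    'I like penguins, but not tigers.'
)

capturing = r'^I like (dogs|penguins), but not (lions|tigers)\.$'
non_capturing = r'^I like (?:dogs|penguins), but not (?:lions|tigers)\.$'

# Capturing groups: findall returns one tuple of groups per match.
group_tuples = re.findall(capturing, text, flags=re.MULTILINE)

# Non-capturing groups: findall returns the full matched lines.
full_lines = re.findall(non_capturing, text, flags=re.MULTILINE)

print(group_tuples[0])  # ('dogs', 'lions')
print(full_lines[0])    # I like dogs, but not lions.
```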
