
Python findall does not return expected values

I have some strings that contain information between pairs of quotes, like:

cc "1/11/2A" "1/20+21/1 1" "XX" 0

I am using re.findall('\\"*\\"', line) to match the parts between quotes, but it doesn't work for some reason. I tried many other things, but all I get is an empty list. What am I doing wrong?

You are matching 0 or more quotes followed by a quote. Use a negative character class instead:

re.findall(r'"[^"]*"', line)

You may want to put a capturing group around the negative character class:

re.findall(r'"([^"]*)"', line)

and now .findall() returns everything within quotes, not including the quotes themselves:

>>> import re
>>> re.findall(r'"([^"]*)"', 'cc "1/11/2A" "1/20+21/1 1" "XX" 0')
['1/11/2A', '1/20+21/1 1', 'XX']

The [^...] negative character class notation means: match any character that is not included in the set of characters named here. [^"] thus matches any character that is not a quote, neatly limiting the matched characters to everything that is within quotes.
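As a minimal sketch of how the capturing-group pattern could be wrapped for reuse: the helper name quoted_fields and the constant QUOTED below are made up for illustration and are not part of the re module. Because [^"]* can also match zero characters, an empty pair of quotes yields an empty string.

```python
import re

# Compiled once; the group captures everything between a pair of double quotes.
QUOTED = re.compile(r'"([^"]*)"')

def quoted_fields(line):
    """Return the text found between each pair of double quotes in line."""
    return QUOTED.findall(line)

print(quoted_fields('cc "1/11/2A" "1/20+21/1 1" "XX" 0'))  # ['1/11/2A', '1/20+21/1 1', 'XX']
print(quoted_fields('dd "" "a b" 7'))                      # ['', 'a b']
```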

It should be r'"[^"]*"'. Your pattern matches one or more " characters in a row.

In [4]: re.findall(r'"[^"]*"', line)
Out[4]: ['"1/11/2A"', '"1/20+21/1 1"', '"XX"']
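To see what "one or more quote characters in a row" means in practice, here is a small sketch that runs the pattern with the unneeded backslashes dropped (written as r'"*"') against the sample line. The variable names are only for illustration.

```python
import re

line = 'cc "1/11/2A" "1/20+21/1 1" "XX" 0'

# '"*"' means: zero or more quotes, then one quote. Nothing in the pattern
# can consume the text between the quotes, so each match is just a quote.
quote_runs = re.findall(r'"*"', line)
print(quote_runs)  # ['"', '"', '"', '"', '"', '"']

# Adjacent quotes are swallowed together as a single run.
adjacent = re.findall(r'"*"', 'a "" b')
print(adjacent)    # ['""']
```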

It looks like you were expecting * to match "anything", the way it does in filename wildcards.

But that's not what it means in regex. It modifies the preceding expression, to match zero or more copies of that expression.
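A quick side-by-side may make the difference concrete. This sketch uses the standard-library fnmatch module for the filename-style wildcard. The file name and sample text are arbitrary.

```python
import fnmatch
import re

# In a filename wildcard, * stands for "any run of characters" on its own.
wildcard_hit = fnmatch.fnmatch('report.txt', '*.txt')
print(wildcard_hit)  # True

# In a regex, * repeats whatever comes immediately before it (here, "b").
repeats = re.findall(r'ab*', 'a ab abbb')
print(repeats)       # ['a', 'ab', 'abbb']
```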

To get a filename-style wildcard, you want to use .*.

However, that won't actually work, because . matches any character (except a newline, by default), including ". So, it will grab everything up to the very last " character, leaving only that for the rest of the expression, meaning findall will find one big string instead of three small ones.
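For illustration, this is what the greedy version returns on the sample line from the question (the variable names are arbitrary):

```python
import re

line = 'cc "1/11/2A" "1/20+21/1 1" "XX" 0'

# Greedy .* runs to the end of the line, then backs up only as far as
# the last quote, so the three quoted parts come back as a single match.
greedy = re.findall(r'".*"', line)
print(greedy)       # ['"1/11/2A" "1/20+21/1 1" "XX"']
print(len(greedy))  # 1
```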

You can fix that by making the repetition non-greedy, with .*?. This will match everything up to the first ".

So:

>>> re.findall('\".*?\"', line)
['"1/11/2A"', '"1/20+21/1 1"', '"XX"']

I think Martijn Pieters's answer is probably conceptually clearer; I've only offered this because I think this may be the way you were trying to attack the problem, and I wanted to show how you could have gotten there.

As a side note, regex code is much easier to read if you use raw strings, so you can get rid of the excess backslash escapes. In this case, the backslashes are already unnecessary: you don't need to escape double quotes in either a single-quoted string or a regex. But instead of trying to remember what does and doesn't need to be escaped by the Python parser so it can get to the regex parser, it's easier to just use raw strings. So:

>>> re.findall(r'".*?"', line)
['"1/11/2A"', '"1/20+21/1 1"', '"XX"']
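As a small sketch of why the raw-string habit pays off: the three spellings of the quote pattern really are the same string, but an escape such as \b is not so forgiving. The word-boundary example is an added illustration, not part of the original question.

```python
import re

# A double quote needs no escaping inside a single-quoted Python string,
# so these three spellings produce the same pattern.
same_pattern = '\".*?\"' == '".*?"' == r'".*?"'
print(same_pattern)  # True

# Where raw strings matter: '\b' is a backspace character to Python,
# while r'\b' reaches the regex engine as a word boundary.
plain = re.findall('\bcat\b', 'cat concat cat')
raw = re.findall(r'\bcat\b', 'cat concat cat')
print(plain)  # []
print(raw)    # ['cat', 'cat']
```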
