
re.findall behaves weird

The source string is:

# Python 3.4.3
s = r'abc123d, hello 3.1415926, this is my book'

And here is my pattern:

pattern = r'-?[0-9]+(\\.[0-9]*)?|-?\\.[0-9]+'

However, re.search gives me the correct result:

m = re.search(pattern, s)
print(m)  # output: <_sre.SRE_Match object; span=(3, 6), match='123'>

But re.findall just dumps out a list of empty strings:

L = re.findall(pattern, s)
print(L)  # output: ['', '', '']

Why can't re.findall give me the expected list:

['123', '3.1415926']

There are two things to note here:

  • re.findall returns the captured texts, not the whole matches, if the regex pattern contains capturing groups
  • the r'\\.' part in your pattern matches two consecutive chars: a literal \ and any char other than a newline.

See the findall reference:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
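
A small sketch of the three cases the documentation describes. The sample string and patterns are made up for illustration and are not from the question:

```python
import re

text = "x=1, y=22"  # hypothetical sample input

# No capturing group: findall returns the full matches.
no_group = re.findall(r"[a-z]=\d+", text)

# One capturing group: findall returns only what the group captured.
one_group = re.findall(r"[a-z]=(\d+)", text)

# Two capturing groups: findall returns a list of tuples.
two_groups = re.findall(r"([a-z])=(\d+)", text)

print(no_group)    # ['x=1', 'y=22']
print(one_group)   # ['1', '22']
print(two_groups)  # [('x', '1'), ('y', '22')]
```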

Note that to make re.findall return just the match values, you can usually

  • remove redundant capturing groups (e.g. (a(b)c) -> abc)
  • convert all capturing groups into non-capturing ones (that is, replace ( with (?: ) unless there are backreferences in the pattern that refer to the group values (in that case, see the next item)
  • use re.finditer instead ( [x.group() for x in re.finditer(pattern, s)] )
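
The backreference case is where re.finditer helps, since the group has to stay capturing. A minimal sketch with a made-up pattern and input:

```python
import re

text = "aa bc dd e"  # hypothetical sample input

# The group is required by the backreference \1, so it cannot become (?:...).
doubled = r"([a-z])\1"

captured_only = re.findall(doubled, text)
full_matches = [m.group() for m in re.finditer(doubled, text)]

print(captured_only)  # ['a', 'd']
print(full_matches)   # ['aa', 'dd']
```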

In your case, findall returned only empty captured texts: the \\ inside your r'' string literal tries to match a literal \ , which never occurs in the input, so the capturing group never took part in any match.
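
To see this concretely, the sketch below checks what r'\\.' matches on two made-up inputs, then lists the full matches of the question's pattern with re.finditer. Each of the three matches has an empty group, which is where the three empty strings come from:

```python
import re

# r'\\.' is an escaped backslash followed by "any character".
double_escaped = r'\\.'
no_backslash = re.findall(double_escaped, '3.14')     # no backslash in the text
with_backslash = re.findall(double_escaped, r'a\bc')  # backslash followed by 'b'
print(no_backslash)    # []
print(with_backslash)  # ['\\b']

# Full matches of the pattern from the question.
s = r'abc123d, hello 3.1415926, this is my book'
pattern = r'-?[0-9]+(\\.[0-9]*)?|-?\\.[0-9]+'
full_matches = [m.group() for m in re.finditer(pattern, s)]
group_values = re.findall(pattern, s)
print(full_matches)  # ['123', '3', '1415926']
print(group_values)  # ['', '', '']
```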

To match the numbers, you need to use

-?\d*\.?\d+

The regex matches:

  • -? - Optional minus sign
  • \d* - Optional digits
  • \.? - Optional decimal separator
  • \d+ - 1 or more digits.
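
A quick check of this pattern against a few inputs. The sample strings are made up for illustration. Note that the pattern ends in \d+, so a trailing dot as in '10.' is left out of the match:

```python
import re

number = re.compile(r'-?\d*\.?\d+')

samples = ['123', '3.1415926', '-0.5', '.75', '10.']  # hypothetical inputs
results = {text: number.findall(text) for text in samples}

for text, found in results.items():
    print(repr(text), '->', found)
# '123' -> ['123']
# '3.1415926' -> ['3.1415926']
# '-0.5' -> ['-0.5']
# '.75' -> ['.75']
# '10.' -> ['10']
```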

Here is an IDEONE demo:

import re
s = r'abc123d, hello 3.1415926, this is my book'
pattern = r'-?\d*\.?\d+'
L = re.findall(pattern, s)
print(L)

import re
s = r'abc123d, hello 3.1415926, this is my book'
print(re.findall(r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+', s))

You don't need to escape twice when you are using a raw string.

Output: ['123', '3.1415926']

Also, the return type will be a list of strings. If you want the values as integers and floats, use map:

import re, ast
s = r'abc123d, hello 3.1415926, this is my book'
# In Python 3, map returns an iterator, so wrap it in list() before printing
print(list(map(ast.literal_eval, re.findall(r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+', s))))

Output: [123, 3.1415926]
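
ast.literal_eval parses Python literal syntax, which covers more than plain numbers. As an alternative sketch, the conversion can be done with int and float directly. The helper name to_number is hypothetical and not part of the answer:

```python
import re

def to_number(token):
    """Convert a matched numeric string to float if it has a dot, else to int."""
    return float(token) if '.' in token else int(token)

s = r'abc123d, hello 3.1415926, this is my book'
tokens = re.findall(r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+', s)
numbers = [to_number(t) for t in tokens]
print(numbers)  # [123, 3.1415926]
```

One difference worth knowing: int() accepts digit strings with leading zeros such as '007', which Python 3 does not accept as an integer literal.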

Just to explain why search seemed to return what you wanted while findall didn't:

search returns an SRE_Match object that holds information such as:

  • string : the string that was passed to the search function.
  • re : the REGEX object used in the search function.
  • groups() : a tuple of the strings captured by the capturing groups inside the REGEX.
  • group(index) : retrieves the string captured by a group, using index > 0.
  • group(0) : returns the whole string matched by the REGEX.

search stops when it finds the first match, builds the SRE_Match object, and returns it. Check this code:

import re

s = r'abc123d'
pattern = r'-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+'
m = re.search(pattern, s)
print(m.string)  # 'abc123d'
print(m.group(0))  # REGEX matched 123
print(m.groups())  # (None,): the only group, (\.[0-9]*), did not take part in the match

s = ', hello 3.1415926, this is my book'
m2 = re.search(pattern, s)
print(m2.string)    # ', hello 3.1415926, this is my book'
print(m2.group(0))  # REGEX matched 3.1415926
print(m2.groups())  # ('.1415926',): the group captured this part

findall behaves differently: it doesn't stop at the first match but keeps extracting matches until the end of the text. However, if the REGEX contains at least one capturing group, findall doesn't return the matched strings but the strings captured by the capturing groups:

import re
s = r'abc123d , hello 3.1415926, this is my book'
pattern = r'-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+'
m = re.findall(pattern, s)
print(m)  # ['', '.1415926']

The first element comes from the first match, '123', where the capturing group did not take part, so findall reports '' . The second element comes from the second match, '3.1415926', where the capturing group captured the part '.1415926' .
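
To see the whole match and the captured group side by side, re.finditer can be used on the same input. This is a small illustrative sketch. A group that did not take part shows up as None here, while findall reports it as '':

```python
import re

s = r'abc123d , hello 3.1415926, this is my book'
pattern = r'-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+'

# Pair each full match, group(0), with the text captured by group 1.
pairs = [(m.group(0), m.group(1)) for m in re.finditer(pattern, s)]
print(pairs)  # [('123', None), ('3.1415926', '.1415926')]
```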

If you want findall to return the matched strings, turn every capturing group () in your REGEX into a non-capturing group (?:) :

import re
s = r'abc123d , hello 3.1415926, this is my book'
pattern = r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+'
m = re.findall(pattern, s)
print(m)  # ['123', '3.1415926']
