re.findall behaves weird

Question

The source string is:

# Python 3.4.3
s = r'abc123d, hello 3.1415926, this is my book'

Here is my pattern:

pattern = r'-?[0-9]+(\\.[0-9]*)?|-?\\.[0-9]+'

However, re.search gives me the correct result:

m = re.search(pattern, s)
print(m)  # output: <_sre.SRE_Match object; span=(3, 6), match='123'>

re.findall just dumps out a list of empty strings:

L = re.findall(pattern, s)
print(L)  # output: ['', '', '']

Why can't re.findall give me the expected list:

['123', '3.1415926']

Answer 1

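There are two things to note here:

re.findall returns the captured texts if the regex pattern contains capturing groups.
The r'\\.' part of your pattern matches two consecutive characters: a literal \ and any character other than a newline.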

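See the findall reference:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.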
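A minimal sketch of that rule. The sample string and the three patterns are made up for illustration and do not come from the question; the same text is scanned with zero, one, and two capturing groups.

```python
import re

text = "x=1, y=22"  # illustrative sample string, not from the question

# No capturing group: findall returns the full matches.
no_group = re.findall(r"[a-z]=\d+", text)

# One capturing group: findall returns only what the group captured.
one_group = re.findall(r"[a-z]=(\d+)", text)

# Two capturing groups: findall returns a list of tuples, one per match.
two_groups = re.findall(r"([a-z])=(\d+)", text)

print(no_group)    # ['x=1', 'y=22']
print(one_group)   # ['1', '22']
print(two_groups)  # [('x', '1'), ('y', '22')]
```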

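Note that to make re.findall return just the match values, you can usually

remove redundant capturing groups (e.g. (a(b)c) -> abc)
convert all capturing groups into non-capturing ones (that is, replace ( with (?:), unless the pattern contains backreferences to the group values (in that case, see the next option)
use re.finditer instead ([x.group() for x in re.finditer(pattern, s)])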
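The last two options can be sketched as follows. The numeric pattern is the corrected single-backslash form of the one in the question; the backreference pattern and its sample string are made up for illustration.

```python
import re

s = "abc123d, hello 3.1415926, this is my book"

# Non-capturing group: findall returns the whole matches.
non_capturing = re.findall(r"-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+", s)

# Capturing group kept: take group() of every match from finditer.
pattern = r"-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+"
via_finditer = [m.group() for m in re.finditer(pattern, s)]

# A backreference (\1) needs a real capturing group, so the group cannot be
# made non-capturing; finditer still gives the whole matches.
doubled = [m.group() for m in re.finditer(r"([a-z])\1", "hello book")]

print(non_capturing)  # ['123', '3.1415926']
print(via_finditer)   # ['123', '3.1415926']
print(doubled)        # ['ll', 'oo']
```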

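In your case, findall returned captured texts that were all empty, because the \\ inside your r'' string literal tries to match a literal \, which the input does not contain, so the group never takes part in a match.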
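A short sketch of what the double backslash does, using the pattern and string from the question. The last input, which contains a backslash, is a made-up sample added only to show the group capturing something.

```python
import re

s = r'abc123d, hello 3.1415926, this is my book'

# Inside a raw string, \\ is a regex for one literal backslash and . is "any char".
broken = r'-?[0-9]+(\\.[0-9]*)?|-?\\.[0-9]+'

# The string has no backslash, so the group never participates and the
# number 3.1415926 is split at the dot.
whole_matches = [m.group() for m in re.finditer(broken, s)]
captured = re.findall(broken, s)

# With a backslash in the input (made-up sample), the group does capture.
with_backslash = re.findall(broken, r'12\x34')

print(whole_matches)   # ['123', '3', '1415926']
print(captured)        # ['', '', '']
print(with_backslash)  # ['\\x34']
```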

To match the numbers, you need to use

-?\d*\.?\d+

The regex matches:

-? - an optional minus sign
\d* - optional digits
\.? - an optional decimal separator
\d+ - 1 or more digits.

Here is the IDEONE demo:

import re
s = r'abc123d, hello 3.1415926, this is my book'
pattern = r'-?\d*\.?\d+'
L = re.findall(pattern, s)
print(L)
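To see how the four parts interact, here is a small probe of the same pattern. The inputs are made up and are not part of the original demo.

```python
import re

pattern = r'-?\d*\.?\d+'

# Made-up inputs to probe the pattern; they are not from the question.
samples = ['42', '-7', '3.14', '.5', '-.5', '5.']
results = {text: re.findall(pattern, text) for text in samples}

for text, found in results.items():
    print(repr(text), '->', found)
```

Note that '5.' yields only ['5']: since the pattern must end in \d+, a trailing decimal point is left out of the match, whereas the [0-9]+(\.[0-9]*)? form in the question would include it.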

Answer 2

import re

s = r'abc123d, hello 3.1415926, this is my book'
# (?:...) is a non-capturing group, so findall returns the whole matches
print(re.findall(r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+', s))

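You don't need to escape twice when you are using raw mode.

Output: ['123', '3.1415926']

Also, the return type will be a list of strings. If you want integers and floats instead, use map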

import ast
import re

s = r'abc123d, hello 3.1415926, this is my book'
# In Python 3, map returns an iterator, so wrap it in list() before printing
print(list(map(ast.literal_eval, re.findall(r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+', s))))

Output: [123, 3.1415926]
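As an alternative sketch, the conversion can be done without ast. The to_number helper below is hypothetical and not part of the answer: it tries int first and falls back to float. One possible reason to prefer it is that, as far as I know, ast.literal_eval rejects tokens with leading zeros such as '007' in Python 3, while int accepts them.

```python
import re


def to_number(token):
    """Hypothetical helper: return an int when possible, otherwise a float."""
    try:
        return int(token)
    except ValueError:
        return float(token)


s = 'abc123d, hello 3.1415926, this is my book'
tokens = re.findall(r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+', s)
numbers = [to_number(t) for t in tokens]

print(numbers)  # [123, 3.1415926]
```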

Answer 3

Just to explain why you think search returned what you wanted and findall didn't:

search returns an SRE_Match object that holds some information, such as:

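string: attribute holding the string that was passed to the search function.
re: the REGEX object used by the search function.
groups(): tuple of the strings captured by the capturing groups in the REGEX.
group(index): retrieves the string captured by a group, using index > 0.
group(0): returns the string matched by the REGEX.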

search stops as soon as it finds the first match, builds the SRE_Match object and returns it. Check the following code:

import re

s = r'abc123d'
pattern = r'-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+'
m = re.search(pattern, s)
print(m.string)  # 'abc123d'
print(m.group(0))  # REGEX matched 123
print(m.groups())  # the only group, (\.[0-9]*), did not take part in the match, so this prints (None,)

s = ', hello 3.1415926, this is my book'
m2 = re.search(pattern, s)
print(m2.string)    # ', hello 3.1415926, this is my book'
print(m2.group(0))  # REGEX matched 3.1415926
print(m2.groups())  # the captured group has captured this part '.1415926'

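findall behaves differently: it does not stop at the first match but keeps extracting until the end of the text. However, if the REGEX contains at least one capturing group, findall does not return the matched strings; it returns the strings captured by the capturing groups: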

import re
s = r'abc123d , hello 3.1415926, this is my book'
pattern = r'-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+'
m = re.findall(pattern, s)
print(m)  # ['', '.1415926']

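The first element comes from the first match, '123', where the capturing group took no part, so findall reports ''. The second element comes from the second match, '3.1415926', where the capturing group matched the part '.1415926'.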
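To make this visible, the sketch below pairs each whole match with the group's capture via re.finditer, reusing the string and pattern from the example above.

```python
import re

s = r'abc123d , hello 3.1415926, this is my book'
pattern = r'-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+'

# Pair each whole match (group 0) with what the capturing group (group 1) captured.
pairs = [(m.group(0), m.group(1)) for m in re.finditer(pattern, s)]
print(pairs)  # [('123', None), ('3.1415926', '.1415926')]

# findall reports the group that took no part as '' instead of None.
from_findall = re.findall(pattern, s)
print(from_findall)  # ['', '.1415926']
```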

If you want findall to return the matched strings, you should turn every capturing group () in your REGEX into a non-capturing group (?:):

import re
s = r'abc123d , hello 3.1415926, this is my book'
pattern = r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+'
m = re.findall(pattern, s)
print(m)  # ['123', '3.1415926']

Answer 1
28 2015-08-10 08:40:10

Answer 2
14 Accepted 2015-08-10 08:41:43

Answer 3
3 2019-10-06 13:53:40
