Differences in re.findall and re.finditer: bug in Python 2.7 re module?
While demonstrating Python's regex functionality, I wrote a small program to compare the return values of re.search(), re.findall() and re.finditer(). I'm aware that re.search() will only find the first match in each line and that re.findall() only returns the matched substring(s) and not any location information. However, I was surprised to see that the matched substring can differ between the three functions.
Code (available on GitHub):
#! /usr/bin/env python
# -*- coding: utf-8 -*-
# License: CC-BY-NC-SA 3.0

import re
import codecs

# download kate_chopin_the_awakening_and_other_short_stories.txt
# from Project Gutenberg:
# http://www.gutenberg.org/ebooks/160.txt.utf-8
# with wget:
# wget http://www.gutenberg.org/ebooks/160.txt.utf-8 -O kate_chopin_the_awakening_and_other_short_stories.txt

# match for something o'clock, with valid numerical time or
# any English word with proper capitalization
oclock = re.compile(r"""
    (
      [A-Z]?[a-z]+  # word with at most one capital letter
      | 1[012]      # 10, 11, 12
      | [1-9]       # 1 through 9
    )
    \s
    o'clock""",
    re.VERBOSE)

path = "kate_chopin_the_awakening_and_other_short_stories.txt"

print
print "re.search()"
print
print u"{:>6} {:>6} {:>6}\t{}".format("Line","Start","End","Match")
print u"{:=>6} {:=>6} {:=>6}\t{}".format('','','','=====')

with codecs.open(path,mode='r',encoding='utf-8') as f:
    for lineno, line in enumerate(f):
        atime = oclock.search(line)
        if atime:
            print u"{:>6} {:>6} {:>6}\t{}".format(lineno,
                                                  atime.start(),
                                                  atime.end(),
                                                  atime.group())

print
print "re.findall()"
print
print u"{:>6} {:>6} {:>6}\t{}".format("Line","Start","End","Match")
print u"{:=>6} {:=>6} {:=>6}\t{}".format('','','','=====')

with codecs.open(path,mode='r',encoding='utf-8') as f:
    for lineno, line in enumerate(f):
        times = oclock.findall(line)
        if times:
            print u"{:>6} {:>6} {:>6}\t{}".format(lineno,
                                                  '',
                                                  '',
                                                  ' '.join(times))

print
print "re.finditer()"
print
print u"{:>6} {:>6} {:>6}\t{}".format("Line","Start","End","Match")
print u"{:=>6} {:=>6} {:=>6}\t{}".format('','','','=====')

with codecs.open(path,mode='r',encoding='utf-8') as f:
    for lineno, line in enumerate(f):
        times = oclock.finditer(line)
        for m in times:
            print u"{:>6} {:>6} {:>6}\t{}".format(lineno,
                                                  m.start(),
                                                  m.end(),
                                                  m.group())

Output (tested on Python 2.7.3 and 2.7.5):

re.search()

  Line  Start    End    Match
====== ====== ======    =====
   248      7     21    eleven o'clock
  1520     24     35    one o'clock
  1975     21     33    nine o'clock
  2106      4     16    four o'clock
  4443     19     30    ten o'clock

re.findall()

  Line  Start    End    Match
====== ====== ======    =====
   248                  eleven
  1520                  one
  1975                  nine
  2106                  four
  4443                  ten

re.finditer()

  Line  Start    End    Match
====== ====== ======    =====
   248      7     21    eleven o'clock
  1520     24     35    one o'clock
  1975     21     33    nine o'clock
  2106      4     16    four o'clock
  4443     19     30    ten o'clock

What am I missing here? Why doesn't re.findall() return the o'clock part?

According to the re.findall documentation:

... If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

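The pattern in the question contains exactly one group, so findall returns a list of what that group matched (the word or number in front of o'clock), not the whole match. This is documented behavior, not a bug:
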
>>> import re
>>> re.findall('abc', 'abc')
['abc']
>>> re.findall('a(b)c', 'abc')
['b']
>>> re.findall('a(b)(c)', 'abc')
[('b', 'c')]
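
To tie this back to the question, here is a minimal sketch showing that findall's result lines up with group(1) of each match, while finditer exposes the whole match through group(0). The pattern is the one from the question; the sample sentence is made up for illustration and is not taken from the book.

```python
import re

# Same pattern as in the question: one capturing group before "o'clock".
oclock = re.compile(r"""
    (
      [A-Z]?[a-z]+   # word with at most one capital letter
      | 1[012]       # 10, 11, 12
      | [1-9]        # 1 through 9
    )
    \s
    o'clock""",
    re.VERBOSE)

# Hypothetical sample text (an assumption, not from the Gutenberg file).
line = "We met at four o'clock and left at 11 o'clock."

# group(0) is the entire match, group(1) is the first capturing group.
whole_matches = [m.group(0) for m in oclock.finditer(line)]
group_matches = [m.group(1) for m in oclock.finditer(line)]

# With exactly one group in the pattern, findall returns that group's text.
findall_result = oclock.findall(line)

print(whole_matches)
print(group_matches)
print(findall_result)
```

So if the group is worth keeping, finditer with m.group(0) and m.group(1) gives both the full match and the captured part, plus positions via m.start() and m.end().
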

Using the non-capturing version of parentheses, findall returns the whole match:
>>> re.findall('a(?:b)c', 'abc')
['abc']
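
Applied to the pattern from the question, that could look like the sketch below. The name oclock_nc and the sample sentence are assumptions made for illustration; only the opening parenthesis of the group changes, from ( to (?: .

```python
import re

# Variant of the question's pattern with a non-capturing group.
# oclock_nc is a hypothetical name chosen for this example.
oclock_nc = re.compile(r"""
    (?:              # non-capturing: groups the alternatives only
      [A-Z]?[a-z]+   # word with at most one capital letter
      | 1[012]       # 10, 11, 12
      | [1-9]        # 1 through 9
    )
    \s
    o'clock""",
    re.VERBOSE)

# Hypothetical sample text (an assumption, not from the Gutenberg file).
line = "We met at four o'clock and left at 11 o'clock."

# With no capturing groups in the pattern, findall returns whole matches.
full_matches = oclock_nc.findall(line)
print(full_matches)
```

With this change all three functions report the same substrings. findall still carries no position information, so finditer remains the one to use when start and end offsets are needed.
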