
Different behavior between re.finditer and re.findall

I am using the following code:

CARRIS_REGEX=r'<th>(\d+)</th><th>([\s\w\.\-]+)</th><th>(\d+:\d+)</th><th>(\d+m)</th>'
pattern = re.compile(CARRIS_REGEX, re.UNICODE)
matches = pattern.finditer(mailbody)
findall = pattern.findall(mailbody)

But finditer and findall are finding different things. findall does find all the matches in the given string, but finditer only finds the first one, returning an iterator with only one element.

How can I make finditer and findall behave the same way?

Thanks

I can't reproduce this here. I have tried it with both Python 2.7 and 3.1.

One difference between finditer and findall is that the former yields regex match objects, whereas the latter returns, for each match, a tuple of the captured groups (or a single string if the pattern has exactly one group, or the entire match if it has none).

So

import re
CARRIS_REGEX=r'<th>(\d+)</th><th>([\s\w\.\-]+)</th><th>(\d+:\d+)</th><th>(\d+m)</th>'
pattern = re.compile(CARRIS_REGEX, re.UNICODE)
mailbody = open("test.txt").read()
for match in pattern.finditer(mailbody):
    print(match)
print()
for match in pattern.findall(mailbody):
    print(match)

prints

<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>
<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>
<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>
<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>

('790', 'PR. REAL', '21:06', '04m')
('758', 'PORTAS BENFICA', '21:10', '09m')
('790', 'PR. REAL', '21:14', '13m')
('758', 'PORTAS BENFICA', '21:21', '19m')
('790', 'PR. REAL', '21:29', '28m')
('758', 'PORTAS BENFICA', '21:38', '36m')
('758', 'SETE RIOS', '21:49', '47m')
('758', 'SETE RIOS', '22:09', '68m')

If you want the same output from finditer as you're getting from findall, you need

for match in pattern.finditer(mailbody):
    print(tuple(match.groups()))
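
As a rough check of that equivalence, here is a minimal sketch. The `sample` string is a made-up stand-in for the asker's mail body (it is not from the original post); the pattern is the one from the question.

```python
import re

# Hypothetical sample standing in for the asker's mail body.
sample = (
    "<th>790</th><th>PR. REAL</th><th>21:06</th><th>04m</th>"
    "<th>758</th><th>PORTAS BENFICA</th><th>21:10</th><th>09m</th>"
)
pattern = re.compile(
    r'<th>(\d+)</th><th>([\s\w\.\-]+)</th><th>(\d+:\d+)</th><th>(\d+m)</th>',
    re.UNICODE,
)

# findall() builds a list of group tuples directly.
from_findall = pattern.findall(sample)
# finditer() yields match objects; groups() turns each into the same tuple.
from_finditer = [m.groups() for m in pattern.finditer(sample)]

print(from_findall == from_finditer)
```

With a pattern that has several groups, the two lists should compare equal.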

re.findall(pattern, string)

findall() returns all non-overlapping matches of the pattern in the string as a list of strings (or a list of tuples if the pattern has more than one group).

re.finditer()

finditer() returns an iterator that yields match objects; Python reports its type as callable_iterator.

In both functions, the string is scanned from left to right and matches are returned in the order found.
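
To make the list-versus-iterator difference concrete, here is a small sketch with a made-up pattern and text (both are assumptions for illustration). It also shows that the iterator from finditer() can only be consumed once.

```python
import re

pattern = re.compile(r"\d+")
text = "10 apples, 20 pears, 30 plums"  # illustrative text, not from the question

as_list = pattern.findall(text)    # a list of strings, built eagerly
as_iter = pattern.finditer(text)   # an iterator of match objects, produced lazily

first_pass = [m.group() for m in as_iter]
second_pass = [m.group() for m in as_iter]  # the iterator is already exhausted

print(as_list)      # ['10', '20', '30']
print(first_pass)   # ['10', '20', '30']
print(second_pass)  # []
```

If an iterator from finditer() seems to hold fewer matches than expected, it is worth checking whether something else has already iterated over it.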

You can't make them behave the same way, because they're different. If you really want to create a list of results from finditer, then you could use a list comprehension:

>>> [match for match in pattern.finditer(mailbody)]
[...]

In general, use a for loop to access the matches returned by re.finditer:

>>> for match in pattern.finditer(mailbody):
...     ...
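
A slightly fuller sketch of such a loop, assuming a simple time pattern and a made-up text (neither comes from the question), could look like this:

```python
import re

pattern = re.compile(r"\d+:\d+")
text = "Buses at 21:06, 21:10 and 21:14"  # hypothetical sample text

positions = []
for match in pattern.finditer(text):
    # Each match object knows its text and where it was found.
    positions.append((match.group(), match.start(), match.end()))

print(positions)
```

Each tuple holds the matched text plus its start and end offsets, which findall() does not give you.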

I took this example from the "Regular expression operations" page of the Python 2.* documentation and describe it here in detail, with some modifications. To explain the whole example, let's take a string variable,

text = "He was carefully disguised but captured quickly by police."

and the compiled regular expression pattern,

regEX = r"\w+ly"
pattern = re.compile(regEX)

\w matches any word character (alphanumeric or underscore) and + matches one or more of the preceding token, so the whole pattern selects any run of word characters that ends with ly. Only two words ('carefully' and 'quickly') in the text satisfy the above regular expression.
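
One caveat worth sketching: \w+ly matches any run of word characters ending in ly, even inside a longer word. The text below is a made-up example, and adding \b word boundaries is my own suggestion rather than part of the original pattern.

```python
import re

text = "He was flying slowly."  # illustrative text, not from the answer

# Without boundaries, "fly" inside "flying" also counts as a match.
loose = re.findall(r"\w+ly", text)
# With \b on both sides, only whole words ending in "ly" match.
strict = re.findall(r"\b\w+ly\b", text)

print(loose)   # ['fly', 'slowly']
print(strict)  # ['slowly']
```

For the sentence used in this answer both patterns give the same two words, so the simpler one is fine here.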

Before moving on to re.findall() or re.finditer(), let's see what re.search() means in the Python 2.* documentation.

Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

The following lines give a basic understanding of re.search().

search = pattern.search(text)
print(search)
print(type(search))

#output
<re.Match object; span=(7, 16), match='carefully'>
<class 're.Match'>

It produces a match object (MatchObject in the Python 2.* documentation, re.Match in Python 3), which supports 13 methods and attributes according to the Python 2.* documentation. Its span() method returns the start and end positions (7 and 16 in the above example) of the matched word in the text variable. re.search() only considers the very first match; if there is no match, it returns None.
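
A short sketch of how the span and the None case might be handled in practice (the second input string is made up for illustration):

```python
import re

text = "He was carefully disguised but captured quickly by police."
pattern = re.compile(r"\w+ly")

match = pattern.search(text)
if match is not None:
    start, end = match.span()      # (7, 16) for 'carefully'
    print(match.group(), start, end)
    print(text[start:end])         # slicing with the span gives the same word

# search() returns None when nothing matches, so check before using the result.
no_match = pattern.search("no adverbs here")
print(no_match)
```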

Let's move on to the question. Before that, see what re.finditer() means in the Python 2.* documentation.

Return an iterator yielding MatchObject instances over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.

The next lines give a basic understanding of re.finditer().

finditer = pattern.finditer(text)
print(finditer)
print(type(finditer))

#output
<callable_iterator object at 0x040BB690>
<class 'callable_iterator'>

The above example gives us an iterator object, which needs to be looped over. This is obviously not the result we want. Let's loop over finditer and see what's inside this iterator object.

for anObject in finditer:
    print(anObject)
    print(type(anObject))
    print()

#output
<re.Match object; span=(7, 16), match='carefully'>
<class 're.Match'>

<re.Match object; span=(40, 47), match='quickly'>
<class 're.Match'>

These results are very similar to the re.search() result we got earlier, but the output above contains a new result, <re.Match object; span=(40, 47), match='quickly'>. As mentioned earlier, according to the Python 2.* documentation re.search() scans through the string looking for the first location where the regular expression pattern produces a match, whereas re.finditer() scans through the string looking for all the locations where the pattern produces matches, and it returns more details than the re.findall() method.

Here is what re.findall() means in the Python 2.* documentation.

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.

Let's see what happens in re.findall().

findall = pattern.findall(text)
print(findall)
print(type(findall))

#output
['carefully', 'quickly']
<class 'list'>

This output gives us only the matched words in the text variable; if nothing matches, it returns an empty list. Each string in that list corresponds to the match value shown in the match objects above (what match.group() returns).
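
The documentation's remark about groups can be sketched with three variants of the pattern. The grouped variants are my own additions for illustration, not part of the original example.

```python
import re

text = "He was carefully disguised but captured quickly by police."

no_group = re.findall(r"\w+ly", text)          # no groups: whole matches
one_group = re.findall(r"(\w+)ly", text)       # one group: only that group's text
two_groups = re.findall(r"(\w)(\w*)ly", text)  # two groups: a tuple per match

print(no_group)    # ['carefully', 'quickly']
print(one_group)   # ['careful', 'quick']
print(two_groups)  # [('c', 'areful'), ('q', 'uick')]
```

So the shape of the findall() result depends on how many capturing groups the pattern has, which is a common source of the "different results" confusion.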

Here is the full code, which I tried in Python 3.7.

import re

text = "He was carefully disguised but captured quickly by police."

regEX = r"\w+ly"
pattern = re.compile(regEX)

search = pattern.search(text)
print(search)
print(type(search))
print()

findall = pattern.findall(text)
print(findall)
print(type(findall))
print()

finditer = pattern.finditer(text)
print(finditer)
print(type(finditer))
print()
for anObject in finditer:
    print(anObject)
    print(type(anObject))
    print()

I came here trying to get a string from my .finditer()'s regex results.

The solution was practically that I needed to create at least one group, which enabled fetching it from the group dict:

-     yield from zip(re.finditer(r"\w+", line) ...
+     yield from zip(re.finditer(r"(\w+)", line) ...
...
-     block.(miscellaneous attempts)
+     block.group(1)
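
For what it's worth, here is a small sketch (with a made-up `line` value) suggesting that group() with no argument, or group(0), already returns the whole matched string even when the pattern has no explicit group, so adding a group is one option rather than a requirement.

```python
import re

line = "alpha beta"  # hypothetical input line

# No capturing group: group() / group(0) is the whole match.
without_group = [m.group() for m in re.finditer(r"\w+", line)]
whole_match = [m.group(0) for m in re.finditer(r"\w+", line)]

# One capturing group around the whole pattern: group(1) is the same text.
with_group = [m.group(1) for m in re.finditer(r"(\w+)", line)]

print(without_group)  # ['alpha', 'beta']
print(with_group)     # ['alpha', 'beta']
```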

Use finditer() when you are extracting from a large file, since it returns an iterator object, which helps save memory; findall(), on the other hand, returns a list. finditer() also returns its results differently than findall().

For example:


    import re

    text_to_search = r'''
    abcdefghijklmnopqurtuvwxyz
    ABCDEFGHIJKLMNOPQRSTUVWXYZ\s
    321-555-4321
    1234567890
    Ha HaHa
    MetaCharacters (Need to be escaped):
    . ^ $ * + ? { } [ ] \ | ( )
    khanafsaan11.com
    321-555-4321
    123.555.1234
    123*555*-1234
    123.555.1234
    800-555-1234
    900-555-1234
    Mr. Schafer
    Mr Smith
    Ms Davis
    Mrs. Robinson
    Mr. T
    Mr_hello
    '''
    pattern=re.compile(r'M(r|rs|s)\.? [A-Z][a-z]*')
    print(list(pattern.finditer(text_to_search))) #converted to list
    print(pattern.findall(text_to_search))

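Output (finditer() gives five re.Match objects, shown abbreviated here; findall() returns only the text captured by the single group):


    [<re.Match object; ...>, <re.Match object; ...>, <re.Match object; ...>, <re.Match object; ...>, <re.Match object; ...>]
    ['r', 'r', 's', 'rs', 'r']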

You can get plain strings, like findall() returns, from the finditer() output as follows (here the whole matched text rather than only the group):


    for obj in pattern.finditer(text_to_search):
        print(obj.group()) # group() is a method of the re.Match object
    # output
    Mr. Schafer
    Mr Smith
    Ms Davis
    Mrs. Robinson
    Mr. T
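
To line the two functions up exactly, a sketch along these lines may help. The `names` string is a shortened, made-up version of the sample text; the pattern is the one used above.

```python
import re

# Shortened stand-in for the name lines in text_to_search.
names = "Mr. Schafer\nMr Smith\nMs Davis\nMrs. Robinson\nMr. T\nMr_hello\n"
pattern = re.compile(r'M(r|rs|s)\.? [A-Z][a-z]*')

# group() is the whole match; group(1) is what findall() reports for one group.
whole = [m.group() for m in pattern.finditer(names)]
group1 = [m.group(1) for m in pattern.finditer(names)]

print(whole)                             # ['Mr. Schafer', 'Mr Smith', 'Ms Davis', 'Mrs. Robinson', 'Mr. T']
print(group1 == pattern.findall(names))  # True
```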

finditer() returns an iterator object, which helps with memory efficiency; it behaves like a generator, producing matches on demand.
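
A rough sketch of that laziness, assuming a large made-up string in place of a real file: with finditer() you can stop after a few matches, whereas findall() builds the full list first.

```python
import re
from itertools import islice

big_text = "word " * 100_000  # stand-in for the contents of a large file
pattern = re.compile(r"\w+")

# Only the first three matches are ever produced here.
first_three = [m.group() for m in islice(pattern.finditer(big_text), 3)]

# findall() materializes every match before returning.
all_words = pattern.findall(big_text)

print(first_three)     # ['word', 'word', 'word']
print(len(all_words))  # 100000
```

Note that the input string itself still has to fit in memory in both cases; the saving is in not holding all the results at once.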

def my_ranger(max_num):

