
Different behavior between re.finditer and re.findall

I am using the following code:

CARRIS_REGEX=r'<th>(\d+)</th><th>([\s\w\.\-]+)</th><th>(\d+:\d+)</th><th>(\d+m)</th>'
pattern = re.compile(CARRIS_REGEX, re.UNICODE)
matches = pattern.finditer(mailbody)
findall = pattern.findall(mailbody)

But finditer and findall are finding different things. Findall indeed finds all the matches in the given string. But finditer only finds the first one, returning an iterator with only one element.

How can I make finditer and findall behave the same way?

Thanks

I can't reproduce this here; I tried it with both Python 2.7 and 3.1.

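One difference between finditer and findall is that the former returns an iterator of regex match objects, whereas the latter returns a list: tuples of the captured groups when the pattern has more than one group, the single group's string when it has exactly one, and the entire match when there are no capturing groups.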

So

import re
CARRIS_REGEX=r'<th>(\d+)</th><th>([\s\w\.\-]+)</th><th>(\d+:\d+)</th><th>(\d+m)</th>'
pattern = re.compile(CARRIS_REGEX, re.UNICODE)
mailbody = open("test.txt").read()
for match in pattern.finditer(mailbody):
    print(match)
print()
for match in pattern.findall(mailbody):
    print(match)

prints

<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>
<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>
<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>
<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>

('790', 'PR. REAL', '21:06', '04m')
('758', 'PORTAS BENFICA', '21:10', '09m')
('790', 'PR. REAL', '21:14', '13m')
('758', 'PORTAS BENFICA', '21:21', '19m')
('790', 'PR. REAL', '21:29', '28m')
('758', 'PORTAS BENFICA', '21:38', '36m')
('758', 'SETE RIOS', '21:49', '47m')
('758', 'SETE RIOS', '22:09', '68m')

If you want the same output from finditer as you're getting from findall, you need

for match in pattern.finditer(mailbody):
    print(tuple(match.groups()))
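
As a quick check of that equivalence, here is a minimal sketch. The `sample` string is made up for illustration and is not the asker's mail body:

```python
import re

CARRIS_REGEX = r'<th>(\d+)</th><th>([\s\w\.\-]+)</th><th>(\d+:\d+)</th><th>(\d+m)</th>'
pattern = re.compile(CARRIS_REGEX, re.UNICODE)

# Hypothetical input with two rows, invented for this example.
sample = (
    "<tr><th>790</th><th>PR. REAL</th><th>21:06</th><th>04m</th></tr>"
    "<tr><th>758</th><th>PORTAS BENFICA</th><th>21:10</th><th>09m</th></tr>"
)

from_findall = pattern.findall(sample)
from_finditer = [match.groups() for match in pattern.finditer(sample)]

# Both are lists of 4-tuples, one tuple per matched row.
print(from_findall == from_finditer)
```

Since match.groups() already returns a tuple, the extra tuple() call is harmless but not required.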

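re.findall(pattern, string)

findall() returns all non-overlapping matches of pattern in string as a list of strings (or a list of tuples if the pattern contains more than one group).

re.finditer(pattern, string)

finditer() returns an iterator that yields a match object for each non-overlapping match.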

In both functions, the string is scanned from left to right and matches are returned in order found.
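
To make the return shapes concrete, here is a small sketch on a made-up string (the input and the patterns are assumptions chosen only for illustration):

```python
import re

text = "a1 b2 c3"  # made-up sample string

no_groups = re.findall(r"[a-z]\d", text)       # list of whole matches
one_group = re.findall(r"[a-z](\d)", text)     # list of that group's strings
two_groups = re.findall(r"([a-z])(\d)", text)  # list of tuples, one per match

# finditer yields match objects, so the whole match is always available.
whole = [m.group() for m in re.finditer(r"([a-z])(\d)", text)]

print(no_groups, one_group, two_groups, whole)
```

The number of capturing groups changes what findall() puts in its list, while finditer() always hands back match objects.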

You can't make them behave the same way, because they're different. If you really want to create a list of results from finditer, then you could use a list comprehension:

>>> [match for match in pattern.finditer(mailbody)]
[...]

In general, use a for loop to access the matches returned by re.finditer:

>>> for match in pattern.finditer(mailbody):
...     ...
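
One thing worth ruling out when a finditer result looks shorter than expected is that the iterator can only be consumed once. The sketch below uses a made-up pattern and input; it is offered as something to check, not as a diagnosis of the question's code:

```python
import re

pattern = re.compile(r"\d+")
matches = pattern.finditer("10 20 30")  # made-up input

first_pass = [m.group() for m in matches]
second_pass = [m.group() for m in matches]  # iterator is already exhausted

print(first_pass)
print(second_pass)
```

If the matches are needed more than once, storing them with list(pattern.finditer(...)) keeps them available.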

I took this example from the "Regular expression operations" page of the Python 2.* documentation and describe it here in detail, with some modifications. To explain the whole example, let's take a string variable,

text = "He was carefully disguised but captured quickly by police."

and a compiled regular expression pattern,

regEX = r"\w+ly"
pattern = re.compile(regEX)

\w matches any word character (alphanumeric or underscore), and + matches one or more of the preceding token, so the whole pattern selects any run of word characters that ends with ly. Only two words ('carefully' and 'quickly') in the text satisfy this regular expression.

Before moving on to re.findall() or re.finditer(), let's see how the Python 2.* documentation describes re.search().

Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

The following lines give a basic understanding of re.search().

search = pattern.search(text)
print(search)
print(type(search))

#output
<re.Match object; span=(7, 16), match='carefully'>
<class 're.Match'>

It produces a match object (re.MatchObject in the Python 2.* documentation, shown as re.Match in the Python 3 output above), which supports 13 methods and attributes according to the Python 2.* documentation. The span() method returns the start and end positions (7 and 16 in the example above) of the matched word in the text variable. re.search() only considers the very first match and returns None if there is no match.
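
As a small sketch of how the span can be used (the variable names here are chosen for illustration), the start and end positions slice the original string back to the matched word:

```python
import re

text = "He was carefully disguised but captured quickly by police."
match = re.search(r"\w+ly", text)

if match is not None:  # search() returns None when nothing matches
    start, end = match.span()
    print(match.group(), start, end)
    # Slicing the text with the span gives back the matched string.
    print(text[start:end] == match.group())
```

Checking for None before using the match avoids an AttributeError when the pattern is not found.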

Now to the question. First, here is how the Python 2.* documentation describes re.finditer().

Return an iterator yielding MatchObject instances over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.

The next lines give a basic understanding of re.finditer().

finditer = pattern.finditer(text)
print(finditer)
print(type(finditer))

#output
<callable_iterator object at 0x040BB690>
<class 'callable_iterator'>

The example above gives us an iterator object, which has to be looped over. This is obviously not the result we want, so let's loop over finditer and see what is inside the iterator.

for anObject in finditer:
    print(anObject)
    print(type(anObject))
    print()

#output
<re.Match object; span=(7, 16), match='carefully'>
<class 're.Match'>

<re.Match object; span=(40, 47), match='quickly'>
<class 're.Match'>

These results are very similar to the re.search() result we got earlier, but the output now contains a new result, <re.Match object; span=(40, 47), match='quickly'>. As mentioned earlier from the Python 2.* documentation, re.search() scans through the string looking for the first location where the pattern produces a match, whereas re.finditer() scans through the string for all locations where the pattern produces a match, and returns more detail than re.findall().

Here is how the Python 2.* documentation describes re.findall().

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.

Let's see what happens in re.findall().

findall = pattern.findall(text)
print(findall)
print(type(findall))

#output
['carefully', 'quickly']
<class 'list'>

This output gives us only the matched words in the text variable, or an empty list if there are none. Each item in the list corresponds to the match='...' value shown for the match objects above, which is what match.group() returns.
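
A short sketch of that correspondence, reusing the same text and pattern (the names `words` and `details` are made up for this example):

```python
import re

text = "He was carefully disguised but captured quickly by police."
pattern = re.compile(r"\w+ly")

# findall: just the matched strings.
words = pattern.findall(text)

# finditer: the same strings via group(), plus their positions via span().
details = [(m.group(), m.span()) for m in pattern.finditer(text)]

print(words)
print(details)
```

Collecting group() from each match object reproduces the findall() list, with the positions available as extra detail.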

Here is the full code, which I tried in Python 3.7.

import re

text = "He was carefully disguised but captured quickly by police."

regEX = r"\w+ly"
pattern = re.compile(regEX)

search = pattern.search(text)
print(search)
print(type(search))
print()

findall = pattern.findall(text)
print(findall)
print(type(findall))
print()

finditer = pattern.finditer(text)
print(finditer)
print(type(finditer))
print()
for anObject in finditer:
    print(anObject)
    print(type(anObject))
    print()

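I came here trying to get a string out of my .finditer() regex results.

In practice, the solution was to add at least one capturing group, which let me fetch the text with group(1). (Calling group() or group(0) on the match also returns the whole matched string, even without a capturing group.)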

-     yield from zip(re.finditer(r"\w+", line) ...
+     yield from zip(re.finditer(r"(\w+)", line) ...
...
-     block.(miscellaneous attempts)
+     block.group(1)
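
For comparison, here is a minimal sketch of both ways to pull the string out of a match object. The `line` value is made up, since the original snippet is elided:

```python
import re

line = "alpha beta"  # made-up input

# group(0) is the whole match and needs no capturing group.
without_group = [m.group(0) for m in re.finditer(r"\w+", line)]

# group(1) is the first capturing group, so the pattern must define one.
with_group = [m.group(1) for m in re.finditer(r"(\w+)", line)]

print(without_group == with_group)
```

Here the group wraps the whole pattern, so both approaches give the same strings.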

Use finditer() when you are extracting from a large file: it returns an iterator, which helps save memory, whereas findall() returns a list. finditer() also returns its results in a different form than findall().

For example:


    import re

    # Raw string, so the backslashes in the sample text are kept literally.
    text_to_search = r'''
    abcdefghijklmnopqurtuvwxyz
    ABCDEFGHIJKLMNOPQRSTUVWXYZ\s
    321-555-4321
    1234567890
    Ha HaHa
    MetaCharacters (Need to be escaped):
    . ^ $ * + ? { } [ ] \ | ( )
    khanafsaan11.com
    321-555-4321
    123.555.1234
    123*555*-1234
    123.555.1234
    800-555-1234
    900-555-1234
    Mr. Schafer
    Mr Smith
    Ms Davis
    Mrs. Robinson
    Mr. T
    Mr_hello
    '''
    pattern=re.compile(r'M(r|rs|s)\.? [A-Z][a-z]*')
    print(list(pattern.finditer(text_to_search))) #converted to list
    print(pattern.findall(text_to_search))

Output:


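    [<re.Match object ...>, <re.Match object ...>, <re.Match object ...>, <re.Match object ...>, <re.Match object ...>]
    ['r', 'r', 's', 'rs', 'r']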

You can get the full matched text from the finditer() output as follows (findall() returns only the captured group here, because the pattern contains one):


    for obj in pattern.finditer(text_to_search):
        print(obj.group())  # group() is a method of the re.Match object
    #output
    Mr. Schafer
    Mr Smith
    Ms Davis
    Mrs. Robinson
    Mr. T

finditer() returns an iterator object, which helps with memory efficiency: it produces matches lazily, much like a generator.
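
A rough sketch of that difference, assuming a large input that is simply built in memory here (the text and pattern are made up for illustration):

```python
import re

# Hypothetical large input, built in memory just for the example.
text = "id=1 " * 100_000
pattern = re.compile(r"id=(\d+)")

# finditer: matches are produced one at a time as the loop advances.
count = 0
for match in pattern.finditer(text):
    count += 1
    if count == 3:
        break  # stop early; the remaining matches are never computed

# findall: the complete list of results is built before it is returned.
all_ids = pattern.findall(text)

print(count, len(all_ids))
```

When only the first few matches are needed, or each match can be processed and discarded, finditer() avoids holding every result at once.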

def my_ranger(max_num):
