Why does re.findall() give me different results than re.finditer() in Python?

I wrote up this regular expression:

import re

p = re.compile(r'''
\[\[            #the first [[
[^:]*?          #no :s are allowed
.*?             #a bunch of chars
(
\|              #either go until a |
|\]\]           #or the last ]]
)
                ''', re.VERBOSE)

I want to use re.findall to get all the matching sections of some string. I wrote some test code, but it gives me bizarre results.

This code

g = p.finditer('   [[Imae|Lol]]     [[sdfef]]')
for elem in g:
    print(elem.span())
    print(elem.group())

gives me this output:

(3, 10)
[[Imae|
(20, 29)
[[sdfef]] 

Makes perfect sense, right? But when I do this:

h = p.findall('   [[Imae|Lol]]     [[sdfef]]')
for elem in h:
    print(elem)

the output is this:

|
]]  

Why isn't findall() printing out the same results as finditer()?

findall() returns a list of matching groups. The parentheses in your regex define a group that findall() thinks you want, but you don't want groups. (?:...) is a non-capturing group. Change your regex to:

p = re.compile(r'''
\[\[            #the first [[
[^:]*?          #no :s are allowed
.*?             #a bunch of chars
(?:             #non-capturing group
\|              #either go until a |
|\]\]           #or the last ]]
)
                ''', re.VERBOSE)
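
As a quick check of the non-capturing version, here is a minimal sketch that compiles the pattern above and runs findall() on the test string from the question. The variable name `matches` is only illustrative.

```python
import re

# The question's pattern with the trailing alternation made non-capturing.
p = re.compile(r'''
\[\[            # the first [[
[^:]*?          # no :s are allowed
.*?             # a bunch of chars
(?:             # non-capturing group
\|              # either go until a |
|\]\]           # or the last ]]
)
''', re.VERBOSE)

# With no capturing groups, findall() returns the full text of each match.
matches = p.findall('   [[Imae|Lol]]     [[sdfef]]')
print(matches)  # ['[[Imae|', '[[sdfef]]']
```

These are the same strings that finditer() produced through elem.group() in the question.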

When you give re.findall() a regex with groups (parenthesized expressions) in it, it returns the text captured by those groups rather than the whole match. Here, you've only got one group, and it's the | or ]] at the end. On the other hand, in the code where you use re.finditer(), you call elem.group() with no argument, which asks for no group in particular, so it gives you the entire match.

You can get re.findall() to do what you want by putting parentheses around the whole regex -- or just around the part you're actually trying to extract. Assuming you're trying to parse wiki links, that would be the "bunch of chars" in line 4. For example,

p = re.compile(r'''
\[\[            #the first [[
[^:]*?          #no :s are allowed
(.*?)           #a bunch of chars
(
\|              #either go until a |
|\]\]           #or the last ]]
)
                ''', re.VERBOSE)

p.findall('   [[Imae|Lol]]     [[sdfef]]')

returns:

[('Imae', '|'), ('sdfef', ']]')]
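
If only the link text is wanted, one option (a sketch, not the only way) is to combine both suggestions: keep the capturing parentheses around the "bunch of chars" and make the trailing alternation non-capturing. With a single capturing group left, findall() should return a flat list of strings instead of tuples. The name `link_pattern` is made up for this example.

```python
import re

# One capturing group (the link text); the terminator is non-capturing.
link_pattern = re.compile(r'''
\[\[            # the first [[
[^:]*?          # no :s are allowed
(.*?)           # a bunch of chars (captured)
(?:             # non-capturing group
\|              # either go until a |
|\]\]           # or the last ]]
)
''', re.VERBOSE)

names = link_pattern.findall('   [[Imae|Lol]]     [[sdfef]]')
print(names)  # ['Imae', 'sdfef']
```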

I think the key bit from the findall() documentation is this:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

Your regex has a group around the pipe or closing ]] here:

(
\|              #either go until a |
|\]\]           #or the last ]]
)

finditer() doesn't appear to have any such clause.
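
The rule quoted from the documentation can be seen on a toy input. The string and patterns below are invented purely for illustration; they show the three shapes findall() can return depending on how many capturing groups the pattern has.

```python
import re

text = 'a1 b2'

# No groups: a list of whole matches.
no_groups = re.findall(r'[a-z]\d', text)
print(no_groups)   # ['a1', 'b2']

# One group: a list of that group's text only.
one_group = re.findall(r'([a-z])\d', text)
print(one_group)   # ['a', 'b']

# Two groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'([a-z])(\d)', text)
print(two_groups)  # [('a', '1'), ('b', '2')]
```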

They don't return the same thing. Some snippets from the docs:

findall returns a list of strings. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

finditer returns an iterator yielding MatchObject instances.
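
Because finditer() yields match objects, the whole-match strings can be collected even when the pattern contains a capturing group. The sketch below uses a compact, non-verbose spelling of the question's pattern (an assumption that it is equivalent to the verbose original) and compares the two calls side by side.

```python
import re

# Compact form of the question's pattern, capturing group included.
p = re.compile(r'\[\[[^:]*?.*?(\||\]\])')
text = '   [[Imae|Lol]]     [[sdfef]]'

# group() with no argument is group 0, i.e. the entire match.
whole = [m.group() for m in p.finditer(text)]
print(whole)       # ['[[Imae|', '[[sdfef]]']

# findall() on the same pattern returns only the captured group.
captured = p.findall(text)
print(captured)    # ['|', ']]']
```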

From the Python documentation:

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

Note that it says that if groups are present, a list of the group matches is returned. The capturing group at the end of your regex takes part in every match, so findall() returns only the text captured by that group for each match. With finditer(), the same information is available from each MatchObject through group(1), alongside the whole match in group(0).
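
To make that concrete, this short sketch prints the span, the whole match, and the captured group for each match object. It again assumes a compact spelling of the question's pattern; `rows` is just a helper name for the example.

```python
import re

# Compact form of the question's pattern, capturing group included.
p = re.compile(r'\[\[[^:]*?.*?(\||\]\])')

# Each match object carries the span, the whole match, and the group.
rows = [(m.span(), m.group(0), m.group(1))
        for m in p.finditer('   [[Imae|Lol]]     [[sdfef]]')]

for span, whole, captured in rows:
    print(span, whole, captured)
# (3, 10) [[Imae| |
# (20, 29) [[sdfef]] ]]
```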
