
Python 3.7.1 findall() not behaving as expected

First of all, I know that this is not the current version of Python and that the behavior of findall() was changed from 3.6. I don't believe either of those is the issue I'm experiencing, and I haven't been able to find anything about findall() that has changed since 3.7.

I have already devised a fix using sub() instead of findall(), but I'm curious why I had to in the first place.

I have a function that is supposed to check for the presence of a pattern. If found, it's supposed to verify that the pattern has been previously defined. It looks like this at present (with the fix and some debug code):

    def _verifyargs(i, end, args):
        '''verify text replacement args'''

        def _findallfix(m):
            formals.append( m.group().upper() )
            return '-xxx- '
            
        # put any formal arguments into a more convenient form for checking

        checkargs = args.keys()
        print( f'checkargs: start={i}, end= {end}, args= {checkargs}' )

        # if there aren't any formal arguments we're still checking for
        # their improper use within the definition body

        while i < end:
            i, text = SRC.fetch( i+1 )
            SRC.setmaster( i )
            formals = []
            text = re.sub( SYM.macLabel, _findallfix, text, flags=re.IGNORECASE )
#           formals = re.findall( SYM.macLabel, text, flags=re.IGNORECASE )
            print( f'line= {i}, formals= {formals}' )
            for formal in formals:
#               formal = formal.upper()
                if not formal in checkargs:
                    UM.undefined( formal )

        SRC.setmaster(end)

The pattern looks like this:

    SYM.macLabel = '[?][_A-Z]([.]?[_A-Z0-9])*'              # straight text replacement

When run against this piece of test code:

[image: part of test 100]

It produces this output:

[image: expected output (working)]

Which is fine. It's what I want. But if I comment out the fix:

    def _verifyargs(i, end, args):
        '''verify text replacement args'''

        def _findallfix(m):
            formals.append( m.group().upper() )
            return '-xxx- '
            
        # put any formal arguments into a more convenient form for checking

        checkargs = args.keys()
        print( f'checkargs: start={i}, end= {end}, args= {checkargs}' )

        # if there aren't any formal arguments we're still checking for
        # their improper use within the definition body

        while i < end:
            i, text = SRC.fetch( i+1 )
            SRC.setmaster( i )
#           formals = []
#           text = re.sub( SYM.macLabel, _findallfix, text, flags=re.IGNORECASE )
            formals = re.findall( SYM.macLabel, text, flags=re.IGNORECASE )
            print( f'line= {i}, formals= {formals}' )
            for formal in formals:
                formal = formal.upper()
                if not formal in checkargs:
                    UM.undefined( formal )

        SRC.setmaster(end)

...then the test produces this:

[image: enter image description here]

So findall() seems to be making an unexpected match, even though my understanding is that sub() and findall() should have exactly the same matching behavior.

Perhaps I'm abusing sub(). In this instance I don't care at all about the result of the substitution (I save it here only because I might want to look at it), but only that it finds the patterns I expect. Is there something I'm overlooking about the way findall() works?

TL;DR

Use (?: ... ) instead of ( ... ), because re.findall is returning the contents of the capturing group instead of the whole match.

Details

This question puzzled me for a bit, but I found the problem.

The documentation for re.findall says:

The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.
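Those rules are easy to see on a toy input. The following is only an illustrative sketch; the string and the patterns are made up for the demonstration and are not from the question:

```python
import re

text = "a1 b2 c3"  # hypothetical sample input

# No groups: a list of whole matches.
no_groups = re.findall(r"[a-z]\d", text)

# Exactly one capturing group: a list of what that group captured.
one_group = re.findall(r"[a-z](\d)", text)

# Several capturing groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"([a-z])(\d)", text)

# A non-capturing group does not change the shape of the result.
non_capturing = re.findall(r"(?:[a-z])\d", text)

print(no_groups)      # ['a1', 'b2', 'c3']
print(one_group)      # ['1', '2', '3']
print(two_groups)     # [('a', '1'), ('b', '2'), ('c', '3')]
print(non_capturing)  # ['a1', 'b2', 'c3']
```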

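Since your pattern has one set of parentheses, it has one capturing group, and the text captured by that group is what re.findall returns. Because the group is repeated with *, that text is the group's last repetition (or an empty string if the group never matched). The pattern matches what you expect; findall just doesn't return what you thought it would.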
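This also explains why your sub() workaround behaves: the callback receives a match object, and m.group() with no arguments is the whole match, regardless of any groups. A minimal sketch with your pattern, assuming a made-up source line (MAC_LABEL, the sample text, and the collect helper are stand-ins, not your actual code):

```python
import re

# Same pattern as SYM.macLabel in the question.
MAC_LABEL = '[?][_A-Z]([.]?[_A-Z0-9])*'

# Hypothetical input line; the real test source is not shown in the question.
text = 'mov ?dst, ?src.lo, ?x'

found_by_sub = []

def collect(m):
    # m.group() is group 0, i.e. the entire match.
    found_by_sub.append(m.group())
    return '-xxx- '

re.sub(MAC_LABEL, collect, text, flags=re.IGNORECASE)

# findall returns group 1 instead: the last repetition of ([.]?[_A-Z0-9]),
# or '' when the group matched zero times (as for '?x').
found_by_findall = re.findall(MAC_LABEL, text, flags=re.IGNORECASE)

print(found_by_sub)      # ['?dst', '?src.lo', '?x']
print(found_by_findall)  # ['t', 'o', '']
```

Both functions find the same three matches; they differ only in what they hand back.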

By using non-capturing parentheses, (?: ... ), you will get the results you want: the whole matches.

That is:

    SYM.macLabel = '[?][_A-Z](?:[.]?[_A-Z0-9])*'
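To check the change, here is a rough sketch of the verification loop's core with the non-capturing pattern. The sample line and the `defined` set are assumptions standing in for your SRC and args objects. It also shows re.finditer as an alternative: if you would rather leave the original pattern alone, m.group(0) gives the whole match there too.

```python
import re

ORIGINAL = '[?][_A-Z]([.]?[_A-Z0-9])*'    # pattern from the question
FIXED = '[?][_A-Z](?:[.]?[_A-Z0-9])*'     # same pattern, non-capturing group

text = 'mov ?dst, ?src.lo, ?x'            # hypothetical source line
defined = {'?DST', '?X'}                  # hypothetical stand-in for args.keys()

# With no capturing groups, findall returns the whole matches.
formals = [f.upper() for f in re.findall(FIXED, text, flags=re.IGNORECASE)]
undefined = [f for f in formals if f not in defined]

# Alternative: keep the original pattern and read group 0 from each match.
via_finditer = [m.group(0).upper()
                for m in re.finditer(ORIGINAL, text, flags=re.IGNORECASE)]

print(formals)       # ['?DST', '?SRC.LO', '?X']
print(undefined)     # ['?SRC.LO']
print(via_finditer)  # ['?DST', '?SRC.LO', '?X']
```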

