简体   繁体   中英

Python RE, problem with optional match groups

My apologies if this has been asked before. I am parsing some law numbers from the California penal code so they can be run through an existing database to return a plain-language title of the law. For example:

PC 182(A)(1); PC 25400(A)(1); PC 25850(C)(6); PC 32310; VC 12500(A); VC 22517; VC 23103(A)

Each would be split at ';'and parsed into:

{'lawType': 'PC', 'lawNumber': '182', 'subsection': 'A', 'subsubsection': '1'} Returns: Conspiracy to commit a crime

Here's my RE search:

(?P<lawType>[A-Z]{2})[ ](?P<lawNumber>[0-9.]*[A-Z]?)\((?P<subsection>[A-Z])\)?\((?P<subsubsection>[0-9])\)?

Every law should have at least the type and number (ie PC 182), but sometimes they also have the subsection and subsubsection (ie (A)(1)). Those two subgroups need to be optional, but the search above isn't picking them up using the '?'. This code works, but I'd like to make it more compact with just one search:

lineValue = 'PC 182(A)(1); PC 25400(A)(1); PC 25850(C)(6); PC 32310; VC 12500(A); VC 22517; VC 23103(A)'
#lineValue = 'PC 148(A)(1); PC 369I; PC 587C; MC 8.80.060(F)'
chargeList = map(lambda x: x.strip(), lineValue.split(';'))
for thisCharge in chargeList:
    m = re.match(r'(?P<lawType>[A-Z]{2})[ ](?P<lawNumber>[0-9.]*[A-Z]?)\((?P<subsection>[A-Z])\)\((?P<subsubsection>[0-9])\)', thisCharge)
    if m:
        detail = m.groupdict()
        print(detail)

    else:
        m = re.match(r'(?P<lawType>[A-Z]{2})[ ](?P<lawNumber>[0-9.]*[A-Z]?)\((?P<subsection>[A-Z])\)', thisCharge)
        if m:
            detail = m.groupdict()
            print(detail)

        else:
            m = re.match(r'(?P<lawType>[A-Z]{2})[ ](?P<lawNumber>[0-9.]*[A-Z]?)', thisCharge)
            if m:
                detail = m.groupdict()
                print(detail)

            else:
                print('NO MATCH: ' + str(thisCharge))

I have three different searches, which shouldn't be necessary if the '?' optional group marker were working as expected. Can anyone offer a thought?

The problem is in how you are applying the ? to make each of the subsections optional. A ? applies to just the term immediately preceding it. In your case, this is just the closing parentheses for each subsection Because of this, you are requiring the opening parentheses and the number or letter unconditionally for each subsection term. To fix this, just wrap the complete subsection terms in an extra set of parentheses, and apply the ? to those groups. This code:

import re

data = "PC 182(A)(1); PC 25400(A)(1); PC 25850(C)(6); PC 32310; VC 12500(A); VC 22517; VC 23103(A)"

exp = re.compile(r"(?P<lawType>[A-Z]{2})[ ](?P<lawNumber>[0-9.]*[A-Z]?)(?:\((?P<subsection>[A-Z])\))?(?:\((?P<subsubsection>[0-9])\))?")

def main():
    r = exp.findall(data)
    print(r)

main()

produces:

[('PC', '182', 'A', '1'), ('PC', '25400', 'A', '1'), ('PC', '25850', 'C', '6'), ('PC', '32310', '', ''), ('VC', '12500', 'A', ''), ('VC', '22517', '', ''), ('VC', '23103', 'A', '')]

Here's an example of how to use your expression to pick out the information for each law individually, making use of your group labels:

def main():
    p = 0
    while True:
        m = exp.search(data[p:])
        if not m:
            break
        print('Type:', m.group('lawType'))
        print('Number:', m.group('lawNumber'))
        if m.group('subsection'):
            print('Subsection:', m.group('subsection'))
        if m.group('subsubsection'):
            print('Subsubsection:', m.group('subsubsection'))
        print()
        p += m.end()

Result

Type: PC
Number: 182
Subsection: A
Subsubsection: 1

Type: PC
Number: 25400
Subsection: A
Subsubsection: 1

Type: PC
Number: 25850
Subsection: C
Subsubsection: 6

Type: PC
Number: 32310

Type: VC
Number: 12500
Subsection: A

Type: VC
Number: 22517

Type: VC
Number: 23103
Subsection: A

Noticing how you were pre-splitting your data before applying your regex, I thought you might want to see how I would process each term using only regex matching.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM