
Get all unnamed groups in a Python match object

I have a number of related regular expressions that use both named and unnamed groups. I want to pass the unnamed groups as positional arguments to a function chosen by the named group.

For example, with the pattern ([abc]+)([123]+)(?P<end>[%#]) matching the string "aaba2321%", I want to get a list containing ["aaba", "2321"], but not "%".

I tried the following:

match_obj.groups()

under the assumption that it wouldn't capture the named groups as there is a separate method, groupdict , for getting only the named groups. Unfortunately, groups included named groups.
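A minimal reproduction of that behaviour, assuming match_obj comes from re.match on the example pattern above:

```python
import re

# Example pattern and string from the question
match_obj = re.match(r'([abc]+)([123]+)(?P<end>[%#])', "aaba2321%")

# groups() returns every capturing group, named or not
print(match_obj.groups())     # ('aaba', '2321', '%')

# groupdict() returns only the named groups
print(match_obj.groupdict())  # {'end': '%'}
```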

Then, I decided to write my own generator for it:

def get_unnamed_groups(match_obj):
    index = 1
    while True:
        try: yield match_obj.group(index)
        except IndexError: break
        index += 1

Unfortunately the named group can also be accessed as a numbered group. How do I get the numbered groups alone?

There is a somewhat horrible way to do what you're asking for. It involves indexing all groups by their span (start and end indices) and removing the ones that occur in both groupdict and groups:

import re

mo = re.match(r'([abc]+)([123]+)(?P<end>[%#])', "aaba2321%")

named = {}
unnamed = {}

# Index every named group by its span
for name, value in mo.groupdict().items():
    named[mo.span(name)] = value

# Index every other group by its span, skipping groups with the same
# span as a named group
for i, value in enumerate(mo.groups(), start=1):
    sp = mo.span(i)
    if sp not in named:
        unnamed[sp] = value

print(named)    # {(8, 9): '%'}
print(unnamed)  # {(0, 4): 'aaba', (4, 8): '2321'}

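Indexing by span is necessary because unnamed and named groups can have the same value. A group's span (where it starts and ends) tells apart groups that matched the same text at different positions, so this code works even when groups share a value. It does not tell apart groups with identical spans, though: an unnamed group that wraps nothing but a named group is skipped along with it. Here is a demo: http://ideone.com/9O7Hpb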
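One caveat of the span approach is worth checking by hand. The sketch below is a condensed version of the code above (the helper name unnamed_by_span and the test pattern are made up for illustration); it shows an unnamed group being dropped because it covers exactly the same span as a named group nested inside it:

```python
import re

def unnamed_by_span(mo):
    """Condensed span-based filter: keep groups whose span is not a named group's span."""
    named_spans = {mo.span(name) for name in mo.groupdict()}
    return [
        value
        for i, value in enumerate(mo.groups(), start=1)
        if mo.span(i) not in named_spans
    ]

# Group 1 is unnamed but wraps only the named group "first",
# so both groups have the span (0, 9).
mo = re.match(r'((?P<first>\w+)) (\w+)', "Pinkamena Pie")

print(mo.groups())          # ('Pinkamena', 'Pinkamena', 'Pie')
print(unnamed_by_span(mo))  # ['Pie'], unnamed group 1 is lost
```

If your patterns never nest groups this way, the span approach is unaffected.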

Another way to do it would be to write a function that transforms a regex of the form shown in your question into one where every formerly unnamed group is named with some prefix and a number. You could then match against this regex and pick out of groupdict the groups whose names start with the prefix.
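A rough sketch of that transformation might look like the following. The function names (name_unnamed_groups, unnamed_groups) and the "_unnamed_" prefix are invented for this example. It assumes the pattern is not written in verbose mode (a "(" inside a verbose-mode comment would be misread) and that no existing group name starts with the prefix. It handles escapes and character classes, and leaves anything starting with "(?" alone.

```python
import re

def name_unnamed_groups(pattern, prefix="_unnamed_"):
    """Return pattern with each unnamed capturing group renamed to prefix + counter."""
    out = []
    count = 0
    i = 0
    in_class = False  # True while inside a [...] character class
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            # Copy an escape sequence (backslash plus next character) untouched
            out.append(pattern[i:i + 2])
            i += 2
        elif in_class:
            # Parentheses inside a character class are literals
            in_class = ch != "]"
            out.append(ch)
            i += 1
        elif ch == "[":
            # Consume '[', an optional '^', and a leading literal ']'
            j = i + 1
            if pattern[j:j + 1] == "^":
                j += 1
            if pattern[j:j + 1] == "]":
                j += 1
            out.append(pattern[i:j])
            in_class = True
            i = j
        elif ch == "(" and pattern[i + 1:i + 2] != "?":
            # A plain '(' opens an unnamed capturing group: give it a name
            count += 1
            out.append(f"(?P<{prefix}{count}>")
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

def unnamed_groups(pattern, string, prefix="_unnamed_"):
    """Match string and return the originally unnamed groups in pattern order."""
    regex = re.compile(name_unnamed_groups(pattern, prefix))
    match = regex.match(string)
    if match is None:
        return None
    names = sorted(
        (name for name in regex.groupindex if name.startswith(prefix)),
        key=lambda name: regex.groupindex[name],
    )
    return [match.group(name) for name in names]

print(name_unnamed_groups(r'([abc]+)([123]+)(?P<end>[%#])'))
# (?P<_unnamed_1>[abc]+)(?P<_unnamed_2>[123]+)(?P<end>[%#])
print(unnamed_groups(r'([abc]+)([123]+)(?P<end>[%#])', "aaba2321%"))
# ['aaba', '2321']
```

Because naming a group does not change its number, numeric backreferences in the original pattern keep working after the rewrite.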

Here is a cleaner version, using the groupindex attribute of the compiled pattern (match.re.groupindex), documented as:

A dictionary mapping any symbolic group names defined by (?P<id>) to group numbers.
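As a quick illustration of what that dictionary holds, here is the pattern from the question (the variable names are only for this example). Together with the pattern's groups attribute, which is the total number of capturing groups, it is enough to work out which group numbers are unnamed:

```python
import re

pattern = re.compile(r'([abc]+)([123]+)(?P<end>[%#])')

# Maps each group name to its group number
print(dict(pattern.groupindex))  # {'end': 3}

# Total number of capturing groups, named or not
print(pattern.groups)            # 3

# Every group number that does not belong to a named group
named_indexes = set(pattern.groupindex.values())
unnamed_indexes = [i for i in range(1, pattern.groups + 1) if i not in named_indexes]
print(unnamed_indexes)           # [1, 2]
```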


TL;DR: Short copy & paste function:

import re

def grouplist(match):
    named = match.groupdict()
    ignored_groups = set()
    for name, index in match.re.groupindex.items():
        if name in named:  # sanity check; groupdict() contains every name in groupindex
            ignored_groups.add(index)
    return [group for i, group in enumerate(match.groups()) if i+1 not in ignored_groups]


m = re.match(r'([abc]+)([123]+)(?P<end>[%#])', "aaba2321%")

unnamed = grouplist(m)
print(unnamed)  # ['aaba', '2321']
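To tie this back to the goal in the question (choosing a function by the named group and passing it the unnamed groups positionally), a dispatch table is one possible shape. The handler functions, the HANDLERS table and dispatch are hypothetical names for this sketch, and grouplist is restated in compact form so the block runs on its own:

```python
import re

def grouplist(match):
    """Compact restatement: all groups whose number is not in groupindex."""
    named_indexes = set(match.re.groupindex.values())
    return [g for i, g in enumerate(match.groups(), start=1) if i not in named_indexes]

# Hypothetical handlers, one per value the named group "end" can take
def handle_percent(letters, digits):
    return f"percent: {letters} / {digits}"

def handle_hash(letters, digits):
    return f"hash: {letters} / {digits}"

HANDLERS = {"%": handle_percent, "#": handle_hash}

def dispatch(pattern, text):
    m = re.match(pattern, text)
    if m is None:
        raise ValueError(f"no match for {text!r}")
    handler = HANDLERS[m.group("end")]  # function chosen by the named group
    return handler(*grouplist(m))       # unnamed groups as positional arguments

pattern = r'([abc]+)([123]+)(?P<end>[%#])'
result = dispatch(pattern, "aaba2321%")
print(result)  # percent: aaba / 2321
```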

Full example

With groupindex we get the indexes of the named matches, and can exclude them when building our final list of groups, called unnamed in the code below:

import re

# ===================================================================================
# These are the groups as Python numbers them (by opening parenthesis):
# ===================================================================================
regex = re.compile(r"(((?P<first_name>\w+)) (?P<middle_name>\w+)) (?P<last_name>\w+)")
#                    |-------------------- #1 ------------------|
#                     |------- #2 --------|
#                      |------ #3 -------|
#                                           |------- #4 -------|
#                                                                 |------ #5 ------|
# ===================================================================================
# But we want to tell unnamed and named groups apart (regex line is identical;
# the real group number is shown in parentheses):
# ===================================================================================
regex = re.compile(r"(((?P<first_name>\w+)) (?P<middle_name>\w+)) (?P<last_name>\w+)")
#                    |---------------- #1 (#1) -----------------|
#                     |----- #2 (#2) -----|
#                      | first_name (#3) |
#                                           | middle_name (#4) |
#                                                                 | last_name (#5) |

m = regex.match("Pinkamena Diane Pie")

These are the values we are working with, for your convenience:

assert list(m.groups()) == [
    'Pinkamena Diane',  # group #1
    'Pinkamena',        # group #2
    'Pinkamena',        # group #3 (first_name)
    'Diane',            # group #4 (middle_name)
    'Pie',              # group #5 (last_name)
]

assert dict(m.groupdict()) == {
    'first_name':  'Pinkamena',  # group #3
    'middle_name': 'Diane',      # group #4
    'last_name':   'Pie',        # group #5
}

assert dict(m.re.groupindex) == {
    'first_name':  3,  # Pinkamena
    'middle_name': 4,  # Diane
    'last_name':   5,  # Pie
}

Therefore we can now store the indices of those named groups in an ignored_groups set, and omit those groups when filling unnamed from m.groups():

named = m.groupdict()
ignored_groups = set()
for name, index in m.re.groupindex.items():
    if name in named:  # sanity check; groupdict() contains every name in groupindex
        ignored_groups.add(index)

unnamed = [group for i, group in enumerate(m.groups()) if i+1 not in ignored_groups]

print(unnamed)
print(named)

So in the end we get:

# unnamed = grouplist(m)
assert unnamed == [
    'Pinkamena Diane',  # unnamed #1 (group #1)
    'Pinkamena',        # unnamed #2 (group #2)
]

# named = m.groupdict()
assert named == {
    'first_name':  'Pinkamena',  # group #3
    'middle_name': 'Diane',      # group #4
    'last_name':   'Pie',        # group #5
}

Try the example yourself: https://ideone.com/pDMjpP


 