
Regex match in Python

I have a regex like this

r"^(.*?),(.*?)(,.*?=.*)"

And a string like this

name1,value1,tag11=value11,tag12=value12,tag13=value13

I am trying to check, using a regex, whether the string follows this format: a name and a value, followed by name=value pairs, all separated by commas.

I need then to extract the comma-separated data using a regex.

I get name1 as the first group and value1 as the second group, but the third group matches everything from tag11 through value13 (due to the greedy match).

But I want to match each name and value pair separately. I am new to Python and not sure how I can achieve this.

Why not just split by the commas:

s = 'name1,value1,tag11=value11,tag12=value12,tag13=value13'
print(s.split(','))

If you want to use regex it's just as simple using the pattern:

[^,]+

Example:

https://regex101.com/r/jS6fgW/1
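As a minimal sketch of how that pattern could be used from Python: re.findall returns every non-overlapping match, so each run of non-comma characters comes back as one field. The second step (splitting fields on '=') is an addition for illustration, not part of the pattern above.

```python
import re

s = 'name1,value1,tag11=value11,tag12=value12,tag13=value13'

# Each match is a maximal run of non-comma characters, i.e. one field.
fields = re.findall(r'[^,]+', s)
print(fields)
# ['name1', 'value1', 'tag11=value11', 'tag12=value12', 'tag13=value13']

# Fields that contain '=' can be split once more into (tag, value) tuples.
pairs = [tuple(f.split('=', 1)) for f in fields if '=' in f]
print(pairs)
```

Note that neither this nor str.split checks the format; both simply cut the string at commas.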

Turns out Python doesn't support repeated named capture groups unlike .NET, which is a bit of a shame (means my solution is a little longer than I thought it'd need to be). Does this meet your requirements?

import re

def is_valid(s):
    # Raw string so that \d reaches the regex engine unchanged.
    pattern = r'^name\d+,value\d+(,tag\d+=value\d+)*$'
    return re.match(pattern, s)

def get_name_value_pairs(s):
    if not is_valid(s):
        raise ValueError('Invalid input: {}'.format(s))

    # First alternative matches the leading "name,value"; second matches each "tag=value".
    pattern = r'((?P<name1>\w+),(?P<value1>\w+))|(?P<name2>\w+)=(?P<value2>\w+)'
    for match in re.finditer(pattern, s):
        name1 = match.group('name1')
        name2 = match.group('name2')
        value1 = match.group('value1')
        value2 = match.group('value2')

        if name1 and value1:
            yield name1, value1
        elif name2 and value2:
            yield name2, value2

if __name__ == '__main__':
    testString = 'name1,value1,tag11=value11,tag12=value12,tag13=value13'
    assert not is_valid('')
    assert not is_valid('foo')
    assert is_valid(testString)

    print(list(get_name_value_pairs(testString)))

Output

[('name1', 'value1'), ('tag11', 'value11'), ('tag12', 'value12'), ('tag13', 'value13')]

Edit 1

Added input validation logic. Assumptions made:

  • Must have initial name/value pair in form name<x>,value<x>
  • All following pairs must be in form tag<x>=value<x>
  • Names and values consist only of alphanumeric characters
  • Whitespace is not allowed

Note that I'm not currently validating that x is the same value within a name/value pair, which I assume is a requirement. I'm not sure how to do this, so I'm leaving it as an exercise for the reader.
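One possible approach to that check, offered as a sketch rather than a definitive answer, is a backreference: capture the digits after name or tag and require the same digits after value. This assumes the requirement really is "same number on both sides of a pair"; the names PAIRED and is_valid_paired are made up for this example.

```python
import re

# Assumption: the number after "name"/"tag" must equal the number after "value"
# in the same pair. \1 and \2 refer back to the digits just captured.
PAIRED = re.compile(r'name(\d+),value\1(?:,tag(\d+)=value\2)*')

def is_valid_paired(s):
    # fullmatch requires the whole string to fit the pattern.
    return PAIRED.fullmatch(s) is not None

print(is_valid_paired('name1,value1,tag11=value11,tag12=value12'))  # True
print(is_valid_paired('name1,value1,tag11=value12'))                # False
```

Inside the repeated group, \2 is compared against the digits captured in the current repetition, so each tag/value pair is checked on its own.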

First, validate the format according to your pattern, then split with the [,=] regex (which matches , and =) and convert the result to a dictionary like this:

import itertools, re
s = 'name1,value1,tag11=value11,tag12=value12,tag13=value13'
if re.match(r'[^,=]+,[^,=]+(?:,[^,=]+=[^,=]+)+$', s):
    l = re.split("[=,]", s)
    # Python 3 name; on Python 2 this function is itertools.izip_longest.
    d = dict(itertools.zip_longest(*[iter(l)] * 2, fillvalue=""))
    print(d)
else:
    print("Not valid!")

See the Python demo

The pattern is

^[^,=]+,[^,=]+(?:,[^,=]+=[^,=]+)+$

Details:

  • ^ - start of string (with re.match this can be omitted, since re.match only matches at the start of the string)
  • [^,=]+ - 1+ chars other than = and ,
  • , - a comma
  • [^,=]+ - 1+ chars other than = and ,
  • (?:,[^,=]+=[^,=]+)+ - 1 or more sequences of:
    • , - comma
    • [^,=]+ - 1+ chars other than = and ,
    • = - an equal sign
    • [^,=]+ - 1+ chars other than = and ,
  • $ - end of string.
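To make the breakdown above concrete, here is a small sketch that wraps the same pattern in a function and tries it on a few inputs. The function name parse and the sample strings are made up for illustration. It uses re.fullmatch in place of ^ and $, which is slightly stricter: $ also matches just before a trailing newline, while fullmatch does not.

```python
import re

# Same pattern as above, without ^ and $ because fullmatch anchors both ends.
PATTERN = re.compile(r'[^,=]+,[^,=]+(?:,[^,=]+=[^,=]+)+')

def parse(s):
    """Return a dict of name/value pairs, or None if s does not fit the format."""
    if PATTERN.fullmatch(s) is None:
        return None
    tokens = re.split(r'[=,]', s)
    it = iter(tokens)
    # Zipping one iterator with itself pairs up consecutive tokens.
    return dict(zip(it, it))

print(parse('name1,value1,tag11=value11,tag12=value12'))
print(parse('name1,value1'))             # None: the final + needs at least one tag=value
print(parse('name1,,tag11=value11'))     # None: [^,=]+ needs at least one character
```

If a string with no tag=value pairs should count as valid, the final + in the pattern could be changed to *.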
