
Regex match in Python

I have a regex like this:

r"^(.*?),(.*?)(,.*?=.*)"

And a string like this:

name1,value1,tag11=value11,tag12=value12,tag13=value13

I am trying to check, using a regex, whether the string follows this format: name,value followed by name and value pairs separated by commas.

I then need to extract the comma-separated data using a regex.

The extracted data comes out as name1 in the first group and value1 in the second, while the third group matches everything from tag11 to value13 (because of the greedy match).

But I want to match each name and value pair. I am new to Python and not sure how I can achieve this.
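To make the behaviour described in the question concrete, here is a minimal sketch that runs the original pattern against the sample string and prints what each group captures. The variable names are illustrative and not part of the original post.

```python
import re

pattern = r"^(.*?),(.*?)(,.*?=.*)"
s = "name1,value1,tag11=value11,tag12=value12,tag13=value13"

m = re.match(pattern, s)
groups = m.groups()

# Groups 1 and 2 are the lazy captures before the first two commas.
# Group 3 swallows every remaining tag=value pair in one piece,
# because the trailing `.*` is greedy and runs to the end of the string.
for index, text in enumerate(groups, start=1):
    print(index, repr(text))
```

The third group holds all the tag pairs as a single string, which is why a second step (splitting or iterating over matches) is needed to get the individual pairs.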

Why not just split on the commas:

s = 'name1,value1,tag11=value11,tag12=value12,tag13=value13'
print(s.split(','))

If you want to use a regex, it's just as simple with this pattern:

[^,]+

Example:

https://regex101.com/r/jS6fgW/1
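As a rough sketch of how the `[^,]+` pattern could be used from Python (the answer itself only links to regex101), `re.findall` returns every run of non-comma characters. The variable name `fields` is an assumption made for illustration.

```python
import re

s = 'name1,value1,tag11=value11,tag12=value12,tag13=value13'

# [^,]+ matches each maximal run of characters that are not commas.
fields = re.findall(r'[^,]+', s)
print(fields)

# For this input the result is the same as s.split(',').
# The two differ on empty fields: split keeps them, findall skips them.
print(re.findall(r'[^,]+', 'a,,b'))
print('a,,b'.split(','))
```

Note that neither approach separates the tag=value pairs at the equals sign; each such field would still need a further split on `=`.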

It turns out that Python, unlike .NET, doesn't support repeated named capture groups (a repeated group keeps only its last match), which is a bit of a shame; it means my solution is a little longer than I thought it would need to be. Does this meet your requirements?

import re

def is_valid(s):
    pattern = r'^name\d+,value\d+(,tag\d+=value\d+)*$'  # raw string so \d is not treated as a string escape
    return re.match(pattern, s)

def get_name_value_pairs(s):
    if not is_valid(s):
        raise ValueError('Invalid input: {}'.format(s))

    pattern = r'((?P<name1>\w+),(?P<value1>\w+))|(?P<name2>\w+)=(?P<value2>\w+)'
    for match in re.finditer(pattern, s):
        name1 = match.group('name1')
        name2 = match.group('name2')
        value1 = match.group('value1')
        value2 = match.group('value2')

        if name1 and value1:
            yield name1, value1
        elif name2 and value2:
            yield name2, value2

if __name__ == '__main__':
    testString = 'name1,value1,tag11=value11,tag12=value12,tag13=value13'
    assert not is_valid('')
    assert not is_valid('foo')
    assert is_valid(testString)

    print(list(get_name_value_pairs(testString)))

Output

[('name1', 'value1'), ('tag11', 'value11'), ('tag12', 'value12'), ('tag13', 'value13')]
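The remark about repeated capture groups can be illustrated with a small sketch. The shortened input string and the group names `name` and `value` are assumptions chosen for the demonstration, not taken from the answer.

```python
import re

s = 'tag11=value11,tag12=value12'

# A group inside a repetition retains only its final iteration in Python's re module.
repeated = re.match(r'(?:(?P<name>\w+)=(?P<value>\w+),?)+', s)
last_pair = (repeated.group('name'), repeated.group('value'))

# Iterating over individual matches recovers every pair instead.
all_pairs = [(m.group('name'), m.group('value'))
             for m in re.finditer(r'(?P<name>\w+)=(?P<value>\w+)', s)]

print(last_pair)
print(all_pairs)
```

This is why the solution above validates with one pattern and then walks the string with `re.finditer` rather than reading all pairs out of a single match.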

Edit 1

Added input validation logic. Assumptions made:

  • There must be an initial name/value pair in the form name<x>,value<x>
  • All following pairs must be in the form tag<x>=value<x>
  • Names and values consist only of alphanumeric characters
  • Whitespace is not allowed

Note that I'm not currently validating that x is the same value within a name/value pair, which I assume is a requirement. I'm not sure how to do this, so I'm leaving it as an exercise for the reader.
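One possible way to approach that exercise is with backreferences. This is only a sketch, assuming that "the same x" means the digits after name (or tag) must reappear unchanged after value; `STRICT_PATTERN` and `is_strictly_valid` are hypothetical names, not part of the answer above.

```python
import re

# \1 must repeat the digits captured after "name".
# \2 must repeat the digits captured after "tag" within the same repetition.
STRICT_PATTERN = re.compile(r'name(\d+),value\1(?:,tag(\d+)=value\2)*')

def is_strictly_valid(s):
    """Return True if every pair uses the same number on both sides."""
    return STRICT_PATTERN.fullmatch(s) is not None

print(is_strictly_valid('name1,value1,tag11=value11,tag12=value12'))
print(is_strictly_valid('name1,value2'))
print(is_strictly_valid('name1,value1,tag11=value12'))
```

`fullmatch` is used here in place of the `^` and `$` anchors so that the whole string has to fit the pattern.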

First, validate the format according to your pattern, then split with the [,=] regex (which matches , and =) and convert the result to a dictionary like this:

import itertools, re
s = 'name1,value1,tag11=value11,tag12=value12,tag13=value13'
if re.match(r'[^,=]+,[^,=]+(?:,[^,=]+=[^,=]+)+$', s):
    l = re.split("[=,]", s)
    d = dict(itertools.zip_longest(*[iter(l)] * 2, fillvalue=""))  # Python 3; use itertools.izip_longest in Python 2
    print(d)
else:
    print("Not valid!")

See the Python demo.

The pattern is

^[^,=]+,[^,=]+(?:,[^,=]+=[^,=]+)+$

Details:

  • ^ - start of string (with re.match this can be omitted, since re.match only matches at the start of the string)
  • [^,=]+ - one or more characters other than = and ,
  • , - a comma
  • [^,=]+ - one or more characters other than = and ,
  • (?:,[^,=]+=[^,=]+)+ - one or more sequences of:
    • , - a comma
    • [^,=]+ - one or more characters other than = and ,
    • = - an equals sign
    • [^,=]+ - one or more characters other than = and ,
  • $ - end of string.
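The validate-then-split idea can also be wrapped in a small function. The following is a sketch rather than the answer's exact code: `VALID` and `parse_pairs` are hypothetical names, it pairs tokens with `zip` over slices instead of `zip_longest`, and it uses `fullmatch` in place of the `^` and `$` anchors.

```python
import re

# Same structure as the pattern described above, without explicit anchors
# because fullmatch requires the whole string to match.
VALID = re.compile(r'[^,=]+,[^,=]+(?:,[^,=]+=[^,=]+)+')

def parse_pairs(s):
    """Return a dict of name/value pairs, or raise ValueError on bad input."""
    if VALID.fullmatch(s) is None:
        raise ValueError('Invalid input: {!r}'.format(s))
    tokens = re.split(r'[,=]', s)
    # Validation guarantees an even number of tokens: name, value, name, value, ...
    return dict(zip(tokens[::2], tokens[1::2]))

print(parse_pairs('name1,value1,tag11=value11,tag12=value12,tag13=value13'))
```

Because the pattern requires at least one tag=value pair after the initial name,value, a string such as 'name1,value1' on its own is rejected.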
