Python Regex - Match multiple expression with groups

I have a string:

property1=1234, property2=102.201.333, property3=abc

I want to capture 1234 and 102.201.333. I am trying to use the regex:

property1=([^,]*)|property2=([^,]*)

But it only manages to capture one of the values. Based on this link I also tried:

((?:property1=([^,]*)|property2=([^,])+)
(?:(property1=([^,]*)|property2=([^,])+)

They capture an extra group from somewhere I can't figure out.

What am I missing?

PS: I am using re.search().

Edit: There may be something wrong in my calling code:

import re
m = re.search(r'property1=([^,]*)|property2=([^,]*)', text)
print(m.groups())

Edit 2: It doesn't have to be propertyX. It can be anything:

foo1=123, bar=101.2.3, foobar=abc

or even

foo1=123, bar=weirdbar[345], foobar=abc

As an alternative, we could use some string splitting to create a dictionary.

text = "property1=1234, property2=102.201.333, property3=abc"
# Split on the first '=' only, so a value may itself contain '='.
data = dict(p.split('=', 1) for p in text.split(', '))
print(data["property2"])  # 102.201.333
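The one-liner above relies on every pair being separated by exactly a comma and one space. A slightly more tolerant sketch follows; the helper name `parse_pairs` is hypothetical, and it assumes that values never contain commas and that surrounding whitespace is not significant.

```python
def parse_pairs(text):
    """Split 'key=value, key=value' text into a dict (hypothetical helper)."""
    pairs = {}
    for item in text.split(','):
        # partition splits on the first '=' only and never raises
        key, sep, value = item.partition('=')
        if sep:  # skip fragments that contain no '=' at all
            pairs[key.strip()] = value.strip()
    return pairs


result = parse_pairs("foo1=123, bar=weirdbar[345], foobar=abc")
print(result["bar"])  # weirdbar[345]
```

Using `partition` rather than `split` means empty fragments (for example from a trailing comma) are skipped instead of raising an error.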

Regular expressions are great for things that act like lexemes, not so good for general-purpose parsing.

In this case, though, it looks like your "configuration-y string" may consist solely of a sequence of lexemes of the form: word = value [ , word = value ... ]. If so, you can use a regexp and repetition. The right regexp depends on the exact form of word and value, though (and to a lesser extent, on whether you want to check for errors). For instance, is:

this="a string with spaces", that = 42, quote mark = "

allowed, or not? If so, is this set to a string with spaces (no quotes) or "a string with spaces" (includes quotes)? Is that set to " 42" (which has a leading blank) or just "42" (which does not)? Is quote mark (which has embedded spaces) allowed, and is it set to one double-quote mark? Do double quotes, if present, "escape" commas, so that you can write:

greeting="Hello, world."

Assuming spaces are forbidden, and the word and value parts are simply "alphanumerics as matched by \w":

for word, value in re.findall(r'([\w]+)=([\w]+)', string):
    print(word, value)

It's clear from the 102.201.333 value that \w is not sufficient for the value match, though. If value is "everything not a comma" (which includes whitespace), then:

for word, value in re.findall(r'([\w]+)=([^,]+)', string):
    print(word, value)

gets closer. These all ignore "junk" and disallow spaces around the = sign. If string is "$a=this, b = that, c=102.201.333,,", the second for loop prints:

a this
c 102.201.333

The dollar sign (not an alphanumeric character) is ignored, the value for b is ignored due to the whitespace around its = sign, and the two commas after the value for c are also ignored.
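If the answer to the quoting question is "yes, double quotes escape commas", and spaces around the = sign are allowed, one possible pattern is sketched below. This is only an illustration under those assumptions; the variable names are made up for the example, and the quotes are kept as part of the captured value.

```python
import re

# Assumptions: a word is \w+, optional whitespace may surround '=',
# and a value is either a double-quoted run (which may contain commas)
# or a run of non-comma characters.
pair_regex = re.compile(r'(\w+)\s*=\s*("[^"]*"|[^,]+)')

sample = 'greeting="Hello, world.", that = 42'
pairs = pair_regex.findall(sample)

for word, value in pairs:
    print(word, value)
# greeting "Hello, world."
# that 42
```

The quoted alternative is listed first so that a value starting with a double quote is consumed up to its closing quote before the "everything not a comma" branch is tried.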

You're using a |. That means your regex will match either the thing on the left of the bar, or the thing on the right.
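A short sketch of that behaviour, using the string and pattern from the question (the variable names are only for illustration): re.search() stops at the first match, so only one of the two groups is ever filled, while re.findall() keeps scanning and returns one tuple per match.

```python
import re

text = "property1=1234, property2=102.201.333, property3=abc"
pattern = r'property1=([^,]*)|property2=([^,]*)'

# re.search returns the first match only: the left alternative matched,
# so the group belonging to the right alternative is None.
first = re.search(pattern, text).groups()

# re.findall continues through the string; each match fills one group
# and leaves the other as an empty string.
all_matches = re.findall(pattern, text)

# Keep whichever group took part in each match.
values = [g1 or g2 for g1, g2 in all_matches]

print(first)        # ('1234', None)
print(all_matches)  # [('1234', ''), ('', '102.201.333')]
print(values)       # ['1234', '102.201.333']
```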

You could try:

property_regex = re.compile(r'property[0-9]+=(?P<property_value>[^\s]+)')

That would match the value of any property after the equals sign, up to the next whitespace. It would be accessible by the name property_value, just like the documentation says:

Copied from the Python re documentation:

For example, if the pattern is (?P<id>[a-zA-Z_]\w*), the group can be referenced by its name in arguments to methods of match objects, such as m.group('id') or m.end('id'), and also by name in the regular expression itself (using (?P=id)) and replacement text given to .sub() (using \g<id>).
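As a rough illustration of named groups on the question's string: with [^\s]+ the captured value runs up to the next whitespace, so it includes the trailing comma. The variant below adds a second named group, name, and excludes commas from the value; both of those are assumptions made for this sketch rather than part of the answer above.

```python
import re

text = "property1=1234, property2=102.201.333, property3=abc"

# The pattern from the answer: the value stops at whitespace only.
original = re.compile(r'property[0-9]+=(?P<property_value>[^\s]+)')
with_comma = original.search(text).group('property_value')

# Variant (assumption: values contain neither whitespace nor commas).
property_regex = re.compile(
    r'(?P<name>property[0-9]+)=(?P<property_value>[^\s,]+)'
)
values = {
    m.group('name'): m.group('property_value')
    for m in property_regex.finditer(text)
}

print(with_comma)            # 1234,
print(values['property2'])   # 102.201.333
```

re.finditer() is used instead of re.search() because it yields a match object for every property, not only the first one.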

Try this:

property_regex = re.compile(r'property[0-9]+=([^\s]+)')

I have tried building a regular expression for you which will give you the values after property1= and property2=, but I am not sure how you use them in Python.

Edit

It now captures other stuff apart from property before the '=' sign.

This is my original regular expression which does capture the value.

(?<=[\w]=).*?[^,]+

and this is a variation of the above, which IMO is what you would need to use in Python:

/(?<=[\w]=).*?[^,]+/g
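The /.../g delimiters are JavaScript-style syntax rather than Python syntax. In Python, the closest equivalent is probably to pass the bare pattern to re.findall(), which already returns every match. A minimal sketch, with variable names chosen only for illustration:

```python
import re

# Same pattern as above, without the /.../g wrapper.
value_regex = re.compile(r'(?<=[\w]=).*?[^,]+')

values = value_regex.findall(
    "property1=1234, property2=102.201.333, property3=abc"
)
other = value_regex.findall("foo1=123, bar=weirdbar[345], foobar=abc")

print(values)  # ['1234', '102.201.333', 'abc']
print(other)   # ['123', 'weirdbar[345]', 'abc']
```

Because the lookbehind only requires a word character followed by =, this returns every value in the string, including property3, so the caller still has to pick out the ones it wants.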
