Regex to parse delimited string with key/value pairs (python)

I have data in text format, where key/value pairs are separated by a semicolon that may or may not be surrounded by whitespace, e.g. ";" or "; ", or even " ; ". There will always be a semicolon between pairs, and the string is terminated with a semicolon.

Keys and values are separated by whitespace.

This string is flat. There's never anything nested. Strings are always quoted and numerical values are never quoted. I can count on this being consistent in the input. So for example,

'cheese "stilton";pigeons 17; color "blue"; why "because I said so";'

Ultimately this winds up as

{'cheese': "stilton", 'pigeons': 17, 'color': "blue", 'why': "because I said so"}

Different strings may include different key/value pairs, and I can't know in advance which keys will be present. So this is an equally valid input string:

mass 6.02 ; mammal "gerbil";telephone "+1 903 555-1212"; size "A1";

I'm thinking that a regex to split the string into a list would be a good start, then I can just iterate through the list by twos to build the dictionary. Something like

x = PATTERN.split(s)
d = {}
for i in range(0, len(x), 2):
    d[x[i]] = x[i+1]

Which requires a list like ['cheese', 'stilton', 'pigeons', 17, 'color', 'blue', 'why', 'because I said so']. But I can't figure out a regex that gets the data into this form. The closest I have is

([^;[\s]*]+)

Which returns

['', 'cheese', ' ', '"stilton"', ';', 'pigeons', ' ', '17', '; ', 'color', ' ', '"blue"', '; ', 'why', ' ', '"because', ' ', 'I', ' ', 'said', ' ', 'so"', ';']

Of course, it's easy enough to iterate by threes, pick out the key/value pairs, and ignore the captured delimiters, but I'm wondering if there's a different regex that would not capture the delimiters. Any suggestions?

It might be easier to use findall() instead of split() here. This will let you use capture groups to pull out just the parts you want. Then you can post-process the groups, clean up, etc.:

import re
s = 'cheese "stilton";pigeons 17; color "blue"; why "because I said so";'
pairs = re.findall(r'(\S+?) (.+?);', s)

d = {}
for k, v in pairs:
    if v.isdigit():
        # unquoted run of digits: store as an integer
        v = int(v)
    else:
        # anything else: drop the surrounding double quotes
        v = v.strip('"')
    d[k] = v
print(d)

Result:

{'cheese': 'stilton',
 'pigeons': 17,
 'color': 'blue',
 'why': 'because I said so'}

This, of course, assumes you aren't using ; anywhere in the data.
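The isdigit() check above only recognizes non-negative integers, so a value such as 6.02 from the question's second sample string would be left as a string. Here is a minimal sketch of one way to extend the conversion, assuming unquoted values are always valid int or float literals. The helper name convert is made up for illustration and is not part of the original answer.

```python
import re

def convert(raw):
    """Turn one raw value into a str, int, or float (hypothetical helper)."""
    raw = raw.strip()             # the lazy match may keep a trailing space
    if raw.startswith('"') and raw.endswith('"'):
        return raw[1:-1]          # quoted values are strings
    try:
        return int(raw)           # e.g. 17
    except ValueError:
        return float(raw)         # e.g. 6.02; still raises on anything else

s = 'mass 6.02 ; mammal "gerbil";telephone "+1 903 555-1212"; size "A1";'
pairs = re.findall(r'(\S+?) (.+?);', s)
d = {k: convert(v) for k, v in pairs}
print(d)
```

Letting float() raise on an unexpected token is a deliberate choice in this sketch: the question states the input is consistent, so a failure here would point to malformed data rather than something to paper over.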

You may use

r'(\w+)\s+("[^"]*"|[^\s;]+)'

to match and extract your data with re.findall, then post-process the Group 2 values to remove one leading and one trailing " character if the first alternative matched, and finally create a dictionary entry.

See the regex demo.

Details

  • (\w+) - Group 1 (key): one or more word chars
  • \s+ - 1+ whitespace chars
  • ("[^"]*"|[^\s;]+) - Group 2 (value): either a ", 0+ chars other than ", and then a "; or 1 or more chars other than whitespace and ;
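The breakdown above can also be written into the pattern itself. As a sketch, the same regex is compiled here with re.VERBOSE, which ignores whitespace in the pattern and allows # comments; the short sample string is cut down from the question's input for illustration.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each part is labeled.
rx = re.compile(r'''
    (\w+)          # Group 1 (key): one or more word characters
    \s+            # one or more whitespace characters between key and value
    (              # Group 2 (value), either:
        "[^"]*"    #   a double-quoted string with no embedded quotes
      |            #   or
        [^\s;]+    #   a run of characters that are neither whitespace nor semicolons
    )
''', re.VERBOSE)

matches = rx.findall('cheese "stilton";pigeons 17;')
print(matches)  # [('cheese', '"stilton"'), ('pigeons', '17')]
```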

Python demo:

import re
rx = r'(\w+)\s+("[^"]*"|[^\s;]+)'
s = 'cheese "stilton";pigeons 17; color "blue"; why "because I said so";'
result = {}
for key,val in re.findall(rx, s):
    if val.startswith('"') and val.endswith('"'):
        val = val[1:-1]
    result[key]=val

print(result)
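The demo above leaves 17 as the string '17', whereas the question asks for an integer. One possible way to convert the values, offered as a sketch rather than as part of the original answer, is ast.literal_eval from the standard library: it parses a double-quoted string into a str and a numeric literal into an int or float. This assumes every value is a valid Python literal, which matches the input format described in the question; note that a backslash inside a quoted value would be read as a Python escape sequence. The input string below is hypothetical and includes a ; inside quotes to show that this pattern tolerates it.

```python
import ast
import re

rx = r'(\w+)\s+("[^"]*"|[^\s;]+)'
# Hypothetical input: note the ';' inside the quoted value.
s = 'mass 6.02 ; pigeons 17;note "red; not blue";'

# literal_eval strips the quotes from strings and converts numbers in one step.
result = {key: ast.literal_eval(val) for key, val in re.findall(rx, s)}
print(result)
```

Because the quoted alternative is tried first and consumes everything up to the closing ", a semicolon inside a quoted value does not end the match early.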
