Regex to parse a delimited string into key/value pairs (Python)

Question

I have data in text format where key/value pairs are separated by a semicolon, which may or may not be surrounded by whitespace, e.g. ";", "; ", or even " ; ". There will always be a semicolon between pairs, and the string is terminated with a semicolon.

Keys and values are separated by whitespace.

The string is flat; there's never anything nested. Strings are always quoted and numerical values are never quoted. I can count on this being consistent in the input. So, for example,

'cheese "stilton";pigeons 17; color "blue"; why "because I said so";'

Ultimately this winds up as

{'cheese': "stilton", 'pigeons': 17, 'color': "blue", 'why': "because I said so"}

Different strings may include different key/value pairs, and I can't know in advance which keys will be present. So this is an equally valid input string:

mass 6.02 ; mammal "gerbil";telephone "+1 903 555-1212"; size "A1";

I'm thinking that a regex to split the string into a list would be a good start, then I'd just iterate through the list by twos to build the dictionary. Something like

x = PATTERN.split(s)
d = {}
for i in range(0, len(x), 2):
    d[x[i]] = x[i+1]

Which requires a list like ['cheese', 'stilton', 'pigeons', 17, 'color', 'blue', 'why', 'because I said so']. But I can't figure out a regex that produces this form. The closest I have is

([^;[\s]*]+)

Which returns

['', 'cheese', ' ', '"stilton"', ';', 'pigeons', ' ', '17', '; ', 'color', ' ', '"blue"', '; ', 'why', ' ', '"because', ' ', 'I', ' ', 'said', ' ', 'so"', ';']

Of course, it's easy enough to iterate by threes, pick out the key/value pairs, and ignore the captured delimiters, but I'm wondering if there's a different regex that would not capture the delimiters. Any suggestions?

Answer 1

It might be easier to use findall() instead of split() here. This lets you use capture groups to pull out just the parts you want. Then you can split the groups, clean up, etc.:

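import re

s = 'cheese "stilton";pigeons 17; color "blue"; why "because I said so";'
# Group 1 is the key, group 2 is everything up to the next semicolon.
pairs = re.findall(r'(\S+?) (.+?);', s)

d = {}
for k, v in pairs:
    if v.isdigit():
        v = int(v)         # unquoted integer
    else:
        v = v.strip('"')   # quoted string: drop the surrounding quotes
    d[k] = v
print(d)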

Result:

{'cheese': 'stilton',
 'pigeons': 17,
 'color': 'blue',
 'why': 'because I said so'}

This, of course, assumes you aren't using ; anywhere in the data.
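To illustrate that caveat, here is a small sketch on a made-up input (the `note "a;b"` pair is an invented example, not taken from the question) where a quoted value contains a semicolon. The first pattern is the one used above. The second is a quote-aware alternative that treats a complete double-quoted string as one unit; it assumes values never contain an embedded double quote.

```python
import re

# Hypothetical input: the quoted value contains a semicolon.
s = 'note "a;b"; pigeons 17;'

# The pattern from this answer stops at the first ';' it meets,
# so the quoted value is cut in half and the next key is garbled.
naive = re.findall(r'(\S+?) (.+?);', s)
print(naive)

# A quote-aware alternative: a value is either a complete
# double-quoted string or a run of characters that are neither
# whitespace nor ';'.
quote_aware = re.findall(r'(\w+)\s+("[^"]*"|[^\s;]+)', s)
print(quote_aware)
```

With the first pattern the value of `note` is truncated at the embedded semicolon, while the second keeps `"a;b"` intact (quotes still attached, to be stripped afterwards).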

Answer 2

You may use

r'(\w+)\s+("[^"]*"|[^\s;]+)'

to match and extract your data with re.findall, then post-process the Group 2 values to remove one leading and one trailing " character if the first alternative matched, and finally create a dictionary entry.

See the regex demo.

Details

(\w+) - Group 1 (key): one or more word chars
\s+ - one or more whitespace chars
("[^"]*"|[^\s;]+) - Group 2: either a double-quoted string (a ", then zero or more chars other than ", then a ") or one or more chars other than whitespace and ;

Python demo:

import re

rx = r'(\w+)\s+("[^"]*"|[^\s;]+)'
s = 'cheese "stilton";pigeons 17; color "blue"; why "because I said so";'
result = {}
for key, val in re.findall(rx, s):
    # Strip the surrounding quotes from quoted values.
    if val.startswith('"') and val.endswith('"'):
        val = val[1:-1]
    result[key] = val  # unquoted values such as 17 are kept as strings

print(result)
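The question wants unquoted values to end up as numbers, and one of its sample inputs contains a non-integer (`mass 6.02`). A minimal sketch of one way to add that conversion on top of the same pattern follows. The names `PAIR_RE`, `convert`, and `parse_pairs` are hypothetical helpers introduced for illustration, and falling back from `int` to `float` to the raw string is an assumption about what "numerical" should cover.

```python
import re

# Same pattern as above, compiled once (PAIR_RE is a hypothetical name).
PAIR_RE = re.compile(r'(\w+)\s+("[^"]*"|[^\s;]+)')

def convert(raw):
    """Turn one raw value into str, int, or float (hypothetical helper)."""
    # Quoted values are strings: drop the surrounding quotes.
    if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
        return raw[1:-1]
    # Unquoted values: try int first, then float.
    try:
        return int(raw)
    except ValueError:
        pass
    try:
        return float(raw)
    except ValueError:
        return raw  # assumption: leave anything unparseable as text

def parse_pairs(s):
    """Build a dict from a 'key value;' string (hypothetical helper)."""
    return {key: convert(raw) for key, raw in PAIR_RE.findall(s)}

print(parse_pairs('cheese "stilton";pigeons 17; color "blue"; why "because I said so";'))
print(parse_pairs('mass 6.02 ; mammal "gerbil";telephone "+1 903 555-1212"; size "A1";'))
```

Checking for quotes before attempting a numeric conversion keeps a quoted value such as "17" a string, which matches the question's rule that only unquoted values are numbers.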


2 answers

Answer 1
1 accepted 2019-05-25 16:35:41

Answer 2
1 2019-05-25 16:42:27
