Python Regex: Parse key/value pairs from a string
I am trying to dissect a string like the following into a list of key/value pairs:
line1 = "keyword1: value1 keyword2: value2 keyword1: value3 keyword3: value4"
I wrote the following code using regex to achieve that goal:
import re
line1 = "keyword1: value1 keyword2: value2 keyword1: value3 keyword3: value4"
keywords = [ re.escape(k) for k in ['keyword1', 'keyword2', 'keyword3'] ]
any_keyword = '|'.join(keywords)
regex = "(" + any_keyword + "):(.+?)(?:" + any_keyword + "|$)"
print(line1)
print(regex)
for m in re.finditer(regex, line1):
    print(m)
The matches I get are
<re.Match object; span=(0, 25), match='keyword1: value1 keyword2'>
<re.Match object; span=(34, 59), match='keyword1: value3 keyword3'>
and, of course, they include keyword2 and keyword3 at the end of the matched text, so I don't get additional Match objects for those keywords.
How can I receive 4 matches, one for each keyword in the line?
You may extract the matches using a lookahead rather than a non-capturing group as the last pattern in the regex, since a non-capturing group still consumes characters:
import re
line1 = "keyword1: value1 keyword2: value2 keyword1: value3 keyword3: value4"
keywords = ['keyword1', 'keyword2', 'keyword3']
any_keyword = '|'.join(map(re.escape, keywords))
regex = "(" + any_keyword + "):(.+?)(?=(?:" + any_keyword + "):|$)"
print([m.group() for m in re.finditer(regex, line1)])
# => ['keyword1: value1 ', 'keyword2: value2 ', 'keyword1: value3 ', 'keyword3: value4']
See the Python demo.
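To make the difference concrete, here is a minimal sketch that runs the question's consuming pattern and the lookahead pattern side by side, then turns the lookahead matches into (key, value) tuples. The variable names `consuming`, `lookahead` and `pairs` are only illustrative.

```python
import re

line1 = "keyword1: value1 keyword2: value2 keyword1: value3 keyword3: value4"
keywords = ['keyword1', 'keyword2', 'keyword3']
any_keyword = '|'.join(map(re.escape, keywords))

# Pattern from the question: the trailing non-capturing group consumes the next keyword.
consuming = "(" + any_keyword + "):(.+?)(?:" + any_keyword + "|$)"
# Pattern from the answer: the lookahead only checks for the next keyword.
lookahead = "(" + any_keyword + "):(.+?)(?=(?:" + any_keyword + "):|$)"

print(len(re.findall(consuming, line1)))   # 2
print(len(re.findall(lookahead, line1)))   # 4

# Group 1 is the key, Group 2 the raw value (with surrounding spaces).
pairs = [(m.group(1), m.group(2).strip()) for m in re.finditer(lookahead, line1)]
print(pairs)
# [('keyword1', 'value1'), ('keyword2', 'value2'), ('keyword1', 'value3'), ('keyword3', 'value4')]
```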
If your keys can contain whitespace, sort them by length in descending order before building the any_keyword pattern, e.g. sorted(keywords, key=len, reverse=True).
It might be a good idea to match keywords as whole words, too:
regex = r"\b(" + any_keyword + r"):(.+?)(?=\b(?:" + any_keyword + "):|$)"
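As a rough illustration of what the word boundary changes, the sketch below wraps the pattern in a hypothetical helper, `parse_pairs` (not part of the original answer), and feeds it a made-up line in which a value contains `subkeyword2:`. The helper also sorts the keys by length, following the suggestion above.

```python
import re

def parse_pairs(line, keywords, whole_words=True):
    """Return (key, value) tuples found in line. Hypothetical helper for illustration."""
    # Longest keys first, as suggested above for keys that may contain whitespace.
    ordered = sorted(keywords, key=len, reverse=True)
    any_keyword = '|'.join(map(re.escape, ordered))
    boundary = r'\b' if whole_words else ''
    regex = (boundary + '(' + any_keyword + '):(.+?)'
             + '(?=' + boundary + '(?:' + any_keyword + '):|$)')
    return [(m.group(1), m.group(2).strip()) for m in re.finditer(regex, line)]

keywords = ['keyword1', 'keyword2', 'keyword3']
line = "keyword1: see subkeyword2: x keyword2: value2"  # made-up input

print(parse_pairs(line, keywords, whole_words=True))
# [('keyword1', 'see subkeyword2: x'), ('keyword2', 'value2')]
print(parse_pairs(line, keywords, whole_words=False))
# [('keyword1', 'see sub'), ('keyword2', 'x'), ('keyword2', 'value2')]
```

Without `\b`, the `keyword2:` inside `subkeyword2:` is treated as a new key, which splits the first value in two.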
See the regex demo. Details:
- \b - a word boundary
- (keyword1|keyword2|keyword3) - Group 1: keyword alternatives
- : - a : char
- (.+?) - Group 2: any one or more chars other than line break chars, as few as possible
- (?=\b(?:keyword1|keyword2|keyword3):|$) - a positive lookahead that makes sure that, immediately to the right of the current position, there is either
  - \b(?:keyword1|keyword2|keyword3): - any keyword from the list followed with :
  - | - or
  - $ - end of string.
One way using re.split:
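import re
line1 = "keyword1: value1 keyword2: value2 keyword1: value3 keyword3: value4"
# The capturing group makes re.split keep the keys: ['', key, value, key, value, ...]
parts = re.split(r'(keyword\d+):', line1)[1:]
[(k, v.strip()) for k, v in zip(parts[::2], parts[1::2])]
Output: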
[('keyword1', 'value1'),
('keyword2', 'value2'),
('keyword1', 'value3'),
('keyword3', 'value4')]
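The snippet above hard-codes `keyword\d+`. A possible variation, assuming the explicit keyword list from the question, builds the split pattern from that list instead. The name `split_pairs` is made up for this sketch. Note that any text before the first keyword is discarded by the `[1:]` slice.

```python
import re

def split_pairs(line, keywords):
    """Split line on 'key:' markers and pair each key with the text that follows it."""
    any_keyword = '|'.join(map(re.escape, keywords))
    # A capturing group in the pattern makes re.split return the keys as well.
    parts = re.split(r'\b(' + any_keyword + r'):', line)[1:]
    return [(k, v.strip()) for k, v in zip(parts[::2], parts[1::2])]

line1 = "keyword1: value1 keyword2: value2 keyword1: value3 keyword3: value4"
result = split_pairs(line1, ['keyword1', 'keyword2', 'keyword3'])
print(result)
# [('keyword1', 'value1'), ('keyword2', 'value2'), ('keyword1', 'value3'), ('keyword3', 'value4')]
```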
Another way:
>>> regex = "(" + any_keyword + r"):\s(\w+)"
>>> pattern = re.compile(regex)
>>> pattern.search(line1)
<re.Match object; span=(0, 16), match='keyword1: value1'>
>>> pattern.findall(line1)
[('keyword1', 'value1'), ('keyword2', 'value2'), ('keyword1', 'value3'), ('keyword3', 'value4')]
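One caveat with `\w+` is that it stops at the first non-word character, so it only suits single-word values. The sketch below compares it with the lookahead pattern from the earlier answer on a made-up line whose first value has two words.

```python
import re

keywords = ['keyword1', 'keyword2', 'keyword3']
any_keyword = '|'.join(map(re.escape, keywords))
line = "keyword1: two words keyword2: value2"  # made-up input

# \w+ captures one run of word characters after the single whitespace char.
word_only = re.compile("(" + any_keyword + r"):\s(\w+)")
print(word_only.findall(line))
# [('keyword1', 'two'), ('keyword2', 'value2')]

# The lookahead version keeps everything up to the next 'key:' or the end of the string.
lookahead = re.compile("(" + any_keyword + "):(.+?)(?=(?:" + any_keyword + "):|$)")
multi_word = [(k, v.strip()) for k, v in lookahead.findall(line)]
print(multi_word)
# [('keyword1', 'two words'), ('keyword2', 'value2')]
```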