Python regex - capturing repeated pattern groups

Question

I have a log file that I am trying to parse. An example of the log file is below:

Oct 23 13:03:03.714012 prod1_xyz(RSVV)[201]: #msgtype=EVENT #server=Web/Dev@server1web #func=LKZ_WriteData ( line 2992 ) #rc=0 #msgid=XYZ0064 #reqid=0 #msg=Web Activity end (section 200, # SysD 1, Files 222, Bytes 343422089928, Errors 0, Aborted Files 0, Busy Files 0)

I want to pull out every piece of text that starts with a hash and has a key and a value, for example #msgtype=EVENT. Any text that has only a hash, with no "=" sign, should be treated as part of the value.

So for the above log entry, I want a list that looks like this:

#msgtype=EVENT
#server=Web/Dev@server1web
#func=LKZ_WriteData ( line 2992 ) 
#rc=0
#msgid=XYZ0064 
#reqid=0
#msg=Web Activity end (section 200, # SysD 1, Files 222, Bytes 343422089928, Errors 0, Aborted Files 0, Busy Files 0) (Notice the hash present in the middle of the text)

I have tried Python's re.findall, but I am not able to capture all the data.

For example:

import re

text = 'Oct 23 13:03:03.714012 prod1_xyz(RSVV)[201]: #msgtype=EVENT #server=Web/Dev@server1web #func=LKZ_WriteData ( line 2992 ) #rc=0 #msgid=XYZ0064 #reqid=0 #msg=Web Activity end (section 200, # SysD 1, Files 222, Bytes 343422089928, Errors 0, Aborted Files 0, Busy Files 0)'

z = re.findall("(#.+?=.+?)(:?#|$)", text)
print(z)

Output:

[('#msgtype=EVENT ', '#'), ('#func=LKZ_WriteData ( line 2992 ) ', '#'), ('#msgid=XYZ0064 ', '#'), ('#msg=Web Activity end (section 200, ', '#')]

Answer 1

The (:?#|$) is a capturing group that matches an optional : and then #, or the end of the string. Since re.findall returns all captured substrings, the result is a list of tuples. That group also consumes the # that begins the next key, so the following key=value pair can no longer be matched and every second pair is skipped.

You need

re.findall(r'#[^\s=]+=.*?(?=\s*#[^\s=]+=|$)', text)

See the regex demo.

Regex details

#[^\s=]+ - # and then any 1+ chars other than whitespace and =
= - a = char
.*? - any 0+ chars other than line break chars, as few as possible
(?=\s*#[^\s=]+=|$) - a positive lookahead that stops the match right before 0+ whitespaces, #, 1+ chars other than whitespace and =, and then =, or at the end of the string.
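
To make the difference concrete, here is a small sketch that runs both patterns on the log line from the question. The variable names with_group and with_lookahead are only for illustration and are not part of the original answer.

```python
import re

text = (
    "Oct 23 13:03:03.714012 prod1_xyz(RSVV)[201]: #msgtype=EVENT "
    "#server=Web/Dev@server1web #func=LKZ_WriteData ( line 2992 ) #rc=0 "
    "#msgid=XYZ0064 #reqid=0 #msg=Web Activity end (section 200, # SysD 1, "
    "Files 222, Bytes 343422089928, Errors 0, Aborted Files 0, Busy Files 0)"
)

# Pattern from the question: the second group is capturing and consumes the
# '#' of the next key, so findall returns tuples and skips every second pair.
with_group = re.findall(r"(#.+?=.+?)(:?#|$)", text)

# Pattern from this answer: the lookahead only checks what follows and
# consumes nothing, so findall returns plain strings and no pair is lost.
with_lookahead = re.findall(r"#[^\s=]+=.*?(?=\s*#[^\s=]+=|$)", text)

print(len(with_group))      # 4
print(len(with_lookahead))  # 7
for pair in with_lookahead:
    print(pair)
```

The lone "# SysD" inside the #msg value is not treated as a new key because the lookahead requires non-whitespace characters followed by = right after the #.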

Answer 2

import re

s = "Oct 23 13:03:03.714012 prod1_xyz(RSVV)[201]: #msgtype=EVENT #server=Web/Dev@server1web #func=LKZ_WriteData ( line 2992 ) #rc=0 #msgid=XYZ0064 #reqid=0 #msg=Web Activity end (section 200, # SysD 1, Files 222, Bytes 343422089928, Errors 0, Aborted Files 0, Busy Files 0)"

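# Match '#key=value', stopping before the next ' #key=' or at the end of the string
a = re.findall(r'#(?=[a-zA-Z]+=).+?=.*?(?= #[a-zA-Z]+=|$)', s)

# Split on the first '=' only, so a value that itself contains '=' stays intact
result = [item.split('=', 1) for item in a]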

print(result)

Gives:

[['#msgtype', 'EVENT'], ['#server', 'Web/Dev@server1web'], ['#func', 'LKZ_WriteData ( line 2992 )'], ['#rc', '0'], ['#msgid', 'XYZ0064'], ['#reqid', '0'], ['#msg', 'Web Activity end (section 200, # SysD 1, Files 222, Bytes 343422089928, Errors 0, Aborted Files 0, Busy Files 0)']]
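
If a dictionary is more convenient than a list of pairs, the key and value can be captured directly instead of splitting afterwards. This is only a sketch: the helper name parse_fields is hypothetical, and it combines the key pattern from Answer 1 with two capturing groups, which neither answer does as written.

```python
import re

def parse_fields(line):
    """Return a dict of the '#key=value' pairs found in one log line."""
    # Two capturing groups make findall return (key, value) tuples.
    # The lookahead ends each value before the next '#key=' or at the end.
    pairs = re.findall(r"#([^\s=]+)=(.*?)(?=\s*#[^\s=]+=|$)", line)
    return dict(pairs)

s = (
    "Oct 23 13:03:03.714012 prod1_xyz(RSVV)[201]: #msgtype=EVENT "
    "#server=Web/Dev@server1web #func=LKZ_WriteData ( line 2992 ) #rc=0 "
    "#msgid=XYZ0064 #reqid=0 #msg=Web Activity end (section 200, # SysD 1, "
    "Files 222, Bytes 343422089928, Errors 0, Aborted Files 0, Busy Files 0)"
)

fields = parse_fields(s)
print(fields["msgtype"])  # EVENT
print(fields["func"])     # LKZ_WriteData ( line 2992 )
print(fields["msg"])
```

Note that a dict keeps only the last value if the same key appears twice on a line; keep the list of tuples if duplicates matter.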

Answer 1
1 2019-10-25 14:38:35

Answer 2
0 2019-10-25 14:39:10
