How to parse and match with multiple regexes
I have input data of the form:
[2] IN: 2.12 INOUT: 3.52 (Input)
[2] IN: 2.12 INOUT: 3.52 (Input)
OUT: 2.42 INOUT: 2.62 (Output)
[2] OUT: 2.42 INOUT: 2.62 (Output)
[2] IN: 2.12 INOUT: 3.52 (Input)
OUT: 2.42 INOUT: 2.62 (Output)
[2] IN: 2.12 INOUT: 3.52 (Input)
[2] OUT: 2.42 INOUT: 2.62 (Output)
[2] IN: 2.12 INOUT: 3.52 (Input)
OUT: 2.42 INOUT: 2.62 (Output)
I need to parse this data and extract the IN: / OUT: / INOUT: values using three given regexes:
regex1 = r"\[2\]\s*IN:\s*(\S+?)\s*INOUT:\s*(\S+?)\s"
regex2 = r"\[2\]\s*OUT:\s*(\S+?)\s*INOUT:\s*(\S+?)\s"
regex3 = r"\[2\]\s*IN:\s*(\S+?)\s*INOUT:\s*(\S+?)\s.*?.\s*OUT:\s*(\S+?)\s*INOUT:\s*(\S+?)\s"
My output should be:
IN_r1 2.12 INOUT_r1 3.52
IN_r3 2.12 INOUT1_r3 3.52 OUT_r3 2.42 INOUT2_r3 2.62
OUT_r2 2.42 INOUT_r2 2.62
IN_r3 2.12 INOUT1_r3 3.52 OUT_r3 2.42 INOUT2_r3 2.62
IN_r1 2.12 INOUT_r1 3.52
OUT_r2 2.42 INOUT_r2 2.62
IN_r3 2.12 INOUT1_r3 3.52 OUT_r3 2.42 INOUT2_r3 2.62
The code I tried:
import re
regex1 = r"\[2\]\s*IN:\s*(\S+?)\s*INOUT:\s*(\S+?)\s"
regex2 = r"\[2\]\s*OUT:\s*(\S+?)\s*INOUT:\s*(\S+?)\s"
regex3 = r"\[2\]\s*IN:\s*(\S+?)\s*INOUT:\s*(\S+?)\s.*?.\s*OUT:\s*(\S+?)\s*INOUT:\s*(\S+?)\s"
data = """
[2] IN: 2.12 INOUT: 3.52 (Input)
[2] IN: 2.12 INOUT: 3.52 (Input)
OUT: 2.42 INOUT: 2.62 (Output)
[2] OUT: 2.42 INOUT: 2.62 (Output)
[2] IN: 2.12 INOUT: 3.52 (Input)
OUT: 2.42 INOUT: 2.62 (Output)
[2] IN: 2.12 INOUT: 3.52 (Input)
[2] OUT: 2.42 INOUT: 2.62 (Output)
[2] IN: 2.12 INOUT: 3.52 (Input)
OUT: 2.42 INOUT: 2.62 (Output)
"""
lines = re.split("\[2]",data)
for line in lines:
    if re.search(regex1,data):
        tracks = re.findall(regex1,data,re.DOTALL)
        for track in tracks:
            input,inout = (float(z) for z in track)
            with open("checked_ant.txt",'a') as a:
                print("IN_r1",input,"INOUT_r1",inout,file=a)
    elif re.search(regex2,data):
        tracks = re.findall(regex2,data,re.DOTALL)
        for track in tracks:
            output,inout = (float(z) for z in track)
            with open("checked_ant.txt",'a') as a:
                print("OUT_r2",output,"INOUT_r2",inout,file=a)
    elif re.search(regex3,data):
        tracks = re.findall(regex3,data,re.DOTALL)
        for track in tracks:
            input,inout1,output,inout2 = (float(z) for z in track)
            with open("checked_ant.txt",'a') as a:
                print("IN_r3",input,"INOUT1_r3",inout1,"OUT_r3",output,"INOUT2_r3",inout2,file=a)
The problem is that it does not parse correctly: the regexes are not matched against each sub-block beginning with [2].
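Two details of the attempt above may explain the mismatch. Splitting on "[2]" consumes the marker that all three regexes require, and the loop body searches data rather than line, so every iteration sees the whole input. The sketch below illustrates the first point on a shortened sample; the sample string and the variable names are illustrative assumptions, not part of the original code.

```python
import re

regex1 = r"\[2\]\s*IN:\s*(\S+?)\s*INOUT:\s*(\S+?)\s"

# Shortened, illustrative sample (not the full input from the question).
sample = (
    "[2] IN: 2.12 INOUT: 3.52 (Input)\n"
    "[2] OUT: 2.42 INOUT: 2.62 (Output)\n"
)

# A plain split consumes the "[2]" marker, so no piece can match a
# regex that itself starts with \[2\].
pieces = re.split(r"\[2\]", sample)
matches_after_plain_split = [bool(re.search(regex1, piece)) for piece in pieces]

# A zero-width lookahead split keeps the marker at the start of each
# block (splitting on empty matches needs Python 3.7 or later).
blocks = [b for b in re.split(r"(?=\[2\])", sample) if b.strip()]
matches_after_lookahead_split = [bool(re.search(regex1, block)) for block in blocks]

print(matches_after_plain_split)
print(matches_after_lookahead_split)
```

With the lookahead split, each block can then be tested on its own (passing the block, not the whole input, to re.search).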
Although I find the requirement quite strange (the regexes are provided and cannot be changed), I got the expected result. Can you try this?
import re
s = '''[2] IN: 2.12 INOUT: 3.52 (Input)
[2] IN: 2.12 INOUT: 3.52 (Input)
OUT: 2.42 INOUT: 2.62 (Output)
[2] OUT: 2.42 INOUT: 2.62 (Output)
[2] IN: 2.12 INOUT: 3.52 (Input)
OUT: 2.42 INOUT: 2.62 (Output)
[2] IN: 2.12 INOUT: 3.52 (Input)
[2] OUT: 2.42 INOUT: 2.62 (Output)
[2] IN: 2.12 INOUT: 3.52 (Input)
OUT: 2.42 INOUT: 2.62 (Output)'''
r1 = r"\[2\]\s*IN:\s*(\S+?)\s*INOUT:\s*(\S+?)\s"
r2 = r"\[2\]\s*OUT:\s*(\S+?)\s*INOUT:\s*(\S+?)\s"
r3 = r"\[2\]\s*IN:\s*(\S+?)\s*INOUT:\s*(\S+?)\s.*?.\s*OUT:\s*(\S+?)\s*INOUT:\s*(\S+?)\s"
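
def g(reg, s, n):
    # Return capture group n of the first match of reg in s, as a float.
    return float(re.search(reg, s).group(n))

# Split into blocks that each start at a "[2]" marker. The zero-width
# lookahead keeps the marker in the block (requires Python 3.7+).
paras = re.split(r'(?=\[2\])', s)
for p in paras:
    # Test r3 first: any block that matches r3 also matches r1.
    if re.search(r3, p):
        print(
            f'IN_r3 {g(r3, p, 1)} INOUT1_r3 {g(r3, p, 2)} OUT_r3 {g(r3, p, 3)} INOUT2_r3 {g(r3, p, 4)}')
    elif re.search(r1, p):
        print(f'IN_r1 {g(r1, p, 1)} INOUT_r1 {g(r1, p, 2)}')
    elif re.search(r2, p):
        print(f'OUT_r2 {g(r2, p, 1)} INOUT_r2 {g(r2, p, 2)}')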
Update
For better performance, you can match only once and then read the groups from the match object. Taking r1 as an example:
gs = re.search(r1, p)
if gs:
    print(f'IN_r1 {gs.group(1)} INOUT_r1 {gs.group(2)}')
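Applying the same idea to all three patterns, a minimal sketch might keep each regex together with its labels in a list and call search once per pattern per block. The names RULES and classify are hypothetical, and listing r3 first is an assumption made because any block that matches r3 also matches r1.

```python
import re

r1 = r"\[2\]\s*IN:\s*(\S+?)\s*INOUT:\s*(\S+?)\s"
r2 = r"\[2\]\s*OUT:\s*(\S+?)\s*INOUT:\s*(\S+?)\s"
r3 = r"\[2\]\s*IN:\s*(\S+?)\s*INOUT:\s*(\S+?)\s.*?.\s*OUT:\s*(\S+?)\s*INOUT:\s*(\S+?)\s"

# (compiled pattern, labels) pairs. r3 comes first because a block
# that matches r3 would otherwise be claimed by r1.
RULES = [
    (re.compile(r3), ("IN_r3", "INOUT1_r3", "OUT_r3", "INOUT2_r3")),
    (re.compile(r1), ("IN_r1", "INOUT_r1")),
    (re.compile(r2), ("OUT_r2", "INOUT_r2")),
]

def classify(block):
    """Return the labelled line for one "[2]" block, or None if no rule matches."""
    for pattern, labels in RULES:
        m = pattern.search(block)  # one search per pattern; groups come from m
        if m:
            return " ".join(f"{label} {value}" for label, value in zip(labels, m.groups()))
    return None

print(classify("[2] IN: 2.12 INOUT: 3.52 (Input)\nOUT: 2.42 INOUT: 2.62 (Output)\n"))
print(classify("[2] OUT: 2.42 INOUT: 2.62 (Output)\n"))
```

Here the captured values are kept as strings; wrap them in float() if numeric values are needed.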
Here is a regex find-all approach. We can first search for each multiline section beginning with [2], then find all the numbers in that section and print them on a single line.
import re
inp = """[2] IN: 2.12 INOUT: 3.52 (Input)
[2] IN: 2.12 INOUT: 3.52 (Input)
OUT: 2.42 INOUT: 2.62 (Output)
[2] OUT: 2.42 INOUT: 2.62 (Output)
[2] IN: 2.12 INOUT: 3.52 (Input)
OUT: 2.42 INOUT: 2.62 (Output)
[2] IN: 2.12 INOUT: 3.52 (Input)
[2] OUT: 2.42 INOUT: 2.62 (Output)
[2] IN: 2.12 INOUT: 3.52 (Input)
OUT: 2.42 INOUT: 2.62 (Output)"""
first = 1
for m in re.finditer(r'\[\d+\](.*?)(?=\[\d+\]|$)', inp, flags=re.DOTALL):
    nums = re.findall(r'\d+(?:\.\d+)?', m.group(1))
    if first != 1:
        print('')
    print(' '.join(nums), end='')
    first = 0
This prints:
2.12 3.52
2.12 3.52 2.42 2.62
2.42 2.62
2.12 3.52 2.42 2.62
2.12 3.52
2.42 2.62
2.12 3.52 2.42 2.62
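The output above contains only the numbers. If the labels from the question are also wanted, one possible extension is to choose the labels from the keywords present in each section. This is a sketch under the assumption that every section holds an IN line, an OUT line, or an IN line followed by an OUT line; it does not use the three given regexes, so it only applies if they are not strictly required. The function name label_section is hypothetical.

```python
import re

def label_section(section):
    """Label the numbers of one section (the text that follows a "[n]" marker)."""
    nums = re.findall(r'\d+(?:\.\d+)?', section)
    # \b stops "OUT:" from matching inside "INOUT:".
    has_in = re.search(r'\bIN:', section) is not None
    has_out = re.search(r'\bOUT:', section) is not None
    if has_in and has_out:
        labels = ("IN_r3", "INOUT1_r3", "OUT_r3", "INOUT2_r3")
    elif has_in:
        labels = ("IN_r1", "INOUT_r1")
    elif has_out:
        labels = ("OUT_r2", "INOUT_r2")
    else:
        return None
    return " ".join(f"{label} {num}" for label, num in zip(labels, nums))

# Shortened, illustrative input.
inp = (
    "[2] IN: 2.12 INOUT: 3.52 (Input)\n"
    "OUT: 2.42 INOUT: 2.62 (Output)\n"
    "[2] OUT: 2.42 INOUT: 2.62 (Output)\n"
)

labelled = [
    label_section(m.group(1))
    for m in re.finditer(r'\[\d+\](.*?)(?=\[\d+\]|$)', inp, flags=re.DOTALL)
]
print("\n".join(labelled))
```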