
Multiline regex pattern match

I have the following multiline(?) string that I get from the output of a process.

04/18@14:22 - RESPONSE from 192.68.10.1 :
04/18@14:22 - RESPONSE from 192.68.10.1 :
TSB1 File Name: OCAP_TSB_76 04/18@14:22 - RESPONSE from 192.68.10.1 : TSB1 Duration: 1752 seconds 04/18@14:22 - RESPONSE from 192.68.10.1 : TSB1 Bit Rate: 3669 kbps 04/18@14:22 - RESPONSE from 192.68.10.1 :
04/18@14:22 - RESPONSE from 192.68.10.1 :
TSB2 File Name: OCAP_TSB_80 04/18@14:22 - RESPONSE from 192.68.10.1 : TSB2 Duration: 56 seconds 04/18@14:22 - RESPONSE from 192.68.10.1 : TSB2 Bit Rate: 3675 kbps 04/18@14:22 - RESPONSE from 192.68.10.1 :

I am trying to extract just the values in 'seconds' and 'kbps'.

This is what I have so far.

>>> cpat = re.compile(r"\.*RESPONSE from[^:]+:\s*TSB[\d] Duration:\s*(\d+) seconds\.*?RESPONSE from[^:]+:\s*TSB[\d] Bit Rate:\s*(\d+) kbps", re.DOTALL)
>>> m = re.findall(cpat,txt)
>>> m
[]

I find matches if I break the regex into separate parts, but I am looking for a single result like the one below:

m = [(1752, 3669), (56, 3675)]

Thanks a lot!

re.compile(r"\.*RESPONSE from[^:]+:\s*TSB[\d] Duration:\s*(\d+) seconds\.*?RESPONSE from[^:]+:\s*TSB[\d] Bit Rate:\s*(\d+) kbps", re.DOTALL)
                                                                       ^

I think this dot was not meant to be escaped; otherwise it matches literal dots instead of any character. Try with:

re.compile(r"\.*RESPONSE from[^:]+:\s*TSB[\d] Duration:\s*(\d+) seconds.*?RESPONSE from[^:]+:\s*TSB[\d] Bit Rate:\s*(\d+) kbps", re.DOTALL)
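The difference between the escaped and unescaped dot can be seen on a small input. This is a minimal sketch; the variable names and the shortened sample text are assumptions made for illustration, not part of the original question.

```python
import re

# Shortened stand-in for the gap between "seconds" and the next "RESPONSE".
text = "seconds\n04/18@14:22 - RESPONSE"

# Escaped: \.*? can only consume literal '.' characters, so it cannot bridge the gap.
escaped = re.search(r"seconds\.*?RESPONSE", text, re.DOTALL)

# Unescaped: .*? lazily consumes any characters, including the newline under re.DOTALL.
unescaped = re.search(r"seconds.*?RESPONSE", text, re.DOTALL)

print(escaped)                # None
print(unescaped is not None)  # True
```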

Also, there are some unnecessary parts in your regex that you can remove and still get the matches you're looking for. I removed them in the regex below:

re.compile(r"RESPONSE from[^:]+:\s*TSB\d Duration:\s*(\d+) seconds.*?RESPONSE from[^:]+:\s*TSB\d Bit Rate:\s*(\d+) kbps", re.DOTALL)

Namely:

  • You don't need .* at the start of the regex with re.findall.
  • You don't need to put \d within square brackets if it is alone.
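The simplified pattern can be checked against the sample output from the question. This is a minimal sketch, assuming the text is held in a variable named txt (as in the question) and that the line breaks fall as shown; the name pairs is introduced here for illustration.

```python
import re

# Sample text copied from the question; the line layout is an assumption.
txt = """04/18@14:22 - RESPONSE from 192.68.10.1 :
04/18@14:22 - RESPONSE from 192.68.10.1 :
TSB1 File Name: OCAP_TSB_76 04/18@14:22 - RESPONSE from 192.68.10.1 : TSB1 Duration: 1752 seconds 04/18@14:22 - RESPONSE from 192.68.10.1 : TSB1 Bit Rate: 3669 kbps 04/18@14:22 - RESPONSE from 192.68.10.1 :
04/18@14:22 - RESPONSE from 192.68.10.1 :
TSB2 File Name: OCAP_TSB_80 04/18@14:22 - RESPONSE from 192.68.10.1 : TSB2 Duration: 56 seconds 04/18@14:22 - RESPONSE from 192.68.10.1 : TSB2 Bit Rate: 3675 kbps 04/18@14:22 - RESPONSE from 192.68.10.1 :
"""

# Same pattern as above, split across two string literals for readability.
cpat = re.compile(
    r"RESPONSE from[^:]+:\s*TSB\d Duration:\s*(\d+) seconds"
    r".*?RESPONSE from[^:]+:\s*TSB\d Bit Rate:\s*(\d+) kbps",
    re.DOTALL,
)

# findall returns tuples of strings; convert each captured value to int.
pairs = [(int(sec), int(rate)) for sec, rate in cpat.findall(txt)]
print(pairs)  # [(1752, 3669), (56, 3675)]
```

Because the groups capture only digits, no stripping is needed before calling int.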

This code gives what you want:

import re

data = '''
04/18@14:22 - RESPONSE from 192.68.10.1 :
04/18@14:22 - RESPONSE from 192.68.10.1 :
TSB1 File Name: OCAP_TSB_76 04/18@14:22 - RESPONSE from 192.68.10.1 : TSB1 Duration: 1752 seconds 04/18@14:22 - RESPONSE from 192.68.10.1 : TSB1 Bit Rate: 3669 kbps 04/18@14:22 - RESPONSE from 192.68.10.1 :
04/18@14:22 - RESPONSE from 192.68.10.1 :
TSB2 File Name: OCAP_TSB_80 04/18@14:22 - RESPONSE from 192.68.10.1 : TSB2 Duration: 56 seconds 04/18@14:22 - RESPONSE from 192.68.10.1 : TSB2 Bit Rate: 3675 kbps 04/18@14:22 - RESPONSE from 192.68.10.1 :
'''

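output = []
# Each match starts at a timestamped RESPONSE marker; group 2 is the rest of that line.
block_pattern = re.compile(r'(\d+\/\d+@\d+:\d+ - RESPONSE.*?)(.*)')
# The greedy captures keep the surrounding spaces, so the values need stripping later.
seconds_speed_pattern = re.compile(r'TSB.*Duration:(.*)seconds.*TSB.*Bit Rate:(.*)kbps')
blocks = re.findall(block_pattern, data)
for block in blocks:
    ss_data = re.findall(seconds_speed_pattern, block[1])
    if ss_data:
        output.append(ss_data[0])

print(output)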

This prints:

[(' 1752 ', ' 3669 '), (' 56 ', ' 3675 ')]

To convert those values from str to int, just do:

output = [(int(a.strip()), int(b.strip())) for a, b in output]

This gives:

[(1752, 3669), (56, 3675)]

result = re.findall(r"(?sim)Duration: (\d+).*?Rate: (\d+)", subject)


Options: dot matches newline; case insensitive; ^ and $ match at line breaks

Match the characters “Duration: ” literally «Duration: »
Match the regular expression below and capture its match into backreference number 1 «(\d+)»
   Match a single digit 0..9 «\d+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match any single character «.*?»
   Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the characters “Rate: ” literally «Rate: »
Match the regular expression below and capture its match into backreference number 2 «(\d+)»
   Match a single digit 0..9 «\d+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
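The short pattern can be tried on input where each value sits on its own line, which is where the s (dot matches newline) option matters. This is a minimal sketch; the line layout of subject and the names without_dotall and as_ints are assumptions made for illustration.

```python
import re

# Assumed layout: Duration and Bit Rate reported on separate lines.
subject = (
    "04/18@14:22 - RESPONSE from 192.68.10.1 : TSB1 Duration: 1752 seconds\n"
    "04/18@14:22 - RESPONSE from 192.68.10.1 : TSB1 Bit Rate: 3669 kbps\n"
    "04/18@14:22 - RESPONSE from 192.68.10.1 : TSB2 Duration: 56 seconds\n"
    "04/18@14:22 - RESPONSE from 192.68.10.1 : TSB2 Bit Rate: 3675 kbps\n"
)

# With the inline flags, .*? may cross the newline between the two lines.
result = re.findall(r"(?sim)Duration: (\d+).*?Rate: (\d+)", subject)

# Without the s flag, '.' stops at the newline, so nothing matches here.
without_dotall = re.findall(r"Duration: (\d+).*?Rate: (\d+)", subject)

# The captured groups are strings; convert them if integers are wanted.
as_ints = [(int(a), int(b)) for a, b in result]

print(result)          # [('1752', '3669'), ('56', '3675')]
print(without_dotall)  # []
print(as_ints)         # [(1752, 3669), (56, 3675)]
```

The m option has no effect on this pattern because it contains no ^ or $ anchors.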
