
Regex findall output not as expected

I tried using a regex to extract parts of text read from a .txt file. However, my method fails on some specific lines.

Below are 3 lines from the input text:

[2019/07/11 18:52:25.391] Receive : <- AI (Req No. 711185105702666 ) Message from : cop10

[2019/07/11 18:52:25.391] Note    : Response that is not being sent ... cop10

[2019/07/11 18:52:25.393] ★Err    : subargs[0] : IBSDK_7776

Below is the code that extracts a portion of the text after the timestamp:

import re

regex = r"\[.{23}] ?(.{1,8}:.{1,12}).*\n"
pattern = re.compile(regex)
for line in input_text:  # input_text: the lines read from the .txt file
    matches = pattern.findall(line)
    print('matches is {}'.format(matches))

For lines 1 and 2 of the input text, the output is as expected, i.e. a list containing the extracted text.

Shown below is the output for line 1:

matches is ['Receive : <- AI (Req ']

For the last line the list is empty, i.e. [].

My expectation was ['★Err : subargs[0]'] or a list containing some text.

I suspect it could be due to the black star (★) in the text, since the lines containing it are where the snippet fails, but I am not sure why it happens.

It would be great to get some input on this, and on whether I need to change my regex.

The last line is not matched because there is no newline after it, while the pattern requires a \n at the end.
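A minimal sketch of that diagnosis, using the pattern from the question on the third input line. The variable names are illustrative, and the line is assumed to contain four spaces between "★Err" and the colon, as in the sample input:

```python
import re

# The pattern from the question; it only matches when a "\n" follows the line.
pattern = re.compile(r"\[.{23}] ?(.{1,8}:.{1,12}).*\n")

line = "[2019/07/11 18:52:25.393] ★Err    : subargs[0] : IBSDK_7776"

# Same text, with and without a trailing newline.
with_newline = pattern.findall(line + "\n")
without_newline = pattern.findall(line)

print(with_newline)     # one match: the star itself is matched by "."
print(without_newline)  # empty list: the required "\n" is missing
```

The star is an ordinary character as far as `.` is concerned, so the only difference between the two calls is the trailing newline.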

If you want to keep your current pattern, you can assert the end of the string with $ instead of matching \n.

Your code might look like:

regex = r"\[.{23}] ?(.{1,8}:.{1,12}).*$"
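As a rough end-to-end sketch, the loop from the question can be run with this pattern over the three sample lines. Here `input_text` is a hypothetical stand-in for the file contents, with the last line deliberately missing its trailing newline:

```python
import re

# Same pattern as in the question, but ending in "$" instead of "\n".
pattern = re.compile(r"\[.{23}] ?(.{1,8}:.{1,12}).*$")

# Assumed stand-in for the lines read from the .txt file.
input_text = [
    "[2019/07/11 18:52:25.391] Receive : <- AI (Req No. 711185105702666 ) Message from : cop10\n",
    "[2019/07/11 18:52:25.391] Note    : Response that is not being sent ... cop10\n",
    "[2019/07/11 18:52:25.393] ★Err    : subargs[0] : IBSDK_7776",
]

results = [pattern.findall(line) for line in input_text]
for matches in results:
    print('matches is {}'.format(matches))
```

Without the re.MULTILINE flag, `$` matches at the very end of the string and also just before a single trailing newline, so lines with and without "\n" are both handled.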


The current pattern does not take the timestamp format into account; it matches any 23 characters except a newline between [ and ].

You might update your pattern to match your current timestamp format (it does not validate the timestamp), then use a negated character class [^:]+: to match up to the first : after it, and perhaps omit the .* match after the capturing group:

\[\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}] ?([^:]+:.{1,12})
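One possible way to use this stricter pattern from Python is sketched below. The helper name `extract_prefix` and the sample `bad_line` are made up for illustration:

```python
import re

# The stricter pattern: digits in the timestamp positions, then text up to
# the first ":" followed by at most 12 more characters.
TIMESTAMPED = re.compile(
    r"\[\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}] ?([^:]+:.{1,12})"
)


def extract_prefix(line):
    """Return the captured text after the timestamp, or None if no match."""
    match = TIMESTAMPED.search(line)
    return match.group(1) if match else None


err_line = "[2019/07/11 18:52:25.393] ★Err    : subargs[0] : IBSDK_7776"
# Hypothetical line whose bracketed prefix is not a timestamp.
bad_line = "[not a timestamp at all!] ★Err    : subargs[0] : IBSDK_7776"

print(extract_prefix(err_line))
print(extract_prefix(bad_line))
```

Because nothing after the capturing group has to match, this version does not depend on whether the line ends with a newline.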


