How do I parse multi-line logs when I have some regex for individual lines?
I have newline-delimited logs that look like this:
Unimportant unimportant
Some THREAD-123 blah blah blah patternA blah blah blah
Unimportant unimportant
More THREAD-123 blah blah blah patternB blah blah blah
Unimportant unimportant
Unimportant unimportant
Outbound XML distinctive doctype tag
Unimportant unimportant
Outbound XML distinctive root opening-tag
Unimportant unimportant
Unimportant unimportant
Unimportant unimportant
Outbound XML distinctive HEY-THIS-IS-MY-DATA tagset and innertext
Unimportant unimportant
Outbound XML distinctive root closing-tag
Unimportant unimportant
Unimportant unimportant
Unimportant unimportant
Yet more THREAD-123 blah blah blah patternC blah blah blah
Unimportant unimportant
Unimportant unimportant
Even more THREAD-123 blah blah blah patternD blah blah blah
Unimportant unimportant
Inbound XML distinctive snippet
Unimportant unimportant
Unimportant unimportant
Unimportant unimportant
Just a bit more THREAD-123 blah blah blah patternE blah blah blah
Unimportant unimportant
Unimportant unimportant
And then THREAD-123 blah blah blah patternF blah blah blah
Unimportant unimportant
I've already come up with "^...$" regex patterns capable of recognizing every line you see here that isn't "Unimportant unimportant", with one caveat:
Sometimes, things that match one of these patterns will themselves be unimportant. Like, there might be overlapping concurrent threads that both match this pattern.
So once I see a "Some THREAD-(\d+) blah blah blah patternA blah blah blah", I'll need to save off "(\d+)"'s value of "123" from "THREAD-(\d+)" into some sort of variable and use it as a literal in subsequent patternB-patternF (actually look for "THREAD-123").
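The capture-then-reuse step can be sketched like this (a minimal sketch; the two sample lines and the simplified pattern text are hypothetical stand-ins for the real log format):

```python
import re

lines = [
    'Some THREAD-123 blah blah blah patternA blah blah blah',
    'More THREAD-123 blah blah blah patternB blah blah blah',
]

# Capture the thread number the first time patternA appears...
pattern_a = re.compile(r'THREAD-(\d+) .*patternA')
thread_number = None
for line in lines:
    m = pattern_a.search(line)
    if m:
        thread_number = m.group(1)
        break

# ...then bake it into the later patterns as a literal
# (re.escape guards against metacharacters in the captured text).
pattern_b = re.compile(r'THREAD-' + re.escape(thread_number) + r' .*patternB')
```

After this, `pattern_b` only matches patternB lines belonging to the same thread.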
Furthermore, I need to pass in a parameter to the whole thing where I've written "HEY-THIS-IS-MY-DATA".
In other words, I'm looking for "HEY-THIS-IS-MY-DATA" surrounded by consistent "opening" and "closing" sequences of regexes in a log file.
Any tips on how I could approach this?
Extremely vanilla Python 3 (as delivered on 2021-era AWS EC2 RHEL instances), older (v5) PowerShell, or the Linux shell flavors that come with standard 2021-era AWS EC2 RHEL instances would be my preferred languages, as I'll be passing this on for others to use as a unit test for validating whether certain behaviors against "HEY-THIS-IS-MY-DATA" in an interactive UI "show up correctly" in logs.
It's ugly, but it seems to work.
I realized that if I just keep whacking the beginning off the logs any time I find the first instance of a thing I'm looking for, and then keep looking for more of it, I should be all right.
First I throw away all lines of the log file that don't even match any of the 11 regexes. Meanwhile, I also cache the thread numbers involved in the matching regexes.
Then I loop through the remaining log lines. I start with a modified regex #0 (the first cached thread number in place of \d+), see if I find an instance of it, chop off everything before that, keep looking for modified regex #1 from there, and repeat down the list.
Do that for as many variants of the regex set as there are thread numbers in the cache.
Error out if I don't find all 11 regexes, in order, based on this find-and-chop method.
(Note: I just realized this code errors out prematurely if there's more than 1 thread number and the all-11 match isn't in the first thread number processed. I'll have to fix that. Should've tested against a bigger log. Oops.)
from collections import OrderedDict
from itertools import islice
import re

hey_this_is_my_data = 'my_data'
filepath = 'c:\\example\\log.txt'

class LogDidNotMatchException(Exception):
    pass

logstart = re.compile(r'^start_of_every_log_line (.*)$')

def get_od(thread_number_pattern):
    # Build the ordered set of 11 regexes, with the thread-number pattern
    # (either \d+ or a cached literal) spliced into the thread-tagged lines.
    returnme = OrderedDict()
    returnme[0] = re.compile(r'^start_of_every_log_line \[jibberjabber\] \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} blahteeblah INFO \[Thread-(' + thread_number_pattern + r')\] - patterna$')
    returnme[1] = re.compile(r'^start_of_every_log_line \[jibberjabber\] \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} blahteeblah INFO \[Thread-(' + thread_number_pattern + r')\] - patternb$')
    returnme[2] = re.compile(r'^start_of_every_log_line <!DOCTYPE root_type SYSTEM "[\.\w]+">$')
    returnme[3] = re.compile(r'^start_of_every_log_line <root_type>$')
    returnme[4] = re.compile(r'^start_of_every_log_line <DataId>' + hey_this_is_my_data + r'<\/DataId>$')
    returnme[5] = re.compile(r'^start_of_every_log_line </root_type>$')
    returnme[6] = re.compile(r'^start_of_every_log_line \[jibberjabber\] \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} blahteeblah INFO \[Thread-(' + thread_number_pattern + r')\] - patternc$')
    returnme[7] = re.compile(r'^start_of_every_log_line \[jibberjabber\] \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} blahteeblah INFO \[Thread-(' + thread_number_pattern + r')\] - patternd$')
    returnme[8] = re.compile(r'^start_of_every_log_line <response><Reply><Result status="success" \/><\/Reply><\/response>$')
    returnme[9] = re.compile(r'^start_of_every_log_line \[jibberjabber\] \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} blahteeblah INFO \[Thread-(' + thread_number_pattern + r')\] - patterne$')
    returnme[10] = re.compile(r'^start_of_every_log_line \[jibberjabber\] \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} dumdeedum INFO \[Thread-(' + thread_number_pattern + r')\] - patternf$')
    return returnme

def filter_lines(enumerated_lines, regex_od):
    # Keep only lines matching at least one regex; cache thread numbers
    # captured by regex #0.
    returnme_kept_lines = OrderedDict()
    returnme_thread_numbers = []
    for line_number, line in enumerated_lines.items():
        not_yet_kept = True
        for rgx_num, rgx in regex_od.items():
            if not_yet_kept and rgx.search(line):
                not_yet_kept = False
                returnme_kept_lines[line_number] = line
                if rgx_num == 0:
                    returnme_thread_numbers.append(rgx.match(line).group(1))
    return returnme_kept_lines, returnme_thread_numbers

with open(filepath, 'r') as f:
    lines = f.readlines()

first_od = get_od(r'\d+')
kept_lines, thread_numbers = filter_lines(OrderedDict(enumerate(lines)), first_od)

def find_first_regex_occurrence_in_linesod(the_lines_od, the_regex):
    line_number_found_regex_on = -1
    if the_lines_od is None or len(the_lines_od) == 0 or the_regex is None:
        return line_number_found_regex_on
    for i, (line_num, line) in enumerate(the_lines_od.items()):
        if line_number_found_regex_on == -1:
            #print('loopline', i, 'fileline:', line_num, 'lineslen:', len(the_lines_od))
            if the_regex.search(line):
                line_number_found_regex_on = i
                #print(f'found on {i}')
    return line_number_found_regex_on

def recursively_process_subset(collector, lines_od, rgx_od, curr_rgx_key):
    #print('\n', 'recursiongo', 'lineslen:', len(lines_od), 'currregexno:', curr_rgx_key)
    if curr_rgx_key >= len(rgx_od):
        return collector  # Recursion base condition
    if len(lines_od) == 0:
        if curr_rgx_key < len(rgx_od):
            raise LogDidNotMatchException(f'Never got through regex key {curr_rgx_key}')
        return collector  # Recursion base condition
    line_number_found_currod_on = find_first_regex_occurrence_in_linesod(lines_od, rgx_od[curr_rgx_key])
    if line_number_found_currod_on == -1:
        raise LogDidNotMatchException(f'Short-circuited trying to find regex key {curr_rgx_key}')
    #print(f'recursion found for regex key {curr_rgx_key} on line {line_number_found_currod_on} of {len(lines_od)}-line logsubset')
    if (curr_rgx_key + 1) < len(rgx_od):
        # "Chop": recurse on only the lines after the one we just matched.
        currodfound_linesod_new_param = OrderedDict(islice(lines_od.items(), line_number_found_currod_on + 1, len(lines_od)))
        return recursively_process_subset(collector, currodfound_linesod_new_param, rgx_od, curr_rgx_key + 1)
    return collector

try:
    for thread_number in thread_numbers:
        thread_number_based_od = get_od(str(thread_number))
        thread_number_kept_lines, thread_number_thread_numbers = filter_lines(kept_lines, thread_number_based_od)
        x = recursively_process_subset([], thread_number_kept_lines, thread_number_based_od, 0)
        #print('final', x, 'lenlines:', len(thread_number_kept_lines))
    print(f'Success: All {len(first_od)} expected patterns were found in the log, in order, for ID {hey_this_is_my_data}.')
except LogDidNotMatchException:
    print(f'Failure: Not all {len(first_od)} expected patterns were found as expected in the log for ID {hey_this_is_my_data}. Below are lines that seemed close but were not quite enough:\n')
    for line_number, line in kept_lines.items():
        print(f'log line #{line_number}: {line}')
    print(f'Failure: Not all {len(first_od)} expected patterns were found as expected in the log for ID {hey_this_is_my_data}. Above are lines that seemed close but were not quite enough.')
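For reference, the same find-and-chop idea can be sketched iteratively with a moving index instead of recursion and dictionary slicing; wrapping the per-thread attempts in any() also sidesteps the premature-error issue noted above, since one thread number failing no longer aborts the rest. Everything here (the sample lines and the two-pattern list) is a hypothetical stand-in for the real 11-regex set:

```python
import re

def log_matches_in_order(lines, patterns):
    """Return True if every pattern matches some line, in order."""
    pos = 0
    for pattern in patterns:
        for i in range(pos, len(lines)):
            if pattern.search(lines[i]):
                pos = i + 1  # "chop": only look past this line next time
                break
        else:
            return False  # ran out of lines before finding this pattern
    return True

lines = ['THREAD-7 patternA', 'noise', 'THREAD-7 patternB', 'THREAD-9 patternA']
thread_numbers = ['7', '9']

def patterns_for(thread):
    # Bake the cached thread number into the pattern set as a literal.
    return [re.compile(r'THREAD-' + re.escape(thread) + r' patternA'),
            re.compile(r'THREAD-' + re.escape(thread) + r' patternB')]

# Succeed if ANY thread number yields the full in-order match.
found = any(log_matches_in_order(lines, patterns_for(t)) for t in thread_numbers)
```

Here thread 7 produces the full in-order match even though thread 9 does not, so `found` is True.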