
How do I parse multi-line logs when I have some regex for individual lines?

I have newline-delimited logs that look like this:

Unimportant unimportant
Some THREAD-123 blah blah blah patternA blah blah blah
Unimportant unimportant
More THREAD-123 blah blah blah patternB blah blah blah
Unimportant unimportant
Unimportant unimportant
Outbound XML distinctive doctype tag
Unimportant unimportant
Outbound XML distinctive root opening-tag
Unimportant unimportant
Unimportant unimportant
Unimportant unimportant
Outbound XML distinctive HEY-THIS-IS-MY-DATA tagset and innertext
Unimportant unimportant
Outbound XML distinctive root closing-tag
Unimportant unimportant
Unimportant unimportant
Unimportant unimportant
Yet more THREAD-123 blah blah blah patternC blah blah blah
Unimportant unimportant
Unimportant unimportant
Even more THREAD-123 blah blah blah patternD blah blah blah
Unimportant unimportant
Inbound XML distinctive snippet
Unimportant unimportant
Unimportant unimportant
Unimportant unimportant
Just a bit more THREAD-123 blah blah blah patternE blah blah blah
Unimportant unimportant
Unimportant unimportant
And then THREAD-123 blah blah blah patternF blah blah blah
Unimportant unimportant

I've already come up with ^...$ regex patterns capable of recognizing every line you see here that isn't "Unimportant unimportant", with one caveat:

Sometimes, things that match one of these patterns will themselves be unimportant.

Like, there might be overlapping concurrent threads that both match this pattern.

So once I see "Some THREAD-(\d+) blah blah blah patternA blah blah blah", I'll need to save off the "123" value captured by "(\d+)" in "THREAD-(\d+)" into some sort of variable and use it as a literal in the subsequent patternB through patternF (actually look for "THREAD-123").
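
For illustration, that capture-and-reuse step might look like this in Python (the pattern text here is made up, not the real log format):

import re

# Illustrative only: pattern A captures the thread number...
pattern_a = re.compile(r'^Some THREAD-(\d+) .* patternA .*$')

match = pattern_a.search('Some THREAD-123 blah blah blah patternA blah blah blah')
if match:
    thread_id = match.group(1)  # '123', saved off for the later patterns
    # ...and the follow-up patterns embed that value as a literal.
    # re.escape is defensive, in case the captured text had metacharacters.
    pattern_b = re.compile(r'^More THREAD-' + re.escape(thread_id) + r' .* patternB .*$')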

Furthermore, I need to pass in a parameter to the whole thing where I've written "HEY-THIS-IS-MY-DATA".

In other words, I'm looking for "HEY-THIS-IS-MY-DATA" surrounded by consistent "opening" and "closing" sequences of regexes in a log file.

Any tips on how I could approach this?

Extremely vanilla Python 3 (as delivered on 2021-era AWS EC2 RHEL instances), older (v5) PowerShell, or the Linux shell flavors that come with standard 2021-era AWS EC2 RHEL instances would be my preferred languages, since I'll be passing this on for others to use as a unit test for validating whether certain behaviors against "HEY-THIS-IS-MY-DATA" in an interactive UI "show up correctly" in the logs.

It's ugly, but it seems to work.

I realized that if I just keep whacking the beginning off the logs any time I find the first instance of a thing I'm looking for, and then keep looking for more of it, I should be all right.

First I throw away all lines of the log file that don't even match any of the 11 regexes. Meanwhile, I also cache the thread numbers involved in the matching regexes.

Then I loop through the remaining log lines. I start with a modified regex #0 (the first cached thread number in place of \d+), see if I find an instance of it, chop off everything before that, keep looking for modified regex #1 from there, and repeat, repeat, repeat.

Do that for as many variants of the regex set as there are thread numbers in the cache.

Error out if I don't find all 11 regexes, in order, based on this find-and-chop method.
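
In compressed form, that find-and-chop check might be sketched like this (a toy illustration, separate from the full script below):

import re

def all_patterns_in_order(lines, patterns):
    """Return True if every compiled pattern matches some line, in order,
    each match starting strictly after the previous one."""
    start = 0  # everything before this index has effectively been chopped off
    for pattern in patterns:
        for i in range(start, len(lines)):
            if pattern.search(lines[i]):
                start = i + 1  # chop: later searches begin after this line
                break
        else:
            return False  # pattern never appeared after the previous match
    return True

# Toy usage with made-up patterns and a made-up log:
toy_log = ['noise', 'first A', 'noise', 'then B', 'finally C']
toy_patterns = [re.compile(p) for p in (r'\bA$', r'\bB$', r'\bC$')]
assert all_patterns_in_order(toy_log, toy_patterns)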

(Note: I just realized the script below errors out prematurely if there's more than one thread number and the all-11 match isn't in the first thread number processed. I'll have to fix that; a sketch of one possible fix follows the script. Should've tested against a bigger log. Oops.)

from collections import OrderedDict
from itertools import islice
import re

hey_this_is_my_data = 'my_data'
filepath = 'c:\\example\\log.txt'

class LogDidNotMatchException(Exception):
    """Raised when the expected sequence of patterns is not found in the log."""
    pass

logstart = re.compile(r'^start_of_every_log_line (.*)$')

def get_od(thread_number_pattern):
    """Build the ordered dict of the 11 expected patterns, splicing in
    thread_number_pattern (a wildcard on the first pass, a literal thread
    number on later passes)."""
    prefix = r'^start_of_every_log_line '
    stamp = r'\[jibberjabber\] \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} '
    thread = r'\[Thread-(' + thread_number_pattern + r')\] - '
    returnme = OrderedDict()
    returnme[0] = re.compile(prefix + stamp + r'blahteeblah INFO      ' + thread + r'patterna$')
    returnme[1] = re.compile(prefix + stamp + r'blahteeblah INFO      ' + thread + r'patternb$')
    returnme[2] = re.compile(prefix + r'<!DOCTYPE root_type SYSTEM "[\.\w]+">$')
    returnme[3] = re.compile(prefix + r'<root_type>$')
    # Escape the ID in case it contains regex metacharacters.
    returnme[4] = re.compile(prefix + r'<DataId>' + re.escape(hey_this_is_my_data) + r'</DataId>$')
    returnme[5] = re.compile(prefix + r'</root_type>$')
    returnme[6] = re.compile(prefix + stamp + r'blahteeblah INFO      ' + thread + r'patternc$')
    returnme[7] = re.compile(prefix + stamp + r'blahteeblah INFO      ' + thread + r'patternd$')
    returnme[8] = re.compile(prefix + r'<response><Reply><Result status="success" /></Reply></response>$')
    returnme[9] = re.compile(prefix + stamp + r'blahteeblah INFO      ' + thread + r'patterne$')
    returnme[10] = re.compile(prefix + stamp + r'dumdeedum INFO      ' + thread + r'patternf$')
    return returnme

def filter_lines(enumerated_lines, regex_od):
    """Keep only the lines that match one of the regexes; also cache the
    thread numbers captured by regex #0 (the patternA line)."""
    returnme_kept_lines = OrderedDict()
    returnme_thread_numbers = []
    for line_number, line in enumerated_lines.items():
        for rgx_num, rgx in regex_od.items():
            match = rgx.search(line)
            if match:
                returnme_kept_lines[line_number] = line
                if rgx_num == 0:
                    returnme_thread_numbers.append(match.group(1))
                break  # first matching regex wins; move on to the next line
    return returnme_kept_lines, returnme_thread_numbers

with open(filepath, 'r') as f:
    lines = f.readlines()

# First pass: use a \d+ wildcard for the thread number, keeping every
# interesting line and caching the thread numbers seen on patternA lines.
first_od = get_od(r'\d+')
kept_lines, thread_numbers = filter_lines(OrderedDict(enumerate(lines)), first_od)

def find_first_regex_occurrence_in_linesod(the_lines_od, the_regex):
    """Return the 0-based position, within the OrderedDict, of the first line
    matching the_regex, or -1 if nothing matches."""
    if not the_lines_od or the_regex is None:
        return -1
    for i, (line_num, line) in enumerate(the_lines_od.items()):
        if the_regex.search(line):
            return i  # stop at the first hit instead of scanning the rest
    return -1

def recursively_process_subset(collector, lines_od, rgx_od, curr_rgx_key):
    """Find rgx_od[curr_rgx_key] among lines_od, chop off everything up to and
    including the matching line, then recurse with the next regex key."""
    if curr_rgx_key >= len(rgx_od):
        return collector  # Recursion base condition: every pattern was found
    if len(lines_od) == 0:
        raise LogDidNotMatchException(f'Never got through regex key {curr_rgx_key}')
    line_number_found_currod_on = find_first_regex_occurrence_in_linesod(lines_od, rgx_od[curr_rgx_key])
    if line_number_found_currod_on == -1:
        raise LogDidNotMatchException(f'Short-circuited trying to find regex key {curr_rgx_key}')
    # "Chop": keep only the lines after the match, then look for the next regex.
    remaining_lines_od = OrderedDict(islice(lines_od.items(), line_number_found_currod_on + 1, len(lines_od)))
    return recursively_process_subset(collector, remaining_lines_od, rgx_od, curr_rgx_key + 1)

try:
    for thread_number in thread_numbers:
        # Second pass: rebuild the 11 patterns with this literal thread number
        # and require all of them to appear, in order, among the kept lines.
        thread_number_based_od = get_od(str(thread_number))
        thread_number_kept_lines, thread_number_thread_numbers = filter_lines(kept_lines, thread_number_based_od)
        recursively_process_subset([], thread_number_kept_lines, thread_number_based_od, 0)
    print(f'Success:  All {len(first_od)} expected patterns were found in the log, in order, for ID {hey_this_is_my_data}.')
except LogDidNotMatchException:
    print(f'Failure:  Not all {len(first_od)} expected patterns were found as expected in the log for ID {hey_this_is_my_data}.  Below are lines that seemed close but were not quite enough:\n')
    for line_number, line in kept_lines.items():
        print(f'log line #{line_number}:  {line}')
    print(f'Failure:  Not all {len(first_od)} expected patterns were found as expected in the log for ID {hey_this_is_my_data}.  Above are lines that seemed close but were not quite enough.')
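
As for the multi-thread caveat noted above, one sketch of a fix, reusing the same helpers from the script: catch the exception per thread number and only fail when no cached thread number yields all 11 patterns in order.

# Sketch of a fix for the multi-thread caveat: treat each cached thread
# number as a candidate and succeed if ANY candidate matches all 11
# patterns in order, instead of erroring out on the first one that fails.
any_thread_matched = False
for thread_number in thread_numbers:
    candidate_od = get_od(str(thread_number))
    candidate_kept_lines, _ = filter_lines(kept_lines, candidate_od)
    try:
        recursively_process_subset([], candidate_kept_lines, candidate_od, 0)
        any_thread_matched = True
        break  # one fully matching thread number is enough
    except LogDidNotMatchException:
        continue  # this thread number didn't pan out; try the next one

if any_thread_matched:
    print(f'Success:  All {len(first_od)} expected patterns were found, in order, for ID {hey_this_is_my_data}.')
else:
    print(f'Failure:  No thread number matched all {len(first_od)} expected patterns for ID {hey_this_is_my_data}.')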
