How to match each distinct entry of a log file with a multi-line regex

Question

For a log file, I'm trying to get a match for each distinct entry even if it spans multiple lines. Each distinct entry will begin with a timestamp even if there are multiple lines pertaining to the entry.

Here is my log file:

2000-01-01 01:01:01 UTC This is a 2 line sentence.
This is the second line
2000-01-01 01:01:02 UTC some random text on 1 line
2000-01-01 01:01:03 UTC This is a much longer 1 line sentence that manages to wrap itself around because of its length
2022-01-01 01:01:04 UTC This multi line paragraph has a few blank lines in between lines of text
           words words words and some numbers12345

a few more words
more words on another line and the next line might be blank

2000-01-01 01:01:05 UTC some random text on 1 line
2000-01-01 06:01:06 UTC This multi line paragraph has a few blank lines in between lines of text
           words words words and some numbers678910

a few more words
more words on another line and the next line might be blank

2000-01-01 01:01:07 UTC some random text on one line

Essentially, any line that does not begin with a timestamp should be matched as part of the preceding entry.

This works well as a base, but it won't grab any entry that spans multiple lines:
^([0-9]{4}[-][0-9]{2}[-][0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC [[][0-9]+[]]: [[][0-9]+[-][0-9]+[]].+\n)

I've tried extending it with a negative lookahead so that each distinct entry becomes one match, like so, but it's not right and I get even fewer matches:
^([0-9]{4}[-][0-9]{2}[-][0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC.+\n)(.+\n)*(?:([0-9]{4}[-][0-9]{2}[-][0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC))

Is there a way to construct a regex to grab each distinct entry?

Answer 1

Your first example seems to take milliseconds into account, which I don't see in your logs.

You could do it with a positive lookahead:

^([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC) (.*?)(?=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}|\z)

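It grabs the log text until it encounters another timestamp or the end of the input (\z), and captures the timestamp and the log entry separately. It needs the multiline flag so that ^ matches at the start of each line, and the dot-matches-newline (single-line) flag so that .*? can span line breaks.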

Regex101
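For anyone who wants to try the lookahead approach outside Regex101, here is a minimal Python sketch. It is an adaptation, not the answer's exact pattern: Python's re module has traditionally spelled the end-of-input anchor \Z, so that is used in place of \z, and the lookahead is additionally anchored with ^ and includes UTC so that a timestamp appearing mid-line does not end an entry. The names LOG, TIMESTAMP, ENTRY and parse_entries are made up for illustration, and the sample is a shortened copy of the log in the question.

```python
import re

# Shortened sample based on the log in the question.
LOG = """2000-01-01 01:01:01 UTC This is a 2 line sentence.
This is the second line
2000-01-01 01:01:02 UTC some random text on 1 line
2022-01-01 01:01:04 UTC This multi line paragraph has a few blank lines
           words words words and some numbers12345

a few more words

2000-01-01 01:01:05 UTC some random text on 1 line
"""

TIMESTAMP = r"[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC"

# MULTILINE lets ^ match at every line start; DOTALL lets . cross newlines.
# The lazy (.*?) stops at the next line that starts with a timestamp,
# or at the end of the input (\Z).
ENTRY = re.compile(
    rf"^({TIMESTAMP}) (.*?)(?=^{TIMESTAMP}|\Z)",
    re.MULTILINE | re.DOTALL,
)


def parse_entries(text):
    """Return (timestamp, message) pairs, one per log entry."""
    return [(m.group(1), m.group(2).rstrip()) for m in ENTRY.finditer(text)]


entries = parse_entries(LOG)
for timestamp, message in entries:
    print(timestamp, repr(message))
```

Trailing blank lines are stripped from each message with rstrip(); drop that call if the exact whitespace matters.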

Answer 2

From your first regex, I do not understand why you are using [[][0-9]+[]]: [[][0-9]+[-][0-9]+[]].+\n after UTC, or what [.][0-9]+ is meant to be for.

However, this is how you could make it work with a negative lookahead:

^(?![0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC).*

With the multiline flag set, this matches every line that does not start with a timestamp followed by UTC; lines that do start with one are skipped.

See the result
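A quick way to check what the negative lookahead picks up is to run it in Python. This is only a sketch: LOG is a shortened, made-up sample in the style of the question's log, and NOT_TIMESTAMPED and continuation_lines are illustrative names. Because the pattern also matches empty lines, the sketch filters those out afterwards.

```python
import re

# Shortened, illustrative sample in the style of the question's log.
LOG = """2000-01-01 01:01:01 UTC This is a 2 line sentence.
This is the second line
2000-01-01 01:01:02 UTC some random text on 1 line
2000-01-01 01:01:03 UTC first line of a paragraph
           words words words

a few more words
"""

# With MULTILINE, ^ matches at the start of every line, and the negative
# lookahead rejects lines that begin with "<timestamp> UTC".
NOT_TIMESTAMPED = re.compile(
    r"^(?![0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC).*",
    re.MULTILINE,
)

# The pattern also matches blank lines (as empty strings), so drop those.
continuation_lines = [
    line for line in NOT_TIMESTAMPED.findall(LOG) if line.strip()
]

for line in continuation_lines:
    print(repr(line))
```

Note that this returns the continuation lines on their own, not whole entries, so they still have to be attached to the preceding timestamped line if complete entries are needed.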

2 answers

solution1
2 ACCEPTED 2022-08-15 00:42:44

solution2
1 2022-08-15 00:36:24
