
How to match each distinct entry of a log file with a multi-line regex

For a log file, I'm trying to get a match for each distinct entry even if it spans multiple lines. Each distinct entry will begin with a timestamp even if there are multiple lines pertaining to the entry.

Here is my log file:

2000-01-01 01:01:01 UTC This is a 2 line sentence.
This is the second line
2000-01-01 01:01:02 UTC some random text on 1 line
2000-01-01 01:01:03 UTC This is a much longer 1 line sentence that manages to wrap itself around because of its length
2022-01-01 01:01:04 UTC This multi line paragraph has a few blank lines in between lines of text
           words words words and some numbers12345

a few more words
more words on another line and the next line might be blank

2000-01-01 01:01:05 UTC some random text on 1 line
2000-01-01 06:01:06 UTC This multi line paragraph has a few blank lines in between lines of text
           words words words and some numbers678910

a few more words
more words on another line and the next line might be blank

2000-01-01 01:01:07 UTC some random text on one line

Essentially, I also need to capture any line that does not begin with a timestamp as part of the entry before it.

This works well as a base, but it won't grab any entry that spans multiple lines:
^([0-9]{4}[-][0-9]{2}[-][0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC [[][0-9]+[]]: [[][0-9]+[-][0-9]+[]].+\n)

I've tried extending it with a lookahead so that each distinct entry becomes one match, like so, but it's not right and I get even fewer matches:
^([0-9]{4}[-][0-9]{2}[-][0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC.+\n)(.+\n)*(?:([0-9]{4}[-][0-9]{2}[-][0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC))

Is there a way to construct a regex to grab each distinct entry?

Your first example seems to take milliseconds into account, which I don't see in your logs.

You could do it with a positive lookahead:

^([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC) (.*?)(?=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}|\z)

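It lazily consumes the log text until the lookahead finds the next timestamp or the end of the input ( \z ), and captures the timestamp and the entry text in separate groups. It needs the multiline flag so that ^ matches at the start of each line, and the single-line (dot-all) flag so that . also matches newlines.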

Regex101
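For readers working in Python, the same idea might look like the sketch below. It is an illustration under a few assumptions, not the answerer's exact pattern: Python's re module spells the end-of-input anchor \Z rather than \z, the lookahead is additionally anchored with ^ so that a timestamp in the middle of a line does not end an entry, and the names LOG, TIMESTAMP, ENTRY and entries are made up for the example. The sample text is a shortened copy of the log from the question.

```python
import re

# Shortened sample taken from the question's log file.
LOG = """\
2000-01-01 01:01:01 UTC This is a 2 line sentence.
This is the second line
2000-01-01 01:01:02 UTC some random text on 1 line
2022-01-01 01:01:04 UTC This multi line paragraph has a few blank lines

a few more words

2000-01-01 01:01:05 UTC some random text on 1 line
"""

# Timestamp pattern from the answer, including the trailing "UTC".
TIMESTAMP = r"[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC"

# MULTILINE: ^ matches at every line start. DOTALL: . also matches "\n".
# \Z is Python's end-of-input anchor (the answer's \z is the PCRE spelling).
ENTRY = re.compile(
    rf"^({TIMESTAMP}) (.*?)(?=^{TIMESTAMP}|\Z)",
    re.MULTILINE | re.DOTALL,
)

# One (timestamp, text) pair per entry; trailing newlines are trimmed.
entries = [(m.group(1), m.group(2).rstrip()) for m in ENTRY.finditer(LOG)]

for timestamp, text in entries:
    print(timestamp, repr(text))
```

Trimming with rstrip() is a choice made here for readability; drop it if the trailing blank lines of an entry matter.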

In your first regex, I do not understand why you use [[][0-9]+[]]: [[][0-9]+[-][0-9]+[]].+\n after UTC, or what [.][0-9]+ is supposed to be for.

However, this is how you could make it work with a negative lookahead:

^(?![0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC).*

With the multiline flag set, this matches only the lines that do not start with a timestamp followed by UTC, that is, the continuation lines of an entry.

See the result
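A minimal Python sketch of this negative-lookahead pattern, again using a shortened copy of the question's log. The names LOG, CONTINUATION and continuation_lines are hypothetical, and filtering out blank lines is an extra step added here, since the pattern itself also matches empty lines.

```python
import re

# Shortened sample taken from the question's log file.
LOG = """\
2000-01-01 01:01:01 UTC This is a 2 line sentence.
This is the second line
2000-01-01 01:01:02 UTC some random text on 1 line
2022-01-01 01:01:04 UTC This multi line paragraph has a few blank lines
           words words words and some numbers12345

a few more words
"""

# MULTILINE makes ^ match at the start of every line. The negative
# lookahead rejects any line that begins with a timestamp plus "UTC".
CONTINUATION = re.compile(
    r"^(?![0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC).*",
    re.MULTILINE,
)

# Keep only non-blank matches; blank lines produce empty matches.
continuation_lines = [
    m.group(0) for m in CONTINUATION.finditer(LOG) if m.group(0).strip()
]

for line in continuation_lines:
    print(repr(line))
```

Note that this returns the continuation lines on their own, without tying them to the entry they belong to.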

