Parse multiline log entries using a regex

I'm trying to parse log entries in a C# app using this regex: (^[0-9]{4}(-[0-9]{2}){2}([^|]+\\|){3})(?!\\1) for logs in a format like [date (in some format)] | [level] | [appname] | [message] .

Where (I think):

  • ^ matches the beginning of a line (with the /gm flags enabled on regex101)
  • [0-9]{4}(-[0-9]{2}){2} matches the beginning of the date, like 2015-03-03
  • ([^|]+\\|){3}) matches the rest of the date, the log level and the app name
  • (?!\\1) asserts that what follows is not the start of a new log entry (so it should be the message)

For example, I have the following 4 log entries (separated by blank lines for clarity):

2015-03-03 19:30:47.2725|INFO|MyApp|This is a single line log message.

2015-03-03 19:31:29.1209|INFO|MyApp|This log message has multiple
lines with
2015-03-03
a date in it.

2015-03-03 19:32:50.1106|INFO|MyApp|This log message has
multiple lines
but just text only.

2015-03-03 19:33:20.2683|ERROR|MyApp|This log message has multiple lines but
also some confusing text like
2015-03-03 19:33:20.2683|ERROR| which should
still be a valid log message.

But the regex does not capture the message when I test it on regex101, probably because I don't understand how to capture with the negative lookahead.

If I include .* in the regex: (^[0-9]{4}(-[0-9]{2}){2}([^|]+\\|){3}).*(?!\\1) it matches the message, but only its first line (because . does not match a newline).

So how can I capture the (multiline) message?
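
The two behaviours described in the question can be reproduced with a short script. This is only a sketch: it uses Python instead of C# (an assumption made for brevity; a lookahead is zero-width in both engines), the pattern is written with single backslashes as it would be typed into regex101, and the two-entry sample text is made up from the entries above.

```python
import re

# Hypothetical two-entry sample in the format from the question.
log = (
    "2015-03-03 19:30:47.2725|INFO|MyApp|This is a single line log message.\n"
    "2015-03-03 19:31:29.1209|INFO|MyApp|This log message has multiple\n"
    "lines with\n"
)

header = r"(^[0-9]{4}(-[0-9]{2}){2}([^|]+\|){3})"

# A lookahead only asserts; it consumes no characters, so the overall
# match ends right after the third pipe and equals group 1.
lookahead_only = re.search(header + r"(?!\1)", log, re.MULTILINE)
header_only = lookahead_only.group(0) == lookahead_only.group(1)

# Adding .* extends each match, but only to the end of the header line,
# because . does not match a newline by default.
with_dot_star = [
    m.group(0) for m in re.finditer(header + r".*(?!\1)", log, re.MULTILINE)
]
```

Here header_only is true, and the continuation line "lines with" never appears in with_dot_star, which matches what regex101 shows.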

You can use this regex:

(^\d{4}(-\d{2}){2}([^|]+\|){3})([\s\S]*?)\n*(?=^\d{4}.*?(?:[^|\n]+\|){3}|\z)

This regex should work in C# as well; just make sure to enable the multiline option (RegexOptions.Multiline) so that ^ matches at the start of every line.
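
As a quick check of this pattern, here is a minimal sketch in Python rather than C# (an assumption for the sake of a short, runnable example). Two things differ from the regex above: Python's re module spells the strict end-of-string anchor \Z where .NET uses \z, and the sample log is a hypothetical three-entry excerpt of the question's data with \n line endings.

```python
import re

# Hypothetical sample: three entries in the question's format.
log = (
    "2015-03-03 19:30:47.2725|INFO|MyApp|This is a single line log message.\n"
    "2015-03-03 19:31:29.1209|INFO|MyApp|This log message has multiple\n"
    "lines with\n"
    "2015-03-03\n"
    "a date in it.\n"
    "2015-03-03 19:33:20.2683|ERROR|MyApp|This log message has multiple lines but\n"
    "also some confusing text like\n"
    "2015-03-03 19:33:20.2683|ERROR| which should\n"
    "still be a valid log message.\n"
)

entry = re.compile(
    r"(^\d{4}(-\d{2}){2}([^|]+\|){3})"       # group 1: date, level and app name
    r"([\s\S]*?)"                            # group 4: the message, as short as possible
    r"\n*(?=^\d{4}.*?(?:[^|\n]+\|){3}|\Z)",  # stop before the next header or at the end
    re.MULTILINE,
)

# One message per log entry, without the trailing line breaks.
messages = [m.group(4) for m in entry.finditer(log)]
```

The lookahead only accepts a line as a new header when it starts with four digits and has three pipe-terminated fields on that same line, which is why the bare date and the two-pipe line stay inside their messages. In C# the equivalent call would be Regex.Matches with the original pattern and RegexOptions.Multiline.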

Something like this should work.
See the comments in the formatted regex below.
(Edit: the line break is now optional, to handle the end of the string or a single-line message.)

 @"(?m)^[0-9]{4}(?:-[0-9]{2}){2}(?:[^|\r\n]+\|){3}((?:(?!^[0-9]{4}(?:-[0-9]{2}){2}(?:[^|\r\n]+\|){3}).*(?:\r?\n)?)+)"

Formatted (to use this layout as written, enable free-spacing mode, i.e. RegexOptions.IgnorePatternWhitespace in .NET):

 (?m)                          # Modifier - multiline
 ^                             # BOL
 [0-9]{4}                      # Message header
 (?: - [0-9]{2} ){2}
 (?: [^|\r\n]+ \| ){3}
 (                             # (1 start), The Message
      (?:
           (?!                           # Assert, not a Message header
                ^                             # BOL
                [0-9]{4} 
                (?: - [0-9]{2} ){2}
                (?: [^|\r\n]+ \| ){3}
           )
           .*                            # Line is ok, it's part of the message
           (?: \r? \n )?                 # Optional line break
      )+
 )                             # (1 end)

Output:

 **  Grp 0 -  ( pos 0 , len 74 ) 
2015-03-03 19:30:47.2725|INFO|MyApp|This is a single line log message.


 **  Grp 1 -  ( pos 36 , len 38 ) 
This is a single line log message.

--------------

 **  Grp 0 -  ( pos 74 , len 108 ) 
2015-03-03 19:31:29.1209|INFO|MyApp|This log message has multiple
lines with
2015-03-03
a date in it.


 **  Grp 1 -  ( pos 110 , len 72 ) 
This log message has multiple
lines with
2015-03-03
a date in it.

--------------

 **  Grp 0 -  ( pos 182 , len 97 ) 
2015-03-03 19:32:50.1106|INFO|MyApp|This log message has
multiple lines
but just text only.


 **  Grp 1 -  ( pos 218 , len 61 ) 
This log message has
multiple lines
but just text only.

--------------

 **  Grp 0 -  ( pos 279 , len 186 ) 
2015-03-03 19:33:20.2683|ERROR|MyApp|This log message has multiple lines but
also some confusing text like
2015-03-03 19:33:20.2683|ERROR| which should
still be a valid log message.

 **  Grp 1 -  ( pos 316 , len 149 ) 
This log message has multiple lines but
also some confusing text like
2015-03-03 19:33:20.2683|ERROR| which should
still be a valid log message.
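
The same line-by-line idea can be tried outside .NET. The following is a sketch in Python (an assumption; the answer targets C#), with the header sub-pattern stored in a hypothetical HEADER variable so it is written only once, and a made-up three-entry sample using \n line endings.

```python
import re

# Hypothetical sample: three entries in the question's format.
log = (
    "2015-03-03 19:30:47.2725|INFO|MyApp|This is a single line log message.\n"
    "2015-03-03 19:31:29.1209|INFO|MyApp|This log message has multiple\n"
    "lines with\n"
    "2015-03-03\n"
    "a date in it.\n"
    "2015-03-03 19:33:20.2683|ERROR|MyApp|This log message has multiple lines but\n"
    "also some confusing text like\n"
    "2015-03-03 19:33:20.2683|ERROR| which should\n"
    "still be a valid log message.\n"
)

# Message header: date, then three pipe-terminated fields on one line.
HEADER = r"^[0-9]{4}(?:-[0-9]{2}){2}(?:[^|\r\n]+\|){3}"

# Group 1 collects lines for as long as the next line is not a header.
entry = re.compile(
    HEADER + r"((?:(?!" + HEADER + r").*(?:\r?\n)?)+)",
    re.MULTILINE,
)

# Group 1 keeps the trailing line breaks, so strip them for display.
messages = [m.group(1).rstrip("\r\n") for m in entry.finditer(log)]
```

Because the header fields exclude \r and \n, a line that merely starts with a date, or that has fewer than three pipes, is kept as part of the current message.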

What regex engine are you using? In Java for example there is a flag to tell "." to match newline characters.

The following regex appears to do the trick:

/(([0-9]{4})(-[0-9]{2}){2}([^|]+\|){3})((.(?!\2))*)/sg

The modifications I made to your pattern were mostly cleanup (your date capturing group was wrong). I then added a . and a * in the final capturing group. https://regex101.com/r/fU1vV1/2

The most important part is the use of the sg flags. g makes it return all matches. s makes . match newlines too, so the input is treated as a single line (otherwise your negative lookahead would never work). All of this would be unnecessary if you could guarantee that each message sat on one line, since you could then just capture to the end of the line.
