
Regex for log entries that contain carriage returns

I am trying to write a regex for parsing logs. It works fine for ordinary log entries, but some entries contain carriage returns, and the regex then fails to pick up the continuation lines.

([0-9]{2}\s[A-Za-z]{3}\s[0-9]{4}\s[0-9]{2}:[0-9]{2}:[0-9]{2}(?:,[0-9]{3})?)\s?(.*)

The regex above works fine for entries with no extra carriage returns:

01 Jan 2018 04:25:56,546 [TEXT] aabb33-ddee33-54321 (host-1-usa-east) this.is.sample.log: service is responding normal
02 Jan 2018 05:25:56,546 [TEXT] aabb33-ddee33-54321 (host-1-usa-east) this.is.sample.log: service is responding normal

But it fails to pick up "extra line 1" and "extra line 2" when one of the entries contains added carriage returns:

01 Jan 2018 04:25:56,546 [TEXT] aabb33-ddee33-54321 (host-1-usa-east) this.is.sample.log: service is responding normal
02 Jan 2018 05:25:56,546 [TEXT] aabb33-ddee33-54321 (host-1-usa-east) this.is.sample.log: service is responding normal
extra line 1
extra line 2
03 Jan 2018 08:25:56,546 [TEXT] aabb33-ddee33-54321 (host-1-usa-east) this.is.sample.log: service is responding normal

I even tried adding ^ to anchor the match at the start, but that picks up only the first log entry:

^([0-9]{2}\s[A-Za-z]{3}\s[0-9]{4}\s[0-9]{2}:[0-9]{2}:[0-9]{2}(?:,[0-9]{3})?)\s?(.*)
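
Both behaviours can be reproduced with a short sketch. It assumes Python's re module (the question does not name a regex engine), and the names TAIL, LOG, and PATTERN are illustrative only. With no flags set, . never crosses a newline, so the continuation lines are dropped, and ^ matches only at the very start of the input, so only the first entry is found.

```python
import re

# Sample log from the question; TAIL is the part shared by every entry.
TAIL = "[TEXT] aabb33-ddee33-54321 (host-1-usa-east) this.is.sample.log: service is responding normal"
LOG = (
    f"01 Jan 2018 04:25:56,546 {TAIL}\n"
    f"02 Jan 2018 05:25:56,546 {TAIL}\n"
    "extra line 1\n"
    "extra line 2\n"
    f"03 Jan 2018 08:25:56,546 {TAIL}\n"
)

# The original pattern from the question, unchanged.
PATTERN = r"([0-9]{2}\s[A-Za-z]{3}\s[0-9]{4}\s[0-9]{2}:[0-9]{2}:[0-9]{2}(?:,[0-9]{3})?)\s?(.*)"

# No flags: "." stops at each newline, so the extra lines are never captured.
unanchored = re.findall(PATTERN, LOG)

# With "^" and no multiline flag, only the start of the whole input can match.
anchored = re.findall("^" + PATTERN, LOG)

print(len(unanchored), len(anchored))
```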

You might use

(?<=\n|^)(\d{2} [A-Za-z]{3} \d{4} \d{2}:\d{2}:\d{2}(?:,\d{3})?)\s?(.*?)(?=$|\n\d{2} [A-Za-z]{3} \d{4})
^^^^^^^^^                                                            ^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The important part is the lookahead at the end, which looks for either the next date or the end of the string. Also make sure the . is repeated lazily, as in (.*?). The beginning uses a lookbehind for \n or ^ instead of the m flag, so that the $ in the final lookahead matches only at the end of the string, not at the end of every line.

https://regex101.com/r/YAkWBe/1
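
As a rough illustration, here is how the same idea might look in Python. This is a sketch under assumptions, not the exact pattern above: Python's re module only accepts fixed-width lookbehinds, so the sketch replaces the lookbehind with ^ under re.MULTILINE and uses \Z (which matches only at the end of the string) in place of $. The \s* before \Z is an added convenience that trims a trailing newline; TAIL, LOG, and ENTRY are illustrative names.

```python
import re

# Sample log from the question; TAIL is the part shared by every entry.
TAIL = "[TEXT] aabb33-ddee33-54321 (host-1-usa-east) this.is.sample.log: service is responding normal"
LOG = (
    f"01 Jan 2018 04:25:56,546 {TAIL}\n"
    f"02 Jan 2018 05:25:56,546 {TAIL}\n"
    "extra line 1\n"
    "extra line 2\n"
    f"03 Jan 2018 08:25:56,546 {TAIL}\n"
)

ENTRY = re.compile(
    # Timestamp at the start of a line, then a lazy message body.
    r"^(\d{2} [A-Za-z]{3} \d{4} \d{2}:\d{2}:\d{2}(?:,\d{3})?)\s?(.*?)"
    # Stop before the next dated line, or at the end of the whole string.
    r"(?=\n\d{2} [A-Za-z]{3} \d{4}|\s*\Z)",
    re.MULTILINE | re.DOTALL,  # DOTALL plays the role of the s flag
)

entries = ENTRY.findall(LOG)
for timestamp, message in entries:
    print(timestamp, "->", repr(message))
```

Each element of entries is a (timestamp, message) pair, and the message of the second entry includes both extra lines.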

Also remember that you can simplify [0-9] to \d.

If you can't use the s flag (which allows the dot to match a newline), then instead of repeating the dot to capture the (possibly multiline) text after the date, use [\s\S], which matches any character at all (every non-whitespace character plus every whitespace character):

([\s\S]*?)
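
A small Python sketch of the difference, again assuming the re module and using ^ with re.MULTILINE and \Z as stand-ins for the lookbehind and $; all names are illustrative. It compares the dot under DOTALL, [\s\S] without DOTALL, and the plain dot without DOTALL.

```python
import re

# Sample log from the question; TAIL is the part shared by every entry.
TAIL = "[TEXT] aabb33-ddee33-54321 (host-1-usa-east) this.is.sample.log: service is responding normal"
LOG = (
    f"01 Jan 2018 04:25:56,546 {TAIL}\n"
    f"02 Jan 2018 05:25:56,546 {TAIL}\n"
    "extra line 1\n"
    "extra line 2\n"
    f"03 Jan 2018 08:25:56,546 {TAIL}\n"
)

TS = r"\d{2} [A-Za-z]{3} \d{4} \d{2}:\d{2}:\d{2}(?:,\d{3})?"
LOOKAHEAD = r"(?=\n\d{2} [A-Za-z]{3} \d{4}|\s*\Z)"

# Dot with DOTALL: the body may span several lines.
with_flag = re.findall("^(" + TS + r")\s?(.*?)" + LOOKAHEAD, LOG, re.MULTILINE | re.DOTALL)

# [\s\S] without DOTALL: same result, no flag needed for the body.
without_flag = re.findall("^(" + TS + r")\s?([\s\S]*?)" + LOOKAHEAD, LOG, re.MULTILINE)

# Plain dot without DOTALL: the multi-line entry cannot reach its lookahead
# and is skipped entirely.
dot_only = re.findall("^(" + TS + r")\s?(.*?)" + LOOKAHEAD, LOG, re.MULTILINE)

print(len(with_flag), len(without_flag), len(dot_only))
```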

I can offer the following regex which works fine, except that it doesn't capture the very last log entry in your file:

([0-9]{2}\s[A-Za-z]{3}\s[0-9]{4}\s[0-9]{2}:[0-9]{2}:[0-9]{2}(?:,[0-9]{3})?)\s?(.*?)(?=[0-9]{2}\s[A-Za-z]{3}\s[0-9]{4}\s[0-9]{2}:[0-9]{2}:[0-9]{2}(?:,[0-9]{3}))

In short, I added a lookahead at the end of your pattern, after the (.*), which stops the match when it reaches the start of the next log entry. The only other change is to use (.*?), i.e. to make the dot lazy so that it stops at the lookahead.

Also, this regex should be run in dot-all mode, where .* matches across lines. If dot-all mode is not explicitly available, you may be able to use [\s\S]* as an alternative.
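
The following Python sketch (re module assumed; STAMP, HEAD, and NEXT are illustrative names) runs this pattern in dot-all mode and shows why the last entry is missed: nothing follows it, so the lookahead can never succeed there. One possible remedy, which is an addition and not part of the pattern above, is to accept the end of the string as an alternative via \Z.

```python
import re

# Sample log from the question; TAIL is the part shared by every entry.
TAIL = "[TEXT] aabb33-ddee33-54321 (host-1-usa-east) this.is.sample.log: service is responding normal"
LOG = (
    f"01 Jan 2018 04:25:56,546 {TAIL}\n"
    f"02 Jan 2018 05:25:56,546 {TAIL}\n"
    "extra line 1\n"
    "extra line 2\n"
    f"03 Jan 2018 08:25:56,546 {TAIL}\n"
)

STAMP = r"[0-9]{2}\s[A-Za-z]{3}\s[0-9]{4}\s[0-9]{2}:[0-9]{2}:[0-9]{2}"
HEAD = "(" + STAMP + r"(?:,[0-9]{3})?)\s?(.*?)"  # timestamp + lazy body
NEXT = STAMP + r"(?:,[0-9]{3})"                  # start of the next entry

# Pattern as posted: the final entry has no following timestamp, so it is lost.
as_posted = re.findall(HEAD + "(?=" + NEXT + ")", LOG, re.DOTALL)

# Assumed tweak: also stop at the end of the string.
with_end = re.findall(HEAD + "(?=" + NEXT + r"|\Z)", LOG, re.DOTALL)

print(len(as_posted), len(with_end))
```

In this form each captured message keeps its trailing newline, so calling strip() on it may be useful.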

