
Regex to capture two groups from record

I am working on an ETL to handle the parsing of machine generated logs. These logs resemble flattened json files as csv files. The payload of the json (and its length) depend on the log type, for example error, alarm, ...

Every so often, a corrupt line occurs in the log files. These corrupt lines combine two lines into one and start with the special character \x00, so they can be identified. Still, I would like to retrieve and separate the two original lines from the corrupt line.

Data example (the corrupt line is line 3):

log file
2019.09.12 07:32:00,121,INIED
2019.09.12 09:21:50,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!
\x00 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine 2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!
2019.09.12 10:52:38,209,RESUM

Ideally the corrupt record \x00 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine 2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !! would be retrieved as

  • group 1: 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine
  • group 2: 2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!

I started with the capturing group \d{4}.\d{2}.\d{2} \d{2}:\d{2}:\d{2}(.*) to get everything after the timestamp. This seemed the easiest method, as I cannot assume that the length of the line is fixed (due to the flattened json).

Questions:

  • I am unsure how to terminate my capturing group. I was thinking to use the end of the line or the next timestamp it finds. Any advice to combine these clauses?
  • In addition, this method removes the timestamps themselves from the capturing group. Should I use a different method?

Answer:

  1. As you suspected, the capturing group should be terminated by either the end of the line or the next timestamp, combined in an OR clause.
  2. Since you want the timestamp and the text together, the capturing group should wrap the entire expression rather than just (.*): (\d{4}.\d{2}.\d{2} \d{2}:\d{2}:\d{2}.*)

So the combination of these two would be:

(\d{4}.\d{2}.\d{2} \d{2}:\d{2}:\d{2}.*?)(?:$|(?=\d{4}.\d{2}.\d{2} \d{2}:\d{2}:\d{2}))

The OR clause is a non-capturing group made up of the end of the line ($) and a positive lookahead for the next timestamp. Because the quantifier .*? is lazy, the capture stops at whichever of the two comes first.
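
As a minimal sketch of how this expression could be applied in Python (the question does not say which language the ETL uses, so that is an assumption), the snippet below runs it on the corrupt record from the question. The names TIMESTAMP, RECORD and split_records are made up for illustration, and the dots in the date are escaped as \. so they only match a literal dot, which is a small tightening of the expression above rather than part of it.

```python
import re

# Timestamp such as "2019.09.12 10:04:46"; dots are escaped so they match literally.
TIMESTAMP = r"\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}"

# Capture a timestamp plus everything after it, lazily, up to either the end
# of the line or the position just before the next timestamp (lookahead).
RECORD = re.compile("(" + TIMESTAMP + ".*?)(?:$|(?=" + TIMESTAMP + "))")


def split_records(line):
    """Return every timestamped record found in a single log line."""
    # findall returns the content of the single capturing group per match;
    # strip() drops the space left between two merged records.
    return [record.strip() for record in RECORD.findall(line)]


corrupt = (
    "\x00 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine "
    "2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!"
)

for number, record in enumerate(split_records(corrupt), start=1):
    print(f"group {number}: {record}")
```

Without the strip() call the first record would keep the trailing space that separates it from the second timestamp, because the lazy capture runs right up to the lookahead position.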

You can test and validate expressions on https://regexr.com/; it is worth trying.
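
Besides an online tester, the expression can also be checked offline against the whole sample log from the question. The following is only a sketch under assumptions: Python is assumed as the language, iter_records and SAMPLE_LOG are hypothetical names, the \x00 marker is assumed to be a real NUL character, and the dots in the date are escaped. Since the pattern simply searches for timestamps, the same loop handles intact and corrupt lines without testing for \x00 first.

```python
import re

TS = r"\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}"
RECORD = re.compile("(" + TS + ".*?)(?:$|(?=" + TS + "))")

# The four lines of the sample log; the third one is the corrupt line.
SAMPLE_LOG = [
    "2019.09.12 07:32:00,121,INIED",
    "2019.09.12 09:21:50,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!",
    "\x00 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine "
    "2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!",
    "2019.09.12 10:52:38,209,RESUM",
]


def iter_records(lines):
    """Yield one cleaned record per timestamp found in the given lines."""
    for line in lines:
        for record in RECORD.findall(line):
            yield record.strip()


records = list(iter_records(SAMPLE_LOG))

# Four physical lines should yield five logical records.
assert len(records) == 5
for record in records:
    print(record)
```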
