I am working on an ETL to parse machine-generated logs. These logs resemble JSON files flattened into CSV files. The payload of the JSON (and its length) depends on the log type, for example error, alarm, ...
Every so often, a corrupt line occurs in the log files. These corrupt lines combine two lines into a single line and start with the special character \x00. As such, these corrupt lines can be identified. Still, I would like to retrieve and separate the two original lines from each corrupt line.
Data example (the corrupt line is line 3):
log file |
---|
2019.09.12 07:32:00,121,INIED |
2019.09.12 09:21:50,611,ALARM ,E,303,ARM 2 VACUUM ERROR !! |
\x00 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine 2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !! |
2019.09.12 10:52:38,209,RESUM |
Ideally the corrupt record \x00 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine 2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!
would be retrieved as
2019.09.12 10:04:46,611,ALARM ,O,501, Check machine
2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!
I started with the capturing group \d{4}.\d{2}.\d{2} \d{2}:\d{2}:\d{2}(.*)
to get everything after the timestamps. This seemed the easiest method, as I cannot assume that the length of the line is fixed (due to the flattened json).
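To illustrate why this first attempt is not enough on its own, here is a minimal Python sketch (Python is an assumption, since the question does not name a language; the variable names are illustrative only). The greedy (.*) runs to the end of the line, so on the corrupt line the capture swallows the second record as well.

```python
import re

# The first attempt from the question, unchanged.
first_attempt = re.compile(r"\d{4}.\d{2}.\d{2} \d{2}:\d{2}:\d{2}(.*)")

# The corrupt line from the data example (two records merged, prefixed by \x00).
corrupt = (
    "\x00 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine "
    "2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!"
)

# search() finds the first timestamp; group(1) is everything after it.
captured = first_attempt.search(corrupt).group(1)

# The capture still contains the second timestamp and its payload.
print(captured)
```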
Questions:
So the combination of these two would be:
(\d{4}.\d{2}.\d{2} \d{2}:\d{2}:\d{2}.*?)(?:$|(?=\d{4}.\d{2}.\d{2} \d{2}:\d{2}:\d{2}))
The OR clause is a non-capturing group consisting of the end of the line ('$') and a positive lookahead for the next timestamp. Because .*? is lazy, each match stops at whichever of the two comes first.
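As a rough sketch of how the combined expression could be applied, assuming Python's re module (the question does not name a language), the snippet below splits one line into its records. The names TIMESTAMP, RECORD and split_records are hypothetical. The dots in the date are escaped here so that they match only a literal '.', which is a small tightening of the pattern above, not part of it.

```python
import re

# Timestamp such as "2019.09.12 10:04:46"; dots escaped (an assumption, see above).
TIMESTAMP = r"\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}"

# A record is a timestamp plus as little text as possible, ending at
# the end of the line or just before the next timestamp (lookahead).
RECORD = re.compile("(" + TIMESTAMP + r".*?)(?:$|(?=" + TIMESTAMP + "))")


def split_records(line: str) -> list[str]:
    """Return every timestamped record found in a single log line."""
    # findall() returns the contents of the one capturing group per match;
    # strip() drops the space left between two merged records.
    return [record.strip() for record in RECORD.findall(line)]


corrupt = (
    "\x00 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine "
    "2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!"
)
records = split_records(corrupt)
for record in records:
    print(record)
```

The leading \x00 needs no special handling, because matching starts at the first timestamp. A regular line passes through the same function as a single record. One caveat: a payload that itself contains text shaped like a timestamp would be split at that point as well.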
You can use the site https://regexr.com/ to test and validate expressions.