Regex to capture two groups from record

I am working on an ETL job that parses machine-generated logs. These logs resemble JSON files flattened into CSV files. The JSON payload (and its length) depends on the log type, for example error, alarm, ...

Every so often, a corrupt line occurs in the log files. These corrupt lines combine two lines into a single line and start with the special character \x00, so they can be identified. Still, I would like to retrieve and separate the two original lines from the corrupt line.

Data example (the corrupt line is line 3):

log file
2019.09.12 07:32:00,121,INIED
2019.09.12 09:21:50,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!
\x00 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine 2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!
2019.09.12 10:52:38,209,RESUM
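
As a minimal sketch of how the corrupt line in this sample could be flagged, the snippet below checks each line for a leading NUL character. It assumes the marker is a real `\x00` character in the file rather than the four literal characters backslash, x, 0, 0; the names `LOG_LINES` and `is_corrupt` are hypothetical.

```python
# Sample lines from the question. Assumption: the corrupt line starts with
# an actual NUL character ("\x00"), not the literal text "\x00".
LOG_LINES = [
    "2019.09.12 07:32:00,121,INIED",
    "2019.09.12 09:21:50,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!",
    "\x00 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine "
    "2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!",
    "2019.09.12 10:52:38,209,RESUM",
]


def is_corrupt(line: str) -> bool:
    """Return True if the line carries the NUL prefix that marks corruption."""
    return line.startswith("\x00")


# 1-based line numbers of the corrupt lines in the sample.
corrupt_line_numbers = [
    number for number, line in enumerate(LOG_LINES, start=1) if is_corrupt(line)
]
```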

Ideally, the corrupt record \x00 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine 2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !! would be retrieved as:

  • group 1: 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine
  • group 2: 2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!

I started with the capturing group \d{4}.\d{2}.\d{2} \d{2}:\d{2}:\d{2}(.*) to get everything after the timestamp. This seemed the easiest method, as I cannot assume that the length of the line is fixed (due to the flattened JSON).
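
To illustrate why this first attempt needs a terminator, here is a small sketch that applies the pattern to the corrupt record from the sample. Because `(.*)` is greedy, the group runs to the end of the line and swallows the second record as well. The variable names are hypothetical, and the NUL prefix is assumed to be a real `\x00` character.

```python
import re

# The pattern from the question, unchanged: only the text after the first
# timestamp is captured, and the greedy (.*) runs to the end of the line.
FIRST_ATTEMPT = re.compile(r"\d{4}.\d{2}.\d{2} \d{2}:\d{2}:\d{2}(.*)")

# Assumption: the corrupt record starts with an actual NUL character.
corrupt = (
    "\x00 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine "
    "2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!"
)

# group(1) holds everything after the first timestamp, second record included.
captured = FIRST_ATTEMPT.search(corrupt).group(1)
```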

Questions:

  • I am unsure how to terminate my capturing group. I was thinking of using the end of the line or the next timestamp it finds. Any advice on how to combine these clauses?
  • In addition, this method removes the timestamps themselves from the capturing group. Should I use a different method?

  1. As you were thinking, the capturing group should be terminated by either the end of the line or the next timestamp, combined in an OR clause.
  2. Since you want the timestamp and the text together, the capturing group should wrap the entire expression rather than just (.*): (\d{4}.\d{2}.\d{2} \d{2}:\d{2}:\d{2}.*)

So the combination of these two would be:

(\d{4}.\d{2}.\d{2} \d{2}:\d{2}:\d{2}.*?)(?:$|(?=\d{4}.\d{2}.\d{2} \d{2}:\d{2}:\d{2}))

The OR clause is a non-capturing group made up of the end of the line '$' and a positive lookahead for the next timestamp. The .*? inside the capturing group is lazy, so each match stops at whichever of the two comes first.
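
As a rough sketch of how the combined expression might be used in Python, the snippet below runs it over a line with `re.findall`. Two details differ from the expression above and are assumptions of this example: the date separators are escaped as `\.` so they match only a literal dot, and each result is stripped because the lazy match keeps the space that precedes the next timestamp. The names `TIMESTAMP`, `RECORD` and `split_records` are hypothetical.

```python
import re

# Timestamp such as "2019.09.12 10:04:46". The dots are escaped here
# (an assumption of this sketch) so they match a literal "." only.
TIMESTAMP = r"\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}"

# One record: a timestamp plus as little text as possible, terminated by
# the end of the line or by a lookahead for the next timestamp.
RECORD = re.compile(rf"({TIMESTAMP}.*?)(?:$|(?={TIMESTAMP}))")


def split_records(line: str) -> list[str]:
    """Return every timestamped record found in a (possibly corrupt) line."""
    # strip() removes the space left in front of the next timestamp.
    return [record.strip() for record in RECORD.findall(line)]


# Assumption: the corrupt record starts with an actual NUL character.
corrupt = (
    "\x00 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine "
    "2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!"
)
records = split_records(corrupt)
```

Because the pattern starts matching at the first timestamp, the leading `\x00` and space are skipped without special handling, and a healthy line simply comes back as a one-element list.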

You can use the site https://regexr.com/ to test and validate expressions; it is worth trying.
