
XML Log file regex

A legacy system I cannot change is pumping out 5 GB of mostly awful XML logs per day and blowing my ingestion licence. Two classes of verbose error occur 1000+ times a minute, but every few minutes there is one genuinely interesting entry. I'd like to drastically shorten the repeating entries with sed and leave the interesting ones untouched.

So what I need:
1. Regexes to match each of the 2 classes of annoying log entry (e.g. ...'decimal'... and ...'DBNull'...), but not the occasional interesting ones. One regex per annoying error class is fine; I can do 2 sed passes.
2. A capture group with the timestamp, so I can replace the long XML lines with a succinct version but keep the correct timestamp so as not to lose fidelity.

I've gotten as far as this to match the entry and capture the created date:

(?:<Log).*?(createdDate="\d{2}\/\d{2}\/\d{4}.\d{2}:\d{2}:\d{2}").*?(?:decimal).*?(<\/Log>)

which is close, but suffers from a kind of reverse greediness: the match runs from 'decimal' back to an opening Log tag several entries earlier. I have played around with negative look-behind but have just given myself a severe headache.

Sample Data

<Log type="ERROR" createdDate="11/09/2015 08:13:14" > 
 <![CDATA[ [108] -- much cruft removed-- SerializationException: There was an error deserializing the object of type Common.DataCtract.QResult. The value '' cannot be parsed as the type 'decimal'. ---> System.Xml.XmlException: The value '' cannot be parsed as the type 'decimal'. ---> System.FormatException: Input string was not in a correct format.
  ]]></Log> 

<Log type="ERROR" createdDate="11/09/2015 08:13:13" > 
 <![CDATA[ [108] -- much cruft removed-- SerializationException: There was an error deserializing the object of type Common.DataCtract.QResult. The value '' cannot be parsed as the type 'decimal'. ---> System.Xml.XmlException: The value '' cannot be parsed as the type 'decimal'. ---> System.FormatException: Input string was not in a correct format.
  ]]></Log> 

<Log type="ERROR" createdDate="11/09/2015 08:13:12" > 
 <![CDATA[ [129] Services.DService.D.FailedToAddRQ(Exceptionex, RQEntityrQ, RHeaderEntityrHeader, StringPRef, ): FailedToAddRQ()...with parameters [pRef:=123,0,1], [rQ.AffinityCode:=],[Q.thing=thing][rQ.AffinityRQDT:=123],[rHeader.RHeaderIDPK:=123],[rQ.UWriteIDFK:=] 
  Data.DataAccessLayerException: Conversion from type 'DBNull' to type 'Long' is not valid.
Parameters:
 [RETURN_VALUE][ReturnValue] Value: [0]
 ---> System.InvalidCastException: Conversion from type 'DBNull' to type 'Long' is not valid.
 ]]></Log> 

 <Log type="ERROR" createdDate="11/09/2015 08:13:11" > 
 <![CDATA[ [129] Services.DService.D.FailedToAddRQ(Exceptionex, RQEntityrQ, RHeaderEntityrHeader, StringPRef, ): FailedToAddRQ()...with parameters [pRef:=123,0,1], [rQ.AffinityCode:=],[Q.thing=thing][rQ.AffinityRQDT:=123],[rHeader.RHeaderIDPK:=123],[rQ.UWriteIDFK:=] 
  Data.DataAccessLayerException: Conversion from type 'DBNull' to type 'Long' is not valid.
  ]]></Log> 

 <Log type="ERROR" createdDate="11/09/2015 08:13:10" > 
 <![CDATA[ [231] An actual interesting log entry with a real error message ]]></Log>

<Log type="ERROR" createdDate="11/09/2015 08:13:09" > 
 <![CDATA[ [108] -- much cruft removed-- SerializationException: There was an error deserializing the object of type Common.DataCtract.QResult. The value '' cannot be parsed as the type 'decimal'. ---> System.Xml.XmlException: The value '' cannot be parsed as the type 'decimal'. ---> System.FormatException: Input string was not in a correct format.
  ]]></Log> 

Not sure what you are exactly looking for, but this is an example of how you can isolate <Log...</Log> blocks and proceed to a replacement:

sed '/^<Log /{:a;/<\/Log>/!{N;ba;};s/>.*\(decimal\|DBNull\).*</>\1</}' file.log

details:

/^<Log / { # condition: a line that starts with "<Log "
    :a;    # define the label "a"
    /<\/Log>/! { # condition: if the line doesn't contain "</Log>"
        N;       # append the next line to the pattern space
        ba;      # go to the label "a"
    };
    s/>.*\(decimal\|DBNull\).*</>\1</ # replace the block
}
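
To see what the command does to the data, here is a small sketch that runs it on a trimmed-down sample. It assumes GNU sed (the `\|` alternation and the `;`-terminated labels are GNU extensions); the `shorten_logs` function name and the temporary file are made up for illustration.

```shell
# Sketch only: assumes GNU sed. The function name and temp file are hypothetical.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<Log type="ERROR" createdDate="11/09/2015 08:13:14" >
 <![CDATA[ [108] The value '' cannot be parsed as the type 'decimal'.
  ]]></Log>
<Log type="ERROR" createdDate="11/09/2015 08:13:10" >
 <![CDATA[ [231] An actual interesting log entry with a real error message ]]></Log>
EOF

shorten_logs() {
    # Join each <Log ...> ... </Log> block into the pattern space, then
    # collapse everything between the opening tag and the last "<"
    # to the matched keyword.
    sed '/^<Log /{:a;/<\/Log>/!{N;ba;};s/>.*\(decimal\|DBNull\).*</>\1</}' "$1"
}

result=$(shorten_logs "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

The opening tag is left alone, so the createdDate attribute survives without needing its own capture group; only the CDATA body of a noisy entry is reduced to the keyword. Entries containing neither keyword pass through unchanged.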

(I assumed <Log is always at the start of the line, unlike the records at seconds 10 and 11 (08:13:10 and 08:13:11), whose leading space is probably a typo.)
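
If the leading space is not a typo, the address can be loosened to allow optional whitespace before <Log. The following is a sketch of that variant, not part of the original answer: it assumes a sed that supports `-E` (GNU and BSD sed both do), and the `shorten_logs_ws` name and sample file are made up for illustration.

```shell
# Sketch only: whitespace-tolerant variant. Names and temp file are hypothetical.
sample=$(mktemp)
cat > "$sample" <<'EOF'
 <Log type="ERROR" createdDate="11/09/2015 08:13:11" >
 <![CDATA[ [129] Conversion from type 'DBNull' to type 'Long' is not valid.
  ]]></Log>
EOF

shorten_logs_ws() {
    # Same loop as before, written one command per line, with
    # [[:space:]]* allowing indentation before the opening tag.
    sed -E '
/^[[:space:]]*<Log /{
:a
/<\/Log>/!{
N
ba
}
s/>.*(decimal|DBNull).*</>\1</
}
' "$1"
}

result=$(shorten_logs_ws "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Writing the script across several lines avoids relying on `;` to terminate labels, and `-E` replaces the `\|` alternation with a plain `|`.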
