
XML Log file regex

A legacy system I cannot change is pumping out 5 GB of mostly awful XML logs per day and blowing my ingestion licence. Two classes of verbose errors occur 1000+ times a minute, but every few minutes there is one genuinely interesting entry. I'd like to drastically shorten the repeating entries with sed and keep the interesting ones untouched.

So what I need:
1. Regexes to match each of the 2 classes of annoying log entry (e.g. ...'decimal'... and ...'DBNull'..., but not the occasional interesting ones).
One regex per annoying error class is fine; I can do 2 sed passes.
2. A capture group with the timestamp, so I can replace the long XML lines with a succinct version that keeps the correct timestamp and loses no fidelity.

I've gotten as far as this to match the entry and capture the created date:

(?:<Log).*?(createdDate="\d{2}\/\d{2}\/\d{4}.\d{2}:\d{2}:\d{2}").*?(?:decimal).*?(<\/Log>)

which is close, but suffers from a kind of reverse greediness: the match runs from 'decimal' back to an opening Log tag several entries earlier. I have played around with negative look-behind but have just given myself a severe headache.

Sample Data

<Log type="ERROR" createdDate="11/09/2015 08:13:14" > 
 <![CDATA[ [108] -- much cruft removed-- SerializationException: There was an error deserializing the object of type Common.DataCtract.QResult. The value '' cannot be parsed as the type 'decimal'. ---> System.Xml.XmlException: The value '' cannot be parsed as the type 'decimal'. ---> System.FormatException: Input string was not in a correct format.
  ]]></Log> 

<Log type="ERROR" createdDate="11/09/2015 08:13:13" > 
 <![CDATA[ [108] -- much cruft removed-- SerializationException: There was an error deserializing the object of type Common.DataCtract.QResult. The value '' cannot be parsed as the type 'decimal'. ---> System.Xml.XmlException: The value '' cannot be parsed as the type 'decimal'. ---> System.FormatException: Input string was not in a correct format.
  ]]></Log> 

<Log type="ERROR" createdDate="11/09/2015 08:13:12" > 
 <![CDATA[ [129] Services.DService.D.FailedToAddRQ(Exceptionex, RQEntityrQ, RHeaderEntityrHeader, StringPRef, ): FailedToAddRQ()...with parameters [pRef:=123,0,1], [rQ.AffinityCode:=],[Q.thing=thing][rQ.AffinityRQDT:=123],[rHeader.RHeaderIDPK:=123],[rQ.UWriteIDFK:=] 
  Data.DataAccessLayerException: Conversion from type 'DBNull' to type 'Long' is not valid.
Parameters:
 [RETURN_VALUE][ReturnValue] Value: [0]
 ---> System.InvalidCastException: Conversion from type 'DBNull' to type 'Long' is not valid.
 ]]></Log> 

 <Log type="ERROR" createdDate="11/09/2015 08:13:11" > 
 <![CDATA[ [129] Services.DService.D.FailedToAddRQ(Exceptionex, RQEntityrQ, RHeaderEntityrHeader, StringPRef, ): FailedToAddRQ()...with parameters [pRef:=123,0,1], [rQ.AffinityCode:=],[Q.thing=thing][rQ.AffinityRQDT:=123],[rHeader.RHeaderIDPK:=123],[rQ.UWriteIDFK:=] 
  Data.DataAccessLayerException: Conversion from type 'DBNull' to type 'Long' is not valid.
  ]]></Log> 

 <Log type="ERROR" createdDate="11/09/2015 08:13:10" > 
 <![CDATA[ [231] An actual interesting log entry with a real error message ]]></Log>

<Log type="ERROR" createdDate="11/09/2015 08:13:09" > 
 <![CDATA[ [108] -- much cruft removed-- SerializationException: There was an error deserializing the object of type Common.DataCtract.QResult. The value '' cannot be parsed as the type 'decimal'. ---> System.Xml.XmlException: The value '' cannot be parsed as the type 'decimal'. ---> System.FormatException: Input string was not in a correct format.
  ]]></Log> 

Not sure what you are exactly looking for, but this is an example of how you can isolate <Log...</Log> blocks and proceed to a replacement:

sed '/^<Log /{:a;/<\/Log>/!{N;ba;};s/>.*\(decimal\|DBNull\).*</>\1</}' file.log

details:

/^<Log / { # condition: a line that starts with "<Log "
    :a;    # define the label "a"
    /<\/Log>/! { # condition: if the line doesn't contain "</Log>"
        N;       # append the next line to the pattern space
        ba;      # go to the label "a"
    };
    s/>.*\(decimal\|DBNull\).*</>\1</ # replace the block
}

(I assumed <Log is always at the start of the line, unlike the records with createdDate 08:13:10 and 08:13:11, whose leading space is probably a typo.)
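
If the leading spaces are not typos, the same block-collecting loop can be adapted. The following is only a sketch, assuming GNU sed with -E (extended regex syntax); the file names sample.log and short.log are hypothetical, and the sample entries are shortened versions of the ones in the question. It allows optional whitespace before <Log, keeps the whole opening tag (and therefore the createdDate timestamp) in capture group 1, and replaces the body with the matched keyword:

```shell
# Hypothetical test file: shortened entries from the question's sample data.
cat > sample.log <<'EOF'
<Log type="ERROR" createdDate="11/09/2015 08:13:14" >
 <![CDATA[ [108] The value '' cannot be parsed as the type 'decimal'.
  ]]></Log>
 <Log type="ERROR" createdDate="11/09/2015 08:13:11" >
 <![CDATA[ [129] Conversion from type 'DBNull' to type 'Long' is not valid.
  ]]></Log>
 <Log type="ERROR" createdDate="11/09/2015 08:13:10" >
 <![CDATA[ [231] An actual interesting log entry with a real error message ]]></Log>
EOF

# For each line that opens a <Log ...> block (leading whitespace allowed):
#   - keep appending lines until the pattern space contains </Log>
#   - if the block mentions decimal or DBNull, collapse it to
#     "<opening tag>keyword</Log>"; other blocks are printed unchanged.
sed -E '/^[[:space:]]*<Log /{
  :a
  /<\/Log>/!{
    N
    ba
  }
  s/^[[:space:]]*(<Log [^>]*createdDate="[^"]*"[^>]*>).*(decimal|DBNull).*(<\/Log>)[[:space:]]*$/\1\2\3/
}' sample.log > short.log

cat short.log
```

Because each block is gathered into the pattern space before the substitution runs, the match cannot reach back into an earlier entry. Note that an interesting entry that happens to contain the word decimal or DBNull would also be shortened; tightening the alternation (for example to the full error phrase) would avoid that.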
