Extract specific XML pattern from log file using 'awk'

I would like to extract the following XML from a log file that contains mostly Java log data (debug/errors/info):

<envelope>
    <header>
        ...
    </header>
    <body>
        <Provision>
            <ORDER id="XYZ_123_456" action="test">
                ....
            </ORDER>
        </Provision>
    </body>
</envelope>

I only need the envelope that has the "Provision" tag and contains the ORDER id XYZ_123_456.

I've tried the following, but it also returns XMLs without the Provision tag. (I'm nearly clueless in awk; this is code I've modified for this particular need.)

awk '/<envelope>/ {line=$0; p=0 && x=0; next}
     line   {line=line ORS $0}
    /ORDER/ && $2~/XYZ_123_456/ {p=1}
    $0~/<Provision>/ {x=1}
   /<\/envelope>/ && p && x {print line;}' dump.file

Thanks!

You shouldn't parse XML with awk. Better use xmlstarlet. This will print the whole envelope:

$ apt-get install xmlstarlet
$ xmlstarlet sel -t -c '/envelope/body/Provision/ORDER[@id="XYZ_123_456"]/../../..' file.xml
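
The command above expects file.xml to be a well-formed XML document, while the file in the question is a log with non-XML lines mixed in. One possible way to bridge that gap is sketched below. It assumes that the <envelope> and </envelope> tags sit on lines of their own with no log prefix on the same line, and it wraps the extracted blocks in a made-up <root> element so the result has a single root. The function name extract_envelopes is hypothetical, not part of any tool.

```shell
# Sketch: cut the envelope blocks out of a mixed log and wrap them in one root.
# Assumes <envelope> and </envelope> appear on lines without any log prefix.
extract_envelopes() {
    # $1 = log file; <root> is an arbitrary wrapper name chosen for this sketch
    awk 'BEGIN { print "<root>" }
         /<envelope>/, /<\/envelope>/ { print }   # range pattern: keep each block
         END { print "</root>" }' "$1"
}
```

The output could then be piped into xmlstarlet, which should read standard input when no file is given, for example: extract_envelopes dump.file | xmlstarlet sel -t -c '//envelope[body/Provision/ORDER/@id="XYZ_123_456"]'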

For awk, I propose this:

awk '
    !el&&/<envelope>/{el=1}
    el==1&&/<body>/{el=2}
    el==2&&/<Provision>/{el=3}
    el==3&&/<ORDER.*id="XYZ_123_456"/{el=4;hit=1}
    el>0{buffer=buffer $0 ORS}
    el==4&&/<\/ORDER>/{el=3}
    el==3&&/<\/Provision>/{el=2}
    el==2&&/<\/body>/{el=1}
    el==1&&/<\/envelope>/{el=0;if (hit) print buffer; buffer="";hit=0}
' file.xml

This checks for the correct XML structure and prints the whole envelope, provided the XML elements each come on a separate line.

If your XML or log file is as well-formed as you claim, you can (ab)use awk and its RS record separator to do most of the parsing for you:

awk 'BEGIN{ RS="</envelope>"; FS="<envelope>" }; $2 ~ order { print "<envelope>" $2 "</envelope>" }' order=XYZ_123_456 tmp.txt

This works by defining </envelope> as the awk record separator, so each record holds everything between two </envelope> strings. To strip the other log messages, I use the FS field separator to split the record into columns and output the second column.

This will fail horribly if an <envelope> or </envelope> string happens to appear anywhere else in your log data, but you've already been pointed towards better XML parsers.

As the above solution requires GNU awk for the multi-character RS, here is the same solution in perl, in case no suitable awk version is available:

perl -ne 'BEGIN{ $/="</envelope>";$order=shift }; /<envelope>.*$order.*/ms and print $&' XYZ_123_456 tmp.txt

$ cat tst.awk
/<envelope>/ { inEnv = 1 }
inEnv { env = env $0 ORS }
/<\/envelope>/ {
    if ( env ~ /<Provision>.*<ORDER[[:space:]]+id="XYZ_123_456"/ ) {
        printf "%s", env
    }
    env = inEnv = ""
}

$ awk -f tst.awk file
<envelope>
    <header>
        ...
    </header>
    <body>
        <Provision>
            <ORDER id="XYZ_123_456" action="test">
                ....
            </ORDER>
        </Provision>
    </body>
</envelope>
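
If the ORDER id should not be hard-coded, the same buffering idea can be wrapped in a small shell function. The following is only a sketch under a few assumptions: the function name find_envelope is made up for illustration, the id is compared as a literal string with index() rather than as a regex, and the check only requires that id="..." appears somewhere after <Provision> in the buffered envelope (it does not verify that the attribute belongs to an ORDER element).

```shell
# Sketch: print every envelope that contains <Provision> followed by the given id.
# find_envelope is a hypothetical helper name; $1 = ORDER id, $2 = log file.
find_envelope() {
    awk -v id="$1" '
        /<envelope>/ { inEnv = 1 }            # start buffering at the opening tag
        inEnv        { env = env $0 ORS }     # collect lines while inside an envelope
        /<\/envelope>/ {
            pos = index(env, "<Provision>")   # 0 if the envelope has no Provision
            if (pos && index(substr(env, pos), "id=\"" id "\""))
                printf "%s", env
            env = ""; inEnv = 0               # always reset for the next envelope
        }
    ' "$2"
}
```

It would be called as find_envelope XYZ_123_456 file. Because awk -v interprets backslash escapes in the assigned value, an id containing backslashes would need extra care.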
