I would like to extract the following XML from a log file that contains mostly Java log data (debug/error/info):
<envelope>
<header>
...
</header>
<body>
<Provision>
<ORDER id="XYZ_123_456" action="test">
....
</ORDER>
</Provision>
</body>
</envelope>
I only need the envelope that has the "Provision" tag and contains the ORDER id XYZ_123_456.
I've tried the following, but it also returns XML without the Provision tag. (I'm nearly clueless in awk; this is code I've modified for this particular need.)
awk '/<envelope>/ {line=$0; p=0 && x=0; next}
line {line=line ORS $0}
/ORDER/ && $2~/XYZ_123_456/ {p=1}
$0~/<Provision>/ {x=1}
/<\/envelope>/ && p && x {print line;}' dump.file
Thanks!
You shouldn't parse XML with awk; xmlstarlet is a better fit. Given a well-formed XML file, this prints the whole envelope:
$ apt-get install xmlstarlet
$ xmlstarlet sel -t -c '/envelope/body/Provision/ORDER[@id="XYZ_123_456"]/../../..' file.xml
For awk, I propose this:
awk '
!el&&/<envelope>/{el=1}
el==1&&/<body>/{el=2}
el==2&&/<Provision>/{el=3}
el==3&&/<ORDER.*id="XYZ_123_456"/{el=4;hit=1}
el>0{buffer=buffer $0 ORS}
el==4&&/<\/ORDER>/{el=3}
el==3&&/<\/Provision>/{el=2}
el==2&&/<\/body>/{el=1}
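# always reset the buffer, otherwise non-matching envelopes are printed along with the next hit
el==1&&/<\/envelope>/{if (hit) printf "%s",buffer; el=0; buffer=""; hit=0}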
' file.xml
This checks for the correct XML structure and prints the whole envelope, provided each XML element is on its own line.
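To try the state-machine idea on something concrete, here is a minimal sketch. The sample log, the variable names `sample_log` and `result`, and the `-v wanted=...` parameter are all made up for illustration; they are not part of the answer. It assumes, like the script above, that every element sits on its own line.

```shell
# Made-up sample log: two envelopes, only the second has the wanted ORDER id.
sample_log='2020-01-01 12:00:00 DEBUG demo.Client - sending request
<envelope>
<body>
<Provision>
<ORDER id="ABC_000_000" action="test">
</ORDER>
</Provision>
</body>
</envelope>
2020-01-01 12:00:01 INFO demo.Client - request done
<envelope>
<body>
<Provision>
<ORDER id="XYZ_123_456" action="test">
</ORDER>
</Provision>
</body>
</envelope>'

result=$(printf '%s\n' "$sample_log" | awk -v wanted='XYZ_123_456' '
# el is the current nesting depth: 1 envelope, 2 body, 3 Provision, 4 ORDER.
!el && /<envelope>/          { el = 1; buffer = ""; hit = 0 }
el == 1 && /<body>/          { el = 2 }
el == 2 && /<Provision>/     { el = 3 }
# index() does a literal substring test, so the id is not treated as a regex.
el == 3 && /<ORDER/ && index($0, "id=\"" wanted "\"") { el = 4; hit = 1 }
el > 0                       { buffer = buffer $0 ORS }
el == 4 && /<\/ORDER>/       { el = 3 }
el == 3 && /<\/Provision>/   { el = 2 }
el == 2 && /<\/body>/        { el = 1 }
el == 1 && /<\/envelope>/    { if (hit) printf "%s", buffer; el = 0 }
')

printf '%s\n' "$result"
```

Resetting `buffer` and `hit` when an envelope opens means that a non-matching envelope cannot leak into a later match. With this sample, only the second envelope should be printed.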
If your XML or logfile is as well-formed as you claim, you can (ab)use awk and its RS record separator feature to do most of the parsing for you:
awk 'BEGIN{ RS="</envelope>"; FS="<envelope>" }; $0 ~ order { print "<envelope>",$2,"</envelope>" }' order=XYZ_123_456 tmp.txt
This works by defining </envelope> as the awk record separator and then reading everything between </envelope> strings. To then strip/split other log messages, I use the FS field separator to split the "line" into columns, and output the second column.
This will fail horribly if any <envelope> or </envelope> string happens to appear anywhere else in your log data, but you've already been pointed towards better XML parsers.
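The one-liner above tests the order id against the whole record and does not look for the Provision tag, which the question asks for. A possible variant, sketched under the same assumptions (an awk that accepts a multi-character RS, and no stray envelope tags in the log), restricts both checks to the second field. The sample log and the variable names `sample_log` and `result` are invented for illustration.

```shell
# Made-up sample log: the same ORDER id appears under <Cancel> and under <Provision>.
sample_log='2020-01-01 12:00:00 INFO demo.Client - order XYZ_123_456 cancelled
<envelope>
<body>
<Cancel>
<ORDER id="XYZ_123_456" action="test">
</ORDER>
</Cancel>
</body>
</envelope>
2020-01-01 12:00:01 DEBUG demo.Client - provisioning
<envelope>
<body>
<Provision>
<ORDER id="XYZ_123_456" action="test">
</ORDER>
</Provision>
</body>
</envelope>'

result=$(printf '%s\n' "$sample_log" | awk -v order='XYZ_123_456' '
BEGIN { RS = "</envelope>"; FS = "<envelope>" }
# $1 holds the log noise before the opening tag, $2 the envelope content.
# The id is spliced into a dynamic regex, so it must not contain regex metacharacters.
NF > 1 && $2 ~ /<Provision>/ && $2 ~ ("<ORDER[^>]*id=\"" order "\"") {
    printf "<envelope>%s</envelope>\n", $2
}')

printf '%s\n' "$result"
```

Because only `$2` is inspected, an id that merely shows up in a log message before the envelope no longer triggers a match. With this sample, the `<Cancel>` envelope should be skipped and the `<Provision>` envelope printed.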
As the above solution requires GNU awk for multi-char RS, here is the same solution using perl for the case that no appropriate awk version is available:
perl -ne 'BEGIN{ $/="</envelope>";$order=shift }; /<envelope>.*$order.*/ms and print $&' XYZ_123_456 tmp.txt
$ cat tst.awk
/<envelope>/ { inEnv = 1 }
inEnv { env = env $0 ORS }
/<\/envelope>/ {
    if ( env ~ /<Provision>.*<ORDER[[:space:]]+id="XYZ_123_456"/ ) {
        printf "%s", env
    }
    env = inEnv = ""
}
$ awk -f tst.awk file
<envelope>
<header>
...
</header>
<body>
<Provision>
<ORDER id="XYZ_123_456" action="test">
....
</ORDER>
</Provision>
</body>
</envelope>
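The script above collects each envelope into one string and runs a single regex test when the closing tag arrives. If the order id should not be hard-coded, one option is to wrap the same logic in a shell function and pass the id with `-v`. This is only a sketch: the function name `extract_order`, the sample log and the variable names are hypothetical, and the id is assumed to contain no regex metacharacters.

```shell
# extract_order ID [FILE...]: hypothetical helper; reads stdin when no file is given.
extract_order() {
    id=$1
    shift
    awk -v id="$id" '
    /<envelope>/   { inEnv = 1 }
    inEnv          { env = env $0 ORS }
    /<\/envelope>/ {
        # One test per complete envelope: <Provision> must come before the ORDER.
        if (env ~ ("<Provision>.*<ORDER[[:space:]]+id=\"" id "\"")) {
            printf "%s", env
        }
        env = inEnv = ""
    }' "$@"
}

# Made-up sample log with three envelopes; only the last one should match.
sample_log='2020-01-01 12:00:00 INFO demo.Client - cancel
<envelope>
<body>
<Cancel>
<ORDER id="XYZ_123_456" action="test">
</ORDER>
</Cancel>
</body>
</envelope>
<envelope>
<body>
<Provision>
<ORDER id="ABC_000_000" action="test">
</ORDER>
</Provision>
</body>
</envelope>
2020-01-01 12:00:02 DEBUG demo.Client - provision
<envelope>
<body>
<Provision>
<ORDER id="XYZ_123_456" action="test">
</ORDER>
</Provision>
</body>
</envelope>'

result=$(printf '%s\n' "$sample_log" | extract_order XYZ_123_456)
printf '%s\n' "$result"
```

On a real log the call would look like `extract_order XYZ_123_456 dump.file`.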