
Extracting XML elements which contain a certain string with sed

I have a file like the one below:

  <AUDIT_RECORD TIMESTAMP="2013-07-30T17:52:29" NAME="Query" CONNECTION_ID="10" STATUS="0" SQLTEXT="show databases"/>
  <AUDIT_RECORD TIMESTAMP="2013-07-29T17:27:53" NAME="Quit" CONNECTION_ID="12" STATUS="0"/>
  <AUDIT_RECORD TIMESTAMP="2013-07-30T17:52:29" NAME="Query" CONNECTION_ID="10" STATUS="0" SQLTEXT="show grants for root@localhost"/>
  <AUDIT_RECORD TIMESTAMP="2013-07-30T17:52:29" NAME="Query" CONNECTION_ID="10" STATUS="0" SQLTEXT="create table stamp like paper"/>

Each record begins with <AUDIT_RECORD and ends with "/>", and a record might be spread across multiple lines.

I need the result to look like this:

  <AUDIT_RECORD TIMESTAMP="2013-07-30T17:52:29" NAME="Query" CONNECTION_ID="10" STATUS="0" SQLTEXT="show databases"/>
  <AUDIT_RECORD TIMESTAMP="2013-07-30T17:52:29" NAME="Query" CONNECTION_ID="10" STATUS="0" SQLTEXT="show grants for root@localhost"/>
  <AUDIT_RECORD TIMESTAMP="2013-07-30T17:52:29" NAME="Query" CONNECTION_ID="10" STATUS="0" SQLTEXT="create table stamp like paper"/>

For that purpose I used:

sed -n "/Query/,/\/>/p" file.txt

but it displays the entire file, including the record with the string "Quit".

Can anyone help me with this? Also, please let me know whether it is possible to match exactly the string "Query" (like grep -w "Query").

With GNU awk, you can set RS to more than one character:

$ cat file
<AUDIT_RECORD TIMESTAMP="2013-07-30T17:52:29" NAME="Query"
                CONNECTION_ID="10" STATUS="0" SQLTEXT="show databases"/>
<AUDIT_RECORD TIMESTAMP="2013-07-29T17:27:53"
        NAME="Quit" CONNECTION_ID="12" STATUS="0"/>
<AUDIT_RECORD
        TIMESTAMP="2013-07-30T17:52:29" NAME="Query" CONNECTION_ID="10"
     STATUS="0" SQLTEXT="show grants for root@localhost"/>
<AUDIT_RECORD
        TIMESTAMP="2013-07-30T17:52:29"
        NAME="Query"
        CONNECTION_ID="10"
        STATUS="0"
        SQLTEXT="create table stamp like paper"/>
$
$ gawk -v RS='\\/>\n' -v ORS= '/Query/{print $0 RT}' file
<AUDIT_RECORD TIMESTAMP="2013-07-30T17:52:29" NAME="Query"
                CONNECTION_ID="10" STATUS="0" SQLTEXT="show databases"/>
<AUDIT_RECORD
        TIMESTAMP="2013-07-30T17:52:29" NAME="Query" CONNECTION_ID="10"
     STATUS="0" SQLTEXT="show grants for root@localhost"/>
<AUDIT_RECORD
        TIMESTAMP="2013-07-30T17:52:29"
        NAME="Query"
        CONNECTION_ID="10"
        STATUS="0"
        SQLTEXT="create table stamp like paper"/>
$
$ gawk -v RS='\\/>\n' -v ORS= '/Query/{$1=$1; print $0 RT}' file
<AUDIT_RECORD TIMESTAMP="2013-07-30T17:52:29" NAME="Query" CONNECTION_ID="10" STATUS="0" SQLTEXT="show databases"/>
<AUDIT_RECORD TIMESTAMP="2013-07-30T17:52:29" NAME="Query" CONNECTION_ID="10" STATUS="0" SQLTEXT="show grants for root@localhost"/>
<AUDIT_RECORD TIMESTAMP="2013-07-30T17:52:29" NAME="Query" CONNECTION_ID="10" STATUS="0" SQLTEXT="create table stamp like paper"/>

I agree with @choroba that an XML parser is the right tool. However, if there isn't one available, you could try this awk script (it needs an awk that accepts a multi-character RS, such as GNU awk):

awk '/Query/{print RS" "$0}' RS='<AUDIT_RECORD' file
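Where only a plain POSIX awk is available, a multi-character RS cannot be relied on. The following is a minimal sketch of an alternative, assuming every record starts on a line containing <AUDIT_RECORD and that no line holds more than one "/>". It collects lines until the record closes and tests the whole record at once. Matching NAME="Query" with its quotes is one way to get the exact match asked about in the question, since a SQLTEXT value that merely mentions Query is not selected. The sample file and the variable names (sample, rec, result) are made up for illustration.

```shell
# Hypothetical sample input with records spread over several lines.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<AUDIT_RECORD TIMESTAMP="2013-07-30T17:52:29" NAME="Query"
        CONNECTION_ID="10" STATUS="0" SQLTEXT="show databases"/>
<AUDIT_RECORD TIMESTAMP="2013-07-29T17:27:53"
        NAME="Quit" CONNECTION_ID="12" STATUS="0"/>
<AUDIT_RECORD
        TIMESTAMP="2013-07-30T17:52:29"
        NAME="Query"
        CONNECTION_ID="10"
        STATUS="0"
        SQLTEXT="create table stamp like paper"/>
EOF

# Append each line to rec; when the closing "/>" arrives, print the
# record only if it carries the attribute NAME="Query", then reset rec.
result=$(awk '
  /<AUDIT_RECORD/ { rec = "" }
  { rec = rec $0 "\n" }
  /\/>/ {
    if (rec ~ /NAME="Query"/) printf "%s", rec
    rec = ""
  }
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

The original line breaks inside each record are kept; the Quit record is dropped as a whole.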

The input is probably XML. Use a proper parser to handle it, especially if the records span multiple lines. For example, with xsh:

open file.xml ;
remove //AUDIT_RECORD[not(@NAME="Query")] ;
save :b ;

My proposed sed solution, which deletes the Quit records instead of selecting the Query ones:

sed 's/<[^>]*\"Quit\"[^>]*>//' file.txt

For records spanning multiple lines, try:

sed '{:q;N;s/\n/ /g;t q}' file.txt | sed 's/<[^>]*\"Quit\"[^>]*>//'

To put a line feed back after each record:

... | sed 's|/>|/>\n|g'
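The range /Query/,/\/>/ from the question prints everything because sed starts looking for the end of a range only on the line after the one that opened it. With one record per line, each Query line opens a range and the next line closes it, whether or not that line is a Quit record. A sketch that stays within sed, assuming every record starts on a line containing <AUDIT_RECORD and that at most one record ends on any given line, gathers a complete record in the pattern space and prints it only if it contains NAME="Query". The sample file and the variable names (sample, result) are made up for illustration.

```shell
# Hypothetical sample input: one single-line record, two multi-line ones.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<AUDIT_RECORD TIMESTAMP="2013-07-30T17:52:29" NAME="Query" CONNECTION_ID="10" STATUS="0" SQLTEXT="show databases"/>
<AUDIT_RECORD TIMESTAMP="2013-07-29T17:27:53"
        NAME="Quit" CONNECTION_ID="12" STATUS="0"/>
<AUDIT_RECORD
        TIMESTAMP="2013-07-30T17:52:29" NAME="Query" CONNECTION_ID="10"
     STATUS="0" SQLTEXT="show grants for root@localhost"/>
EOF

# On a line that starts a record, keep appending the next line (N)
# until the pattern space contains "/>", then print the record only
# if it has the attribute NAME="Query".
result=$(sed -n '
/<AUDIT_RECORD/{
  :a
  /\/>/!{
    N
    ba
  }
  /NAME="Query"/p
}
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

The script is written one command per line so that it does not depend on how a particular sed treats a semicolon or closing brace after a label.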
