
extract text from xml elements using awk

I have a file with ~10k XML blocks of this type:

<!-- http://purl.obolibrary.org/obo/HP_0100516 -->

<owl:Class rdf:about="http://purl.obolibrary.org/obo/HP_0100516">
    <obo:IAO_0000115 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">The presence of a neoplasm of the ureter.</obo:IAO_0000115>
    <oboInOwl:created_by rdf:datatype="http://www.w3.org/2001/XMLSchema#string">doelkens</oboInOwl:created_by>
    <oboInOwl:creation_date rdf:datatype="http://www.w3.org/2001/XMLSchema#string">2010-12-20T10:35:11Z</oboInOwl:creation_date>
    <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">UMLS:C0041955</oboInOwl:hasDbXref>
    <oboInOwl:hasRelatedSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Neoplasia of the ureters</oboInOwl:hasRelatedSynonym>
    <oboInOwl:hasRelatedSynonym>ureter, cancer of</oboInOwl:hasRelatedSynonym>
    <oboInOwl:id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">HP:0100516</oboInOwl:id>
    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Neoplasm of the ureter</rdfs:label>
</owl:Class>
<owl:Axiom>
    <owl:annotatedSource rdf:resource="http://purl.obolibrary.org/obo/HP_0100516"/>
    <owl:annotatedProperty rdf:resource="http://purl.obolibrary.org/obo/IAO_0000115"/>
    <owl:annotatedTarget rdf:datatype="http://www.w3.org/2001/XMLSchema#string">The presence of a neoplasm of the ureter.</owl:annotatedTarget>
    <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">HPO:probinson</oboInOwl:hasDbXref>
</owl:Axiom>

and I want to convert it to a tab-delimited text file containing only 2 of the XML elements:

Neoplasm of the ureter  The presence of a neoplasm of the ureter

using awk.

The text I need to extract is within these tags:

<obo:IAO_0000115 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">The presence of a neoplasm of the ureter.</obo:IAO_0000115>

and

<rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Neoplasm of the ureter</rdfs:label>

and the awk script I plan to use:

BEGIN{RS="//"}
{
  match($0, regex1 , a)
  match($0, regex2, b)
  print a[1], "\t", b[1]
}

What's the best way to use regex to obtain the text inside the xml elements?

NOTE: This approach has been very useful and demonstrates that awk can be used to extract XML text from complex XML/RDF structures.

The final awk script used, thanks to @RavinderSingh13:

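awk '
# Definition line: keep the text between the first ">" and the last "<".
# The three-argument match() is a GNU awk extension.
/obo:IAO_0000115 rdf:datatype/ && match($0,/>.*</,a){
  gsub(/^>|<$/,"",a[0])
}
# Label line: extract its text the same way, then print label<TAB>definition.
/rdfs:label rdf:datatype/ && match($0,/>.*</,b){
  gsub(/^>|<$/,"",b[0])
  print b[0]"\t"a[0]
}
'  file.xml > output.txt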

Could you please try the following, written for your shown samples only. Note that awk is not an ideal tool for XML parsing; since the OP specifically said no other tools can be used, this answer goes with awk.

awk '
(/obo:IAO_0000115 rdf:datatype/ || /rdfs:label rdf:datatype/) && match($0,/>.*</){
  print substr($0,RSTART+1,RLENGTH-2)
}
'  Input_file

Explanation: a detailed explanation of the above.

awk '                                         ####Start of the awk program.
(/obo:IAO_0000115 rdf:datatype/ || /rdfs:label rdf:datatype/) && match($0,/>.*</){    ####If the line contains obo:IAO_0000115 rdf:datatype OR rdfs:label rdf:datatype, AND match() finds text running from the first > to the last < on the line.
  print substr($0,RSTART+1,RLENGTH-2)         ####Print the matched text without the surrounding > and <: start one character after RSTART and take RLENGTH-2 characters. match() sets RSTART and RLENGTH whenever it finds a match.
}
'  Input_file                                 ####Name of the input file.

From man awk:

RSTART: The index of the first character matched by match(); 0 if no match. (This implies that character indices start at one.)

RLENGTH: The length of the string matched by match(); -1 if no match.
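To make the RSTART/RLENGTH arithmetic concrete, here is a minimal sketch on a tiny made-up line. The function name between_tags is only an illustrative helper, not something awk provides. It uses the two-argument match(), so it should not need GNU awk.

```shell
# between_tags is a hypothetical helper name, not part of awk.
between_tags() {
  awk 'match($0, />.*</) {
    # RSTART is the position of ">", and RLENGTH counts ">" and "<" too,
    # so start one character later and take two characters fewer.
    print RSTART, RLENGTH, substr($0, RSTART + 1, RLENGTH - 2)
  }'
}

printf '%s\n' '<a>hi</a>' | between_tags   # prints: 3 4 hi
```

In `<a>hi</a>` the `>` is the 3rd character and the matched string `>hi<` is 4 characters long, which is why RSTART+1 and RLENGTH-2 leave just `hi`.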



EDIT: Adding one more solution per the OP's comment, in case someone wants to store the results of the two searches in two different arrays. Written and tested in GNU awk (the three-argument form of match() used here is a GNU extension).

awk '
/obo:IAO_0000115 rdf:datatype/ && match($0,/>.*</,a){
  gsub(/^>|<$/,"",a[0])
  print a[0]
}
/rdfs:label rdf:datatype/ && match($0,/>.*</,b){
  gsub(/^>|<$/,"",b[0])
  print b[0]
}
'  Input_file
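If the goal is one tab-separated line per class, a possible variation is to collect both values and print them when the class closes. This is only a sketch under some assumptions: each element sits on a single line as in the shown sample, owl:Class blocks are not nested, and the function name owl_to_tsv and the second sample class are made up for illustration. Resetting the variables at each opening owl:Class keeps a class without a definition from reusing the previous definition.

```shell
# owl_to_tsv is a hypothetical helper name; it reads files given as
# arguments, or standard input when none are given.
owl_to_tsv() {
  awk '
  # Opening tag of a class: forget values left over from the previous class.
  /<owl:Class[ >]/ { label = ""; def = "" }
  # Definition and label lines: keep the text between ">" and the last "<".
  /<obo:IAO_0000115[ >]/ && match($0, />.*</) { def = substr($0, RSTART + 1, RLENGTH - 2) }
  /<rdfs:label[ >]/ && match($0, />.*</)      { label = substr($0, RSTART + 1, RLENGTH - 2) }
  # Closing tag: emit label<TAB>definition if a label was seen.
  /<\/owl:Class>/ { if (label != "") print label "\t" def }
  ' "$@"
}

# Tiny sample; the second class is invented and has no definition.
owl_to_tsv <<'EOF'
<owl:Class rdf:about="http://purl.obolibrary.org/obo/HP_0100516">
    <obo:IAO_0000115 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">The presence of a neoplasm of the ureter.</obo:IAO_0000115>
    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Neoplasm of the ureter</rdfs:label>
</owl:Class>
<owl:Class rdf:about="http://example.org/NO_DEFINITION">
    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Label only</rdfs:label>
</owl:Class>
EOF
```

On a real file this would be called as owl_to_tsv file.xml > output.txt. It uses only the two-argument match(), so it should also run in awk implementations other than GNU awk.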
