
Extract text between two XML tags using sed

I have an XML file similar to the following:

<?xml version="1.0" encoding="UTF-8"?>
    <doc docid="cnn_210085_comment002" articleURL="http://www.cnn.com/News.asp?NewsID=210085" date="10/07/2010" time="00:21" subtitle="Is Justin Bieber getting special treatment?" author="Zorro75">
        <seg id="1"> They are the same thing. Let's shoot them both. </seg>
    <doc docid="cnn_210092_comment004" articleURL="http://www.cnn.com/News.asp?NewsID=210092" date="06/04/2010" time="17:07" subtitle="Dear Chicago, we love you despite it all" author="MRL1313">
        <seg id="1"> We can't wait for you to move back either. </seg>
        <seg id="2"> You seem quite uptight. </seg>
        <seg id="3"> Does your wife (who is also your sister) not give it up any more? </seg>

I would like to run a command on this file that extracts only the content between the opening tag <seg ...> and the closing tag </seg>.

I tried:

sed -n 's:.*<seg id="1">\(.*\)</seg>.*:\1:p' XML-file.xml > output.txt

My questions are the following:

-- How can I print every <seg id="*">? My command prints only the content of the first tag (<seg id="1">).

-- Is there a way to print, for example, <seg id="1">, <seg id="2"> and <seg id="3"> on the same line, while a <doc> that includes only <seg id="1"> is printed on a separate line?

Use a proper XML handling tool. For example, in XML::XSH2:

open file.xml ;
for //doc echo seg/text() ;

Print every <seg id="..."> element (one per line), including the <seg> tags themselves:

sed -n 's:.*\(<seg id="[0-9]\{1,\}">.*</seg>\).*:\1:p' XML-file.xml > output.txt
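If only the text between the tags is wanted, as the question asks, the capture group can be moved inside the tags. A minimal sketch, assuming each <seg> element sits on a single line as in the sample; the wrapper name seg_text is only illustrative, and the extra " *" parts trim the spaces around the text:

```shell
# seg_text FILE: print the text of every <seg id="N"> element, one per line.
# Assumption: one complete <seg>...</seg> element per input line.
# Elements that are empty or contain only spaces are skipped.
seg_text() {
    sed -n 's:.*<seg id="[0-9]\{1,\}"> *\(.*[^ ]\) *</seg>.*:\1:p' "$1"
}
```

Usage would be something like: seg_text XML-file.xml > output.txt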

Print everything on one line, separated by commas. Instead of printing each match, append it to the hold buffer (H); on the last line, recall the buffer (g), replace the newlines with commas, remove the leading comma left by the append action, and print the result:

sed -n '\:.*\(<seg id="[0-9]\{1,\}">.*</seg>\).*: { s//\1/; H; }
$ { g; s/\n/,/g; s/^,//; p; }' XML-file.xml > output.txt
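The second question asks for something slightly different: the segments of each <doc> together on one line, with a new line per <doc>. A rough sketch of one way to do that with awk rather than sed, assuming every <doc ...> opening tag and every <seg> element is on its own line as in the sample; the function name seg_by_doc is hypothetical:

```shell
# seg_by_doc FILE: print one line per <doc>, holding the comma-separated
# text of its <seg> elements (tags and surrounding spaces removed).
# Assumption: at most one <seg>...</seg> element per input line.
seg_by_doc() {
    awk '
        # A new <doc> starts: flush the segments collected so far.
        /<doc / { if (line != "") print line; line = "" }
        # A line containing a <seg> element: keep only its text.
        match($0, /<seg id="[0-9]+">.*<\/seg>/) {
            seg = substr($0, RSTART, RLENGTH)
            sub(/^<seg id="[0-9]+"> */, "", seg)
            sub(/ *<\/seg>$/, "", seg)
            line = (line == "" ? seg : line "," seg)
        }
        # End of input: flush the last <doc>.
        END { if (line != "") print line }
    ' "$1"
}
```

Usage would be something like: seg_by_doc XML-file.xml > output.txt. Since the segment text itself can contain commas, a different separator may be a better choice for real data.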

That said, @Choroba's advice to use a proper XML tool is very good: it minimizes the risk of processing unwanted data in the file.
