Extracting text between two XML tags using sed

Question

I have an XML file similar to the following:

<?xml version="1.0" encoding="UTF-8"?>
<OnlineCommentary>
    <doc docid="cnn_210085_comment002" articleURL="http://www.cnn.com/News.asp?NewsID=210085" date="10/07/2010" time="00:21" subtitle="Is Justin Bieber getting special treatment?" author="Zorro75">
        <seg id="1"> They are the same thing. Let's shoot them both. </seg>
    </doc>
    <doc docid="cnn_210092_comment004" articleURL="http://www.cnn.com/News.asp?NewsID=210092" date="06/04/2010" time="17:07" subtitle="Dear Chicago, we love you despite it all" author="MRL1313">
        <seg id="1"> We can't wait for you to move back either. </seg>
        <seg id="2"> You seem quite uptight. </seg>
        <seg id="3"> Does your wife (who is also your sister) not give it up any more? </seg>
    </doc>
</OnlineCommentary>

I would like to run a command on this file to extract only the content between the opening tag <seg ...> and the closing tag </seg>.

I tried:

sed -n 's:.*<seg id="1">\(.*\)</seg>.*:\1:p' XML-file.xml > output.txt

My questions are the following:

-- How can I print all <seg id="*"> elements? My command prints only the content of the first tag (<seg id="*">).

-- Is there a way to print <seg id="1">, <seg id="2">, <seg id="3"> on the same line, while the tag that includes only <seg id="1"> is printed on a separate line?

Answer 1

Use a proper XML handling tool. For example, in XML::XSH2:

open file.xml ;
for //doc echo seg/text() ;
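
If XML::XSH2 is not available, roughly the same idea can be sketched with Python's standard xml.etree.ElementTree module. This is only an illustration, not part of the original answer: the function name segs_per_doc and the shortened SAMPLE document are made up for the example, and it assumes each <seg> sits directly under its <doc>. It prints the segments of each <doc> on one line, which is what the second question asks for.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample in the same shape as the question's file.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<OnlineCommentary>
    <doc docid="a">
        <seg id="1"> They are the same thing. </seg>
    </doc>
    <doc docid="b">
        <seg id="1"> We can't wait. </seg>
        <seg id="2"> You seem quite uptight. </seg>
    </doc>
</OnlineCommentary>
"""


def segs_per_doc(xml_text):
    """Return one string per <doc>, joining the text of its <seg> children."""
    # Parse from bytes so the encoding declaration in the prolog is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    lines = []
    for doc in root.iter("doc"):
        # seg.text is None for an empty element, hence the "or ''" fallback.
        texts = [(seg.text or "").strip() for seg in doc.findall("seg")]
        lines.append(" ".join(texts))
    return lines


if __name__ == "__main__":
    for line in segs_per_doc(SAMPLE):
        print(line)
```

To run it against a real file instead of the sample string, ET.parse("XML-file.xml").getroot() can replace the ET.fromstring call.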

Answer 2

Print all the <seg id=...> elements (one per line), including the <seg ...> tag:

sed -n 's:.*\(<seg id="[0-9]\{1,\}">.*</seg>\).*:\1:p' XML-file.xml > output.txt

Print everything on one line, separated by commas. Instead of printing each match, append it to the hold buffer; at the end, recall the buffer, replace each newline with a comma (and remove the leading comma left by the append action), and print the result:

sed -n '\:.*\(<seg id="[0-9]\{1,\}">.*</seg>\).*:  { s//\1/
   H
   }
$ {g
   s/\n/,/g;s/^,//
   p
   }' XML-file.xml > output.txt
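
For readers less familiar with sed's hold space, the same collect-then-join logic can be sketched in Python. This is an illustrative analogue under stated assumptions, not a replacement for the command above: the names SEG_RE and join_segs are invented here, and the sketch assumes at most one <seg> element per input line, as in the sample file.

```python
import re

# Same pattern as the sed address, written in Python regex syntax.
SEG_RE = re.compile(r'<seg id="[0-9]+">.*</seg>')


def join_segs(lines):
    """Collect every matching <seg ...>...</seg> and join them with commas."""
    held = []  # plays the role of sed's hold space
    for line in lines:
        match = SEG_RE.search(line)
        if match:
            held.append(match.group(0))  # like "H": append to the hold space
    # Joining with "," avoids the leading separator that sed has to strip.
    return ",".join(held)


if __name__ == "__main__":
    sample = [
        '<doc docid="x">',
        '    <seg id="1"> a </seg>',
        '    <seg id="2"> b </seg>',
        '</doc>',
    ]
    print(join_segs(sample))
```

Because the list is joined only once at the end, there is no counterpart to the s/^,// cleanup step that the sed version needs.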

That said, @Choroba's advice to use adequate XML tools is very good: it minimizes the risk of processing unwanted data in the file.

Answer 1
1 2014-09-19 09:10:22

Answer 2
1 accepted 2014-09-19 09:39:07
