Extracting the contents of an XML element with sed

Question

Using sed, I'm trying to extract everything between <Transport_key> and </Transport_key> from input files like this:

<?xml version="1.0" encoding="utf-8"?>
<Envelope xmlns:xenc="http://www.w3.org/2001/04/xmlenc#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
<Header>
<Security>
<Transport_key>
<EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#">
<EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" />
<CipherData>
<CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue>
</CipherData>
</EncryptedKey>
</Transport_key>
</Security>
</Header>
<Body>
</Body>
</Envelope>

So I want to get:

<EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#">
<EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" />
<CipherData>
<CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue>
</CipherData>
</EncryptedKey>

regardless of any optional newlines between elements. I just want the text between the two strings, unmodified, even if the input is one long line.

I tried:

sed -e "s@.*<Transport_key>\(.*\)</Transport_key>.*@\1@" test.txt

but in the meantime I learned that sed processes its input line by line, so this cannot work on multi-line input.

Is there a solution for that?

Answer 1

For your "last try without such ...", a grep approach (GNU grep: -P enables Perl-compatible regexes, -o prints only the match, -z treats the whole file as one record):

grep -Poz '<Transport_key>\s*\K[\s\S]*(?=</Transport_key>)' test.txt

The output:

<EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#">
<EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" />
<CipherData>
<CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue>
</CipherData>
</EncryptedKey>

For your further, proper attempts, an xmlstarlet approach:

xmlstarlet sel -t -c '//Transport_key/*' -n test.txt

Answer 2

It would be safer to use an XML parser, but in some cases it can also be done with a regex.

perl -0777 -ne 'print for m@<EncryptedKey(?:(?!</EncryptedKey).)*</EncryptedKey>@gs' <test.txt

From perl -h:

-0777 : specify record separator (octal; 777 is undef, i.e. read the whole file)
-n : assume "while (<>) { ... }" loop around program

Modifiers:

g: all matches
s: . matches \n

Regex:

(?!..): negative look-ahead; it is checked before each character, so a match cannot run past the first </EncryptedKey

Answer 3

Via sed, you can try the following:

sed -n '/<Transport_key>/,/<\/Transport_key>/p' test1.xml | sed -e '/Transport_key/d'

The first command prints every line from the opening Transport_key tag through the closing one. Since this also prints the lines holding the Transport_key tags themselves, the second command deletes every line that contains Transport_key.
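
Because both commands work on whole lines, the pipeline prints nothing when the entire document sits on one line: that single line contains the tags and is deleted. The sketch below is a sed-only variant that first appends every input line to the pattern space and only then substitutes. It assumes the file fits in memory and holds one Transport_key element; the function name extract_transport_key is hypothetical.

```shell
# Hypothetical helper: slurp the file into sed's pattern space, then keep
# only the text between the Transport_key tags.
#   :a / $!{ N; ba }  append the next line and loop until the last line
#   first s command   keep the text between the tags (leading whitespace dropped)
#   second s command  drop trailing whitespace before printing
extract_transport_key() {
  sed -n '
    :a
    $!{
      N
      ba
    }
    /<Transport_key>.*<\/Transport_key>/{
      s@.*<Transport_key>[[:space:]]*\(.*\)</Transport_key>.*@\1@
      s@[[:space:]]*$@@
      p
    }
  ' "$1"
}
```

Using $!{ ... } rather than a bare N matters for single-line input: with only one line there is nothing to append, and the script has to fall through to the substitution instead of ending early.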

Answer 4

The simplest solution to this particular problem that's independent of white space is to use GNU awk for its multi-char RS:

$ gawk -v RS='\\s*</?Transport_key>\\s*' 'NR==2' file
<EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#">
<EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" />
<CipherData>
<CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue>
</CipherData>
</EncryptedKey>

$ tr -d '\n' < file
<?xml version="1.0" encoding="utf-8"?><Envelope xmlns:xenc="http://www.w3.org/2001/04/xmlenc#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ds="http://www.w3.org/2000/09/xmldsig#"><Header><Security><Transport_key><EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#"><EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" /><CipherData><CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue></CipherData></EncryptedKey></Transport_key></Security></Header><Body></Body></Envelope>

$ tr -d '\n' < file | gawk -v RS='\\s*</?Transport_key>\\s*' 'NR==2'
<EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#"><EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" /><CipherData><CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue></CipherData></EncryptedKey>

The reason to use an XML parser, though, is to properly handle things like the tag text showing up inside a string, a comment, etc.
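
To make that caveat concrete, here is a small sketch with made-up input in which the tag text also appears inside an XML comment. The function name naive_matches and the sample document are illustrative only.

```shell
# Hypothetical helper: a purely textual search for the element.
naive_matches() {
  grep -o '<Transport_key>[^<]*</Transport_key>'
}

# The first occurrence is inside a comment, so it is not a real element,
# but a text match cannot tell the difference and reports both.
printf '<a><!-- <Transport_key>old</Transport_key> --><Transport_key>new</Transport_key></a>\n' |
  naive_matches
# prints:
# <Transport_key>old</Transport_key>
# <Transport_key>new</Transport_key>
```

An XML parser skips the commented-out copy and sees only the real element.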

4 answers

Answer 1: 2, accepted, 2017-07-06 21:16:15
Answer 2: 0, 2017-07-06 20:09:15
Answer 3: 0, 2017-07-06 22:00:50
Answer 4: 0, 2017-07-07 01:36:28
