
Using sed to extract element content of an XML file

Using sed, I'm trying to extract everything between <Transport_key> and </Transport_key> from input files like this:

<?xml version="1.0" encoding="utf-8"?>
<Envelope xmlns:xenc="http://www.w3.org/2001/04/xmlenc#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
<Header>
<Security>
<Transport_key>
<EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#">
<EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" />
<CipherData>
<CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue>
</CipherData>
</EncryptedKey>
</Transport_key>
</Security>
</Header>
<Body>
</Body>
</Envelope>

So I want to get:

<EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#">
<EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" />
<CipherData>
<CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue>
</CipherData>
</EncryptedKey>

regardless of any optional newlines between elements. I just want the text between the two strings unmodified, even if the input is a single long line.

I tried:

sed -e "s@.*<Transport_key>\(.*\)</Transport_key>.*@\1@" test.txt

but in the meantime I learned that sed processes its input line by line, so this cannot work.

Is there a solution for that?
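Since the attempt above fails only because sed sees one line at a time, one possible workaround is to append every line to the pattern space before substituting. The following is a minimal sketch, assuming a sed in which `.` matches an embedded newline (GNU sed does). The helper name `slurp_extract` and the trimmed sample file are made up for illustration.

```shell
#!/bin/sh
# Trimmed copy of the question's sample, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<Security>
<Transport_key>
<EncryptedKey Id="TK">
<CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue>
</EncryptedKey>
</Transport_key>
</Security>
EOF

# slurp_extract is a hypothetical helper name, not part of sed.
# ':a' ... 'ba' loops, appending the next line (N) until the last line ($),
# so the final s command sees the entire input in the pattern space.
slurp_extract() {
  sed -n -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' \
    -e 's@.*<Transport_key>[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*</Transport_key>.*@\1@p' "$@"
}

slurp_extract "$sample"                 # multi-line input
tr -d '\n' < "$sample" | slurp_extract  # same input as one long line
```

The `[[:space:]]*` parts only trim the whitespace directly inside the two tags. The leading `.*` is greedy, so with several <Transport_key> elements this sketch would return the last one.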

For your "last try without such ...", a grep approach:

grep -Poz '<Transport_key>\s*\K[\s\S]*(?=</Transport_key>)' test.txt

The output:

<EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#">
<EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" />
<CipherData>
<CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue>
</CipherData>
</EncryptedKey>

For your further, proper attempts, an xmlstarlet approach:

xmlstarlet sel -t -c '//Transport_key/*' -n test.txt

It would be safer to use an XML parser, but in some cases it can also be done with a regex.

perl -0777 -ne 'print for m@<EncryptedKey(?:(?!</EncryptedKey).)*</EncryptedKey>@gs' <test.txt

From perl -h:

  • -0777: specify record separator (octal; 777 is undef, i.e. read the whole file)
  • -n: assume "while (<>) { ... }" loop around program

Modifiers:

  • g: all matches
  • s: . matches \n

Regex:

  • (?!..): negative look-ahead

Via sed, you can try the following:

sed -n '/<Transport_key>/,/<\/Transport_key>/p' test1.xml | sed -e '/Transport_key/d'

The first command prints everything between the Transport_key tags, inclusive. Since this also prints the lines holding the tags themselves, the second command deletes every line containing Transport_key.
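The two steps can probably be folded into a single sed call by skipping the tag lines inside the range itself. This is only a sketch: the helper name `range_extract` and the trimmed sample are made up, and like the pipeline above it works per line, so it prints nothing when the whole document sits on one line.

```shell
#!/bin/sh
# Trimmed copy of the question's sample, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<Security>
<Transport_key>
<EncryptedKey Id="TK">
<CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue>
</EncryptedKey>
</Transport_key>
</Security>
EOF

# range_extract is a hypothetical helper name.
# Within the range, print only the lines that do not carry a Transport_key tag.
range_extract() {
  sed -n '/<Transport_key>/,/<\/Transport_key>/{/Transport_key>/!p;}' "$@"
}

range_extract "$sample"                 # prints the three inner lines
tr -d '\n' < "$sample" | range_extract  # prints nothing: the only line holds the tags
```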

The simplest solution to this particular problem that's independent of white space is to use GNU awk for multi-char RS:

$ gawk -v RS='\\s*</?Transport_key>\\s*' 'NR==2' file
<EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#">
<EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" />
<CipherData>
<CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue>
</CipherData>
</EncryptedKey>

$ tr -d '\n' < file
<?xml version="1.0" encoding="utf-8"?><Envelope xmlns:xenc="http://www.w3.org/2001/04/xmlenc#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ds="http://www.w3.org/2000/09/xmldsig#"><Header><Security><Transport_key><EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#"><EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" /><CipherData><CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue></CipherData></EncryptedKey></Transport_key></Security></Header><Body></Body></Envelope>

$ tr -d '\n' < file | gawk -v RS='\\s*</?Transport_key>\\s*' 'NR==2'
<EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#"><EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" /><CipherData><CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue></CipherData></EncryptedKey>

The reason to use an XML parser, though, is to properly handle cases such as the tag text showing up inside a string.

