Converting XML into a pipe-delimited file using bash

How can I remove the entry tag and convert this XML into a pipe-delimited file?

<entry><company>ABC</company><appname>XYZ</appname><appid>12345678</appid><updated>2014-04-29T20:58:00-07:00</updated><msgid>923605123</msgid><title>Crash</title><content type="text">Whenever you try to use the graph function.  I expect better from Schwab</content><version>4.1.3.6</version><rating>1</rating></entry>

Expected output format:

ABC|XYZ|12345678|2014-04-29T20:58:00-07:00|923605123|Crash|Whenever you try to use the graph function.  I expect better from Schwab|4.1.3.6|1|

Consider something akin to the following:

xmlstarlet sel -t -m '//entry' \
  -v ./company -o '|' \
  -v ./appname -o '|' \
  -v ./appid   -o '|' \
  -v ./content -n     \
  <test.xml

It would be possible to write a query that doesn't spell out each column in turn, but writing it out is the better approach: it ensures that column 3 in every line (in this case) always means appid, a guarantee you would not otherwise have.
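For the sample entry in the question, spelling out every column might look like the sketch below. The function name entries_to_psv is hypothetical, the element order is taken from the sample entry, and the trailing -o '|' is there only to match the trailing pipe in the expected output. The demo is skipped when xmlstarlet is not installed.

```shell
#!/usr/bin/env bash
# entries_to_psv is a hypothetical helper name, not part of XMLStarlet.
# It prints one pipe-delimited line per <entry>, with a fixed column order.
# Input comes from the files given as arguments, or from stdin if there are none.
entries_to_psv() {
  xmlstarlet sel -t -m '//entry' \
    -v ./company -o '|' \
    -v ./appname -o '|' \
    -v ./appid   -o '|' \
    -v ./updated -o '|' \
    -v ./msgid   -o '|' \
    -v ./title   -o '|' \
    -v ./content -o '|' \
    -v ./version -o '|' \
    -v ./rating  -o '|' -n \
    "$@"
}

# The sample entry from the question.
sample='<entry><company>ABC</company><appname>XYZ</appname><appid>12345678</appid><updated>2014-04-29T20:58:00-07:00</updated><msgid>923605123</msgid><title>Crash</title><content type="text">Whenever you try to use the graph function.  I expect better from Schwab</content><version>4.1.3.6</version><rating>1</rating></entry>'

if command -v xmlstarlet >/dev/null 2>&1; then
  printf '%s\n' "$sample" | entries_to_psv
else
  echo 'xmlstarlet not found; skipping demo' >&2
fi
```

An element that is missing from an entry simply yields an empty column here, so the remaining columns keep their positions.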

Note that XMLStarlet, like many compliant parsers, requires a well-formed XML document, meaning the document being processed must have a single root element. If what you have is a file that contains a stream of documents (no root element in which the entries are contained), this can be faked; one ugly but functional way to do this follows: xmlstarlet ... < <(echo "<root>"; cat test.xml; echo "</root>")
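The same wrapping trick can be kept in a small function and used in a pipeline instead of process substitution. This is only a sketch: wrap_root is a hypothetical helper name, and it assumes the input has no XML declaration (<?xml ...?>), which would no longer be at the start of the document once wrapped.

```shell
#!/usr/bin/env bash
# wrap_root is a hypothetical helper name, not part of XMLStarlet.
# It emits its input (files given as arguments, or stdin) inside one <root> element.
wrap_root() {
  printf '%s\n' '<root>'
  cat -- "$@"
  printf '%s\n' '</root>'
}

# Demo: two root-less entries on stdin.
printf '%s\n' \
  '<entry><company>ABC</company></entry>' \
  '<entry><company>DEF</company></entry>' | wrap_root

# With XMLStarlet installed, the wrapped stream can be piped straight in:
#   wrap_root test.xml | xmlstarlet sel -t -m '//entry' -v ./company -n
```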

sed 's/<[^>]*>/|/g;s/||*/|/g' file1 > file2

Edited to remove adjacent "||" pairs.
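Run against the sample entry, the sed command above leaves a leading | in front of the first column, because the opening tags are also turned into delimiters. A minimal sketch that drops it while keeping the trailing | from the expected output follows; tags_to_pipes is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# tags_to_pipes is a hypothetical helper name.
# It reads the files given as arguments, or stdin if there are none.
tags_to_pipes() {
  sed -e 's/<[^>]*>/|/g' \
      -e 's/||*/|/g' \
      -e 's/^|//' "$@"
  # 1st expression: replace every tag with a pipe
  # 2nd expression: collapse runs of pipes into one
  # 3rd expression: drop the pipe left by the opening tags at line start
}

sample='<entry><company>ABC</company><appname>XYZ</appname><appid>12345678</appid></entry>'
printf '%s\n' "$sample" | tags_to_pipes
# prints: ABC|XYZ|12345678|
```

Adding -e 's/|$//' would remove the trailing pipe as well. Because this approach treats the XML as plain text, an empty element such as <title></title> collapses into a single delimiter and shifts the later columns, and entities such as &amp; are not decoded.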

awk '$1 {printf s++ ? "|" $0 : $0}' RS='<[^>]+>'
  • set the record separator (RS) to a tag, for example <entry>
  • only print "lines" that contain a field, in other words don't print the tags
  • on the second "line" or later, print a | first; otherwise just print the "line"

Result

ABC|XYZ|12345678|2014-04-29T20:58:00-07:00|923605123|Crash|Whenever you try to use the graph function.  I expect better from Schwab|4.1.3.6|1
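Two details of the awk one-liner above are worth guarding against: it passes the data to printf as the format string, so a % in the text would be read as a format specifier, and the $1 pattern is false for a field whose value is 0 (a rating of 0, say). A slightly more defensive sketch follows; tags_to_row is a hypothetical helper name, and it assumes an awk that accepts a regular expression as RS, as gawk and mawk do.

```shell
#!/usr/bin/env bash
# tags_to_row is a hypothetical helper name.
# Assumes an awk that treats RS as a regular expression (gawk and mawk do).
tags_to_row() {
  awk -v RS='<[^>]+>' '
    NF {                         # skip records that are empty or only whitespace
      printf "%s%s", sep, $0     # data is an argument here, never the format
      sep = "|"                  # every later record is preceded by a pipe
    }
    END { if (sep != "") print "" }   # end the row with a newline
  ' "$@"
}

printf '%s\n' '<entry><rating>0</rating><title>50% off</title></entry>' | tags_to_row
# prints: 0|50% off
```

Like the original, this joins everything in the input into a single row and skips empty elements, so it suits one entry per invocation rather than a file holding many entries.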

With xidel:

xidel -s input.xml -e 'join(entry/*,"|")'
