
Converting two fields of a table in an XML file into CSV using xmllint in bash?

I've got an XML file (converted from HTML) containing fields like this:

<tr>
  <td data-title="Date">2018-01-01</td>
  <td data-title="Version"><a href="https://some-link">25.1</a></td>
</tr>
<tr>
  <td data-title="Date">2018-03-01</td>
  <td data-title="Version"><a href="https://some-link">24.1</a></td>
</tr>

I've been using 'xmllint' to extract single values:

textarea=$(echo "$xml" | xmllint --xpath 'string(//*[@id="content"])' - 2>/dev/null )

and multiple values:

list=$(echo "$xml" | xmllint --xpath 'string(/html/body/div/ul)' - 2>/dev/null )

but now I want to extract two fields from each record, in CSV format or something similar.

The closest I've got is this:

xpath tr/*[@data-title="Date" or @data-title="Version"]/text()
Object is a Node Set :
Set contains 20 nodes:
1  TEXT
    content=Apr 9, 2018 6:13 PM UTC
2  TEXT
    content=Mar 21, 2018 10:41 PM UTC
3  TEXT
    content=Mar 19, 2018 9:22 PM UTC

Can you show me a way to achieve this with a better xpath?

Here is one way to do it with xmllint:

xmllint --html --xpath '//tr/td[@data-title="Date"] | //tr/td[@data-title="Version"]' test.html | sed -re 's%(</[^>]+>)%\1\n%g'

Output:

<td data-title="Date">2018-01-01</td>
<td data-title="Version"><a href="https://some-link">25.1</a></td>
<td data-title="Date">2018-03-01</td>
<td data-title="Version"><a href="https://some-link">24.1</a></td>
  • Add the --html option to tell xmllint that the input is HTML.
  • Start the XPath with // so that it matches tr elements anywhere in the document. Your XPath has no leading slash, so it is evaluated relative to the current node; in the xmllint shell, the current node depends on where you last moved with the cd command.
  • Finally, use the | (union) operator to combine two or more XPath expressions.
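
The command above prints the matching td elements rather than CSV. One possible way to finish the job is to strip the tags and join every two lines into one record. The following is only a sketch: to_csv is a hypothetical helper name, the cells variable stands in for the xmllint output shown above, and it assumes every row has exactly one Date cell followed by one Version cell. Fields are wrapped in double quotes because the real dates in the question (for example "Apr 9, 2018 6:13 PM UTC") contain commas.

```shell
# Stand-in for the xmllint output shown above (one element per line).
cells='<td data-title="Date">2018-01-01</td>
<td data-title="Version"><a href="https://some-link">25.1</a></td>
<td data-title="Date">2018-03-01</td>
<td data-title="Version"><a href="https://some-link">24.1</a></td>'

# to_csv (hypothetical helper): reads td elements on stdin, writes CSV records.
# Assumes the cells arrive in Date, Version, Date, Version, ... order.
to_csv() {
  sed -e 's/<[^>]*>//g' \
      -e 's/^[[:space:]]*//' \
      -e 's/[[:space:]]*$//' \
      -e '/^$/d' \
      -e 's/"/""/g' \
      -e 's/.*/"&"/' |
    paste -d, - -
  # sed steps, in order: drop every tag, trim leading and trailing
  # whitespace, delete lines left empty, double any embedded quotes,
  # wrap the field in quotes. paste then joins each pair of lines.
}

csv=$(printf '%s\n' "$cells" | to_csv)
printf '%s\n' "$csv"
```

With real input you would pipe the xmllint command into to_csv instead of using the cells variable. Character entities such as &amp;amp; are left undecoded by this approach, and a row that lacks one of the two cells would shift every following pair, so check the result against the source table.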

