BASH script to iterate through a list of IDs in an XML file and print/output the name to the shell or an output file?

Question

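I am looking to iterate through a list of ID numbers, match them against the ID numbers in an XML file, and print the line that follows each match to the shell, or redirect it to a third file (output.txt), using BASH (and AWK).

Here is the breakdown:

ID_list.txt (shortened for this example; the real file has 100 IDs)

XML_example.txt (thousands of entries)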

<book>
  <ID>4414</ID>
  <name>Name of first book</name>
</book>
<book>
  <ID>4561</ID>
  <name>Name of second book</name>
</book>

I would like the output of the script to be the names for the 100 IDs in the first file:

Name of first book
Name of second book
etc

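I believe this can be done with BASH and AWK using a for loop (for each ID in file 1, find the corresponding name in file 2). I think you could GREP for the ID number and then use AWK to print the line below it. Even if the output looks like this, I can remove the XML tags afterwards: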

<name>Name of first book</name>
<name>Name of second book</name>

This is on a Linux server, but I could port it to PowerShell on Windows. I think BASH/GREP and AWK are the way to go.

Can someone help me script this?

Answer 1

Given an ID, you can get the name using an XPath expression and the xmllint command, like this:

id=4414
name=$(xmllint --xpath "string(//book[ID[text()='$id']]/name)" books.xml)

So you could write something like:

while IFS= read -r id; do
    name=$(xmllint --xpath "string(//book[ID[text()='$id']]/name)" books.xml)
    echo "$name"
done < ID_list.txt

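Unlike solutions involving awk, grep, and friends, this uses an actual XML parsing tool. This means that while most other solutions might break when they encounter something like: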

<book><ID>4561</ID><name>Name of second book</name></book>

...this will work just fine.

xmllint is part of the libxml2 package and is available on most distributions.

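Note also that GNU awk can parse XML natively, although as far as I know this comes from the separate gawk-xml extension rather than from awk itself.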
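One practical wrinkle: if the real file looks exactly like the excerpt in the question, it has several top-level <book> elements and no single root, which an XML parser will reject as not well-formed. The sketch below is one possible workaround, not part of the original answer. The function names wrap_books and lookup_names and the root element name "books" are made up for illustration, and it assumes the file has no XML declaration and that the IDs contain no quote characters.

```shell
# Hypothetical helper: wrap the <book> fragments in one root element so the
# result is well-formed XML. The root name "books" is an arbitrary choice.
wrap_books() {
    printf '<books>\n'
    cat "$1"
    printf '</books>\n'
}

# Hypothetical helper: lookup_names ID_LIST XML_FILE
# Prints the name for each ID listed (one per line) in ID_LIST.
lookup_names() {
    while IFS= read -r id; do
        [ -n "$id" ] || continue    # skip blank lines
        name=$(xmllint --xpath "string(//book[ID[text()='$id']]/name)" "$2")
        if [ -n "$name" ]; then     # IDs without a match are skipped silently
            printf '%s\n' "$name"
        fi
    done < "$1"
}
```

Possible usage: wrap_books XML_example.txt > books.xml, followed by lookup_names ID_list.txt books.xml > output.txt. Each ID triggers a fresh parse of the whole file, which should be acceptable for 100 IDs but would scale poorly for much longer lists.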

Answer 2

$ awk '
NR==FNR{ ids["<ID>" $0 "</ID>"]; next }
found { gsub(/^.*<name>|<[/]name>.*$/,""); print; found=0 }
$1 in ids { found=1 }
' ID_list.txt XML_example.txt
Name of first book
Name of second book
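Answer 1 points out that line-oriented solutions can break when a whole <book> sits on one line. As a rough sketch of a middle ground (not part of the original answer), the awk variant below remembers the most recent ID and prints a name whenever that ID is in the list, whether or not the two tags share a line. The function name names_for_ids is made up for illustration. The sketch assumes at most one book per line, that <ID> precedes <name> within each book, and that the ID list is not empty.

```shell
# Hypothetical helper: names_for_ids ID_LIST XML_FILE
names_for_ids() {
    awk '
    # First file: remember every non-empty ID.
    NR == FNR { if ($0 != "") ids[$0]; next }
    {
        # Remember the ID of the book currently being read.
        if (match($0, /<ID>[^<]*<\/ID>/))
            cur = substr($0, RSTART + 4, RLENGTH - 9)
        # Print the name if it belongs to a wanted ID.
        if (match($0, /<name>[^<]*<\/name>/) && (cur in ids))
            print substr($0, RSTART + 6, RLENGTH - 13)
        # Forget the ID once the book is closed.
        if ($0 ~ /<\/book>/)
            cur = ""
    }' "$1" "$2"
}
```

Possible usage: names_for_ids ID_list.txt XML_example.txt > output.txt. This is still pattern matching rather than XML parsing, so comments, CDATA sections, or attributes on the tags would defeat it.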

Answer 3

Here is one way:

while IFS= read -r id
do
    grep -A1 "<ID>$id</ID>" XML_example.txt | grep "<name>"
done < ID_list.txt

Here is another way (a one-liner). It is more efficient because it uses a single grep to extract all the IDs instead of looping:

grep -E -A1 "$(sed -e 's/^/<ID>/' -e 's/$/<\/ID>/' ID_list.txt | sed -e :a -e '$!N;s/\n/|/;ta')" XML_example.txt | grep "<name>"

Output:

<name>Name of first book</name>
<name>Name of second book</name>
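Since the asker wants bare names, possibly in output.txt, the following sketch strips the tags in the same pipeline. It is a variation on this answer rather than part of it: the IDs are turned into fixed-string patterns and handed to grep through -f, which avoids building one long alternation on the command line. The function name names_via_grep is hypothetical. The sketch assumes GNU grep (for -A and for reading patterns from standard input with -f -) and that each <name> line directly follows its <ID> line.

```shell
# Hypothetical helper: names_via_grep ID_LIST XML_FILE
# 1. sed turns each ID into the literal pattern <ID>...</ID>.
# 2. grep matches those patterns as fixed strings and keeps one line of context.
# 3. sed keeps only the text between <name> and </name>.
names_via_grep() {
    sed 's|.*|<ID>&</ID>|' "$1" |
        grep -F -A1 -f - "$2" |
        sed -n 's|.*<name>\(.*\)</name>.*|\1|p'
}
```

Possible usage: names_via_grep ID_list.txt XML_example.txt > output.txt. Note that the names come out in the order they appear in the XML file, not in the order of the ID list.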

Answer 4

If it has to be done in bash, I would go the BASH_REMATCH route:

 BASH_REMATCH
          An  array  variable  whose members are assigned by the =~ binary
          operator to the [[ conditional command.  The element with  index
          0  is  the  portion  of  the  string matching the entire regular
          expression.  The element with index n  is  the  portion  of  the
          string matching the nth parenthesized subexpression.  This
          variable is read-only.

So something like the following:

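#!/bin/bash

# Load the wanted IDs into an associative array (requires bash 4+).
declare -A wanted
while IFS= read -r id; do
  [[ $id ]] && wanted[$id]=1
done < "ID_list.txt"

print=
while IFS= read -r line; do
  # Print the name only if the previous line held a wanted ID.
  [[ $print ]] && [[ $line =~ "<name>"(.*)"</name>" ]] && echo "${BASH_REMATCH[1]}"

  if [[ $line =~ "<ID>"([^\<]+)"</ID>" && ${wanted[${BASH_REMATCH[1]}]} ]]; then
    print=:
  else
    print=
  fi
done < "XML_example.txt"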

Example output:

> abovescript
Name of first book
Name of second book
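To show the capture mechanism on its own, here is a small sketch (the function name extract_name is hypothetical, not from the answer). It relies on the bash-specific [[ ... =~ ... ]] construct, so it will not run under a plain POSIX sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the text between <name> and </name>, if any.
extract_name() {
    # Quoted parts of the pattern match literally; (.*) is the first capture group.
    if [[ $1 =~ "<name>"(.*)"</name>" ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}
```

Calling extract_name '  <name>Name of first book</name>' prints Name of first book, while a line without a <name> element prints nothing and returns a non-zero status.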

Answer 1: 3 votes, 2014-01-21 18:00:27
Answer 2: 1 vote, 2014-01-21 17:56:22
Answer 3: 1 vote, accepted, 2014-01-21 17:56:45
Answer 4: 0 votes, 2014-01-22 10:29:17
