Extracting text from XML tags with sed in a shell script

Question

I have written a script that takes an XML file as input and extracts the text of specific XML tags, and it works. However, it is not smart enough to handle multi-line text or special characters. It is very important that the text format is kept intact, exactly as it is defined inside the tags.

Below is the XML input:

<nick>Deminem</nick>
<company>XYZ Solutions</company>
<description>
  /**
   * 
   *  «Lorem» ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy
   *  tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. 
   *  At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd 
   *  no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit 
   *  consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore
   *  magna aliquyam erat, sed diam voluptua.
   *
   **/
</description>

The script below extracts the text of each specified tag and assigns it to a new valueArray. My command of sed is basic, but I am always willing to go the extra mile.

tagsArray=( nick company description )
noOfElements=${#tagsArray[@]}

for (( i=0; i<noOfElements; i++ )); do
    # Only matches a tag whose opening tag, text, and closing tag share one line
    OUT=$(grep "${tagsArray[i]}" filename.xml | tr -d '\t' | sed -e 's/^<.*>\([^<].*\)<.*>$/\1/')
    valueArray[i]=${OUT}
done

Answer 1

Parsing XML with regular expressions eventually leads to trouble, just as you have experienced. Take the time to learn enough XSL (there are many tutorials) to transform the XML properly, using for example xsltproc.

Edit:

After trying out a few command-line XML utilities, I think xmlstarlet could be the tool for you. The following is untested, and assumes that filename.xml is a proper XML file (i.e. it has a single root element).

tagsArray=( nick company description )
noOfElements=${#tagsArray[@]}

for (( i=0; i<noOfElements; i++ )); do
    # /root assumes the document's root element is named "root"
    valueArray[i]=$(xmlstarlet sel -t -v "/root/${tagsArray[i]}" filename.xml)
done
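
One Bash detail the loop above depends on: an array element is only selected when the subscript sits inside braces, as in ${tagsArray[i]}, and an assignment takes no spaces around the = sign. The sketch below is only an illustration (wrong and right are made-up variable names); it builds the XPath string both ways so the difference is visible without running xmlstarlet.

```bash
#!/usr/bin/env bash
tagsArray=( nick company description )
i=1

# Without braces, Bash expands $tagsArray (element 0) and keeps "[i]" as literal text.
wrong="/root/$tagsArray[i]"

# With braces, the subscript is evaluated and element i is selected.
right="/root/${tagsArray[i]}"

printf '%s\n' "$wrong" "$right"
```

This prints /root/nick[i] followed by /root/company, so the unbraced form would query the same wrong path on every iteration.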

Answer 2

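#!/bin/sh
filePath=$1  # XML file path
tagName=$2   # Name of the tag whose values to fetch
# A multi-character (regex) RS needs gawk, mawk, or another awk that supports it
awk -v RS="<${tagName}>|</${tagName}>" '!/<.*>/' "$filePath"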
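
A variation on the same record-separator idea, offered as a hedged sketch rather than a definitive implementation. Instead of filtering out records that look like markup, it prints every even-numbered record, so text that itself contains angle brackets is not dropped. The helper name extract_tag is hypothetical, the sample file wraps the question's elements in an assumed <root> element with a shortened description, and the awk in use is assumed to accept a regular expression as RS.

```bash
#!/bin/sh
# extract_tag TAG FILE (hypothetical helper): print the text between <TAG> and </TAG>.
# Assumptions: awk accepts a regex RS (gawk, mawk, BusyBox awk), the tag has
# no attributes, and the file does not begin with <TAG> itself.
extract_tag() {
    # RS splits the input at every opening or closing tag, so records alternate
    # outside, inside, outside, ...; the even-numbered ones are the tag's text.
    awk -v RS="<$1>|</$1>" 'NR % 2 == 0 { printf "%s", $0 }' "$2"
}

# Sample input: the question's elements inside an assumed <root> wrapper.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<root>
<nick>Deminem</nick>
<company>XYZ Solutions</company>
<description>
  /**
   *  «Lorem» ipsum dolor sit amet
   **/
</description>
</root>
EOF

nick=$(extract_tag nick "$sample")
company=$(extract_tag company "$sample")
description=$(extract_tag description "$sample")
rm -f "$sample"

printf '%s\n' "$nick" "$company" "$description"
```

The leading newline and the indentation of the description are kept as written in the file; only trailing newlines are dropped, and that is done by the command substitution rather than by awk.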

Answer 1: 3, 2011-04-27 19:11:33
Answer 2: 0, 2012-04-19 05:39:51
