Search for substring in string using Bash?

How can I extract the currency1 field in the following string:

<fxQuotation><currency1>USD</currency1><currency2>AUD</currency2>

The result should be USD.

The command below works:

echo "<fxQuotation><currency1>USD</currency1><currency2>AUD</currency2>" | cut -d">" -f3 | cut -d"<" -f1

However, if that string were a substring of a very big XML file, my command would no longer work, because the field positions would change. How can I search based on the currency1 field?

This is very easy with xidel:

xidel file.xml --extract "//currency1" -q

or

xidel file.xml --xpath "//currency1" -q

Both forms also work on badly formatted XML, on HTML, and on XML mixed with plain text.

It's better to use an XML parser or an XML query language than regexes and bash commands.

For Java, see the DOM, SAX, and StAX based XML parsers. DOM loads the whole document into memory as a tree, so access is fast but memory use is high. SAX and StAX cope much better with large files because they stream the document as a sequence of events: SAX pushes events to handlers that you write, while StAX lets your code pull the next event when it is ready for it.
Woodstox is a good, efficient, and fairly configurable StAX parser. More info:
https://www.javacodegeeks.com/2013/05/parsing-xml-using-dom-sax-and-stax-parser-in-java.html
http://www.studytrails.com/java/xml/woodstox/java-xml-stax-woodstox-basic-parsing.jsp

You can also query XML with XQuery, which has an SQL-like syntax; XPath is another language for selecting the data you need.

http://www.w3schools.com/xsl/xpath_intro.asp
http://www.w3schools.com/xsl/xquery_intro.asp
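
To make the XPath suggestion concrete from the shell, here is a minimal sketch that assumes xmllint (shipped with libxml2) is installed. xmllint is not mentioned in the answer itself, and the function name first_currency1 is made up for illustration. It expects well-formed XML, so the demo closes the fxQuotation element that the question's fragment leaves open.

```shell
#!/usr/bin/env bash

# Print the text of the first <currency1> element in a well-formed XML file.
# Assumption: xmllint (libxml2) is available; first_currency1 is a hypothetical name.
first_currency1() {
    # string(...) returns the text of the first node the XPath expression selects.
    xmllint --xpath 'string(//currency1)' "$1"
}

# Small demo, run only when xmllint is actually installed.
if command -v xmllint >/dev/null 2>&1; then
    tmp=$(mktemp)
    printf '%s\n' '<fxQuotation><currency1>USD</currency1><currency2>AUD</currency2></fxQuotation>' > "$tmp"
    first_currency1 "$tmp"
    rm -f "$tmp"
fi
```

Because a real parser does the work, the element can sit anywhere in the file and may carry attributes or span several lines.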

But if you still insist on using bash tools, grep for the element with the -o option, which prints only the part of each line that matches the regex, so you get the tag together with its content. Then strip the tags with cut, sed, or any other tool:

$ cat file1
text text abcd
cxyz
xyz

</rootelement>
<abcd>
<xyz><fxQuotation><currency1>USD</currency1><currency2>AUD</currency2></fxQuotation></xyz>
</abcd>
</rootelement>
$ grep -Eo '<currency1>[^<]*</currency1>' file1
<currency1>USD</currency1>
$ grep -Eo '<currency1>[^<]*</currency1>' file1 | sed -E 's/<[^>]*>//g'
USD
$ grep -oP '(?<=<currency1>)[^<]*(?=</currency1>)' file1
USD
$
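
If the fragment is already held in a shell variable, as in the question, Bash's own regex matching can pull out the value without starting grep or sed. This is only a sketch and not part of the answer above: the function name extract_tag is hypothetical, and like the grep version it assumes a simple element with no attributes and no nested tags.

```shell
#!/usr/bin/env bash

# extract_tag TAG STRING
# Print the text between the first <TAG> and </TAG> in STRING.
# Hypothetical helper; assumes no attributes and no nested markup.
extract_tag() {
    local tag=$1 text=$2
    # Keep the pattern in a variable so Bash treats it as a regex, not a literal.
    local re="<${tag}>([^<]*)</${tag}>"
    if [[ $text =~ $re ]]; then
        # BASH_REMATCH[1] holds the first parenthesised group.
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        return 1   # tag not found
    fi
}

xml='<fxQuotation><currency1>USD</currency1><currency2>AUD</currency2>'
extract_tag currency1 "$xml"
```

The function returns a non-zero status when the tag is missing, so a caller can branch on it with if or ||.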

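You would be better off using a small custom program in C or Python, but 'awk' and 'sed' are old tools that can provide simple solutions in shell scripts: see "Printing XML elements using AWK". Above all, make sure your input is raw and well-formed.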
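
As a rough illustration of the awk route (this is not the code from the linked answer), the sketch below splits each line on the opening or closing currency1 tag and prints the text in between. It assumes each element sits entirely on one line and has no attributes; the function name currency1_awk is made up for the example.

```shell
# Print the content of every <currency1> element that fits on a single line.
# Reads the files given as arguments, or stdin when there are none.
currency1_awk() {
    awk -F'</?currency1>' '
        # With the tags as field separators, the text inside each element
        # lands in the even-numbered fields ($2, $4, ...).
        NF > 2 { for (i = 2; i < NF; i += 2) print $i }
    ' "$@"
}

printf '%s\n' '<xyz><fxQuotation><currency1>USD</currency1><currency2>AUD</currency2></fxQuotation></xyz>' | currency1_awk
```

Lines without the tag have a single field and are skipped, so the surrounding text in a large file does not affect the result.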
