
Bash Script sed -e

count_items=`curl -u username:password -L "websitelink" | sed -e 's/<\/title>/<\/title>\n/g' | sed -n -e 's/.*<title>\(.*\)<\/title>.*/\1/p' | wc -l`

Above I have a Bash script that extracts the titles from an XML file, but how do I change the regex so that it extracts a title name from a div tag?

Example: extract title out of: <div id="example""><a href="">title</a></div>

I know it's silly to do this in Bash, but I have no choice. Any help would be appreciated.

I suggest using xmlstarlet instead of trying to parse XML with regular expressions.

Parsing XML without a parser is ugly; the SO crowd always strongly recommends against it, and people always insist on doing it anyway. Usually the brute-force, special-case solutions kludged together with the wrong tools fail beyond a certain level of complexity, and then those people are back where they started. You have been warned! ;)

You mention elsewhere that you need to be able to do this on a "plain Linux machine with nothing installed." While you may not find specialized XML parsing tools on every Linux box, these days it's hard to find one that doesn't have Perl installed. Or at least awk. When you hit the limits of what you can do with regular expressions in sed, I recommend going with either awk or perl for a clean, flexible and legible solution. Use of Perl with a "real" Perl XML library would be optimal but in a pinch you can still get a lot done with "out of the box" Perl.
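As a rough illustration of the awk route, here is a minimal sketch. It assumes each div and its anchor sit on the same line, with the anchor immediately after the opening div tag. The function name `div_titles_awk` is made up for this example, and this is still pattern matching rather than real XML parsing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the text of every <a> that directly follows a <div ...> tag.
# Assumption: each <div ...><a ...>text</a> sequence sits on a single input line.
div_titles_awk() {
  awk '{
    line = $0
    # match() sets RSTART and RLENGTH for the leftmost match on the remaining text
    while (match(line, /<div[^>]*><a[^>]*>[^<]*<\/a>/)) {
      hit = substr(line, RSTART, RLENGTH)
      sub(/^<div[^>]*><a[^>]*>/, "", hit)    # strip the opening div and a tags
      sub(/<\/a>$/, "", hit)                 # strip the closing a tag
      print hit
      line = substr(line, RSTART + RLENGTH)  # continue scanning after this match
    }
  }'
}

echo '<div id="example""><a href="">title</a></div>' | div_titles_awk
```

Because the loop keeps scanning the rest of the line, several divs on one line each produce their own output line, which keeps a later `wc -l` count meaningful.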

Just for the single-line example given:

echo '<div id="example""><a href="">title</a></div>' | sed -E -n 's/(.*<div.*<a href="">)([^<]*)(<.*<\/div>.*)/\2/p'
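To fit this back into the counting pipeline from the question, one possible sketch follows the same two-step idea as the original: first split the input so each div ends its own line, then print the anchor text. It assumes GNU sed (for `\n` in the replacement, as the original pipeline already does) and that the anchor immediately follows the opening div tag. The function name `extract_div_titles` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read markup on stdin, print one line per <div ...><a ...>text</a>.
# Assumptions: GNU sed, and each div/anchor pair is not broken across input lines.
# Step 1 breaks the input after every </div>, so each div ends its own line.
# Step 2 prints the anchor text only for lines where a div is directly followed by an anchor.
extract_div_titles() {
  sed -e 's/<\/div>/<\/div>\n/g' |
    sed -n -e 's/.*<div[^>]*><a[^>]*>\([^<]*\)<\/a>.*/\1/p'
}

# Count the extracted titles, as the original script does with wc -l.
printf '%s\n' '<div id="a"><a href="">one</a></div><div id="b"><a href="">two</a></div>' |
  extract_div_titles | wc -l
```

In the original script, the `curl` output would be piped into `extract_div_titles` in place of the two `sed` commands.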

Using nothing but Bash:

$ string='<div id="example""><a href="">title</a></div>'
$ pattern='.*>([^<]+)<.*'
$ [[ $string =~ $pattern ]]
$ target=${BASH_REMATCH[1]}
$ echo "$target"
title

There are lots of ways for this to fail. Here's one:

$ string='<div id="example""><a href="">title</a>this text will be grabbed instead</div>'

You can keep trying to make the regex more robust:

pattern='.*>([^<]+)</a.*'

but it's an uphill battle. Use a proper parser.
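If the regex route is unavoidable anyway, a minimal sketch is to wrap the stricter pattern in a function that checks whether the match succeeded before reading `BASH_REMATCH`, since a failed match otherwise leaves nothing useful to read. This requires Bash (not plain sh), the function name `link_text` is made up for this example, and it inherits every limitation described above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the text that sits right before a closing </a> in the string $1.
# Returns non-zero when the pattern does not match, so callers can detect failure.
link_text() {
  local pattern='.*>([^<]+)</a.*'
  if [[ $1 =~ $pattern ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"
  else
    return 1   # no match: report failure instead of printing an empty or stale capture
  fi
}

link_text '<div id="example""><a href="">title</a>this text will be grabbed instead</div>'
```

Because the leading `.*` is greedy, a string with several anchors yields only the last link text, which is one more reason to prefer a parser.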
