I have some html that I would like to pull a URL from using grep. Is there an elegant way to do this? So far, I'm using wget to dump the html into a tmp.html file. Then, this is what I'm doing:
awk '/<a href=/,/<\/a\>/' tmp.html | grep -v "sha1|md5" |grep -E "*.rpm?" | tail -1
Given a list of the following types of string, I'd like to pull out only the last .rpm URL on the list.
<td><a href="http://maven-whatever:8081/nexus/content/repositories/snapshots/com/whatever/whatever/adv-svcs/something/0.0.1-SNAPSHOT/something-0.0.1-20150227.161014-81-sles11_64.rpm">something-0.0.1-20150227.161014-81-sles11_64.rpm</a></td>
Using GNU awk for the 3rd arg to match() and given this input file:
$ cat file
<td><a href="http://maven-whatever:8081/nexus/content/repositories/snapshots/com/whatever/whatever/adv-svcs/something/0.0.1-SNAPSHOT/something-0.0.1-20150227.161014-81-sles11_64.rpm">something-0.0.1-20150227.161014-81-sles11_64.rpm</a></td>
This might be what you want:
$ cat tst.awk
match($0,/<a href=.*>(.*\.rpm)<\/a>/,a) && !/sha1|md5/ {url=a[1]} END{print url}
$ gawk -f tst.awk file
something-0.0.1-20150227.161014-81-sles11_64.rpm
or this:
$ cat tst.awk
match($0,/<a href="([^"]+\.rpm)".*<\/a>/,a) && !/sha1|md5/ {url=a[1]} END{print url}
$ gawk -f tst.awk file
http://maven-whatever:8081/nexus/content/repositories/snapshots/com/whatever/whatever/adv-svcs/something/0.0.1-SNAPSHOT/something-0.0.1-20150227.161014-81-sles11_64.rpm
but without more sample input and the expected output it's a guess.
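If gawk is not available, a rough portable equivalent of the second script can be sketched with sed and tail. This assumes one link per line, as in the sample above; the file name sample.html and the example.invalid URLs are made up for illustration.

```shell
# Hypothetical sample input; the real input would be the downloaded HTML.
cat > sample.html <<'EOF'
<td><a href="http://example.invalid/repo/pkg-1.rpm">pkg-1.rpm</a></td>
<td><a href="http://example.invalid/repo/pkg-2.rpm">pkg-2.rpm</a></td>
<td><a href="http://example.invalid/repo/pkg-2.rpm.md5">pkg-2.rpm.md5</a></td>
EOF

# -n suppresses sed's default output; the trailing p prints only lines where
# the substitution succeeded, reduced to the captured href value.
# The .md5 line is skipped because .rpm must be followed directly by the
# closing quote. tail keeps the last URL found.
portable_url=$(sed -n 's/.*<a href="\([^"]*\.rpm\)".*/\1/p' sample.html | tail -n 1)
printf '%s\n' "$portable_url"
```

Because the href must end in .rpm right before the closing quote, the separate sha1/md5 filter is not needed in this variant.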
The -o option causes grep to print out only the matches, instead of the full line which matches. If there is more than one match in a line, all of them will be printed.
*.rpm? looks like a shell glob, not a regular expression: read as a regex, the . matches any character and the m? makes the m optional. If you want to make the match meaningful, you'll need to be quite precise; possibly something like
grep -o '"[^"]*\.rpm"'
will give you more or less what you are looking for (but it will output the quotes as well, and will not deal with %-escapes in the URL).
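As a sketch of how the grep -o approach could be finished off, the quotes can be stripped with tr and the last match kept with tail. The contents written to tmp.html here are a made-up stand-in for the wget output, and the example.invalid URLs are hypothetical.

```shell
# Hypothetical sample standing in for the tmp.html produced by wget.
cat > tmp.html <<'EOF'
<td><a href="http://example.invalid/repo/something-0.0.1-80.rpm">something-0.0.1-80.rpm</a></td>
<td><a href="http://example.invalid/repo/something-0.0.1-80.rpm.sha1">something-0.0.1-80.rpm.sha1</a></td>
<td><a href="http://example.invalid/repo/something-0.0.1-81.rpm">something-0.0.1-81.rpm</a></td>
<td><a href="http://example.invalid/repo/something-0.0.1-81.rpm.md5">something-0.0.1-81.rpm.md5</a></td>
EOF

# -o prints only the matched text: a double-quoted string ending in .rpm.
# The .rpm.sha1 and .rpm.md5 attributes do not match, because .rpm must be
# immediately followed by the closing quote.
# tr removes the surrounding quotes; tail keeps the last match.
last_rpm_url=$(grep -o '"[^"]*\.rpm"' tmp.html | tr -d '"' | tail -n 1)
printf '%s\n' "$last_rpm_url"
```

This still treats the HTML as plain text, so it would also pick up any other double-quoted string ending in .rpm, not only href values.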
You could probably do better with awk, since you are using that anyway.
Parsing HTML with regular expressions is never going to be as robust or as easy as using a real HTML parser, as has been observed frequently here.