Can I get only the part of the string that matches with grep?

Question

I have some HTML that I would like to pull a URL from using grep. Is there an elegant way to do this? So far, I'm using wget to dump the HTML into a tmp.html file. Then, this is what I'm doing:

awk '/<a href=/,/<\/a\>/' tmp.html | grep -v "sha1|md5" |grep -E "*.rpm?" | tail -1

Given a list of lines like the following, I'd like to pull out only the last .rpm URL in the list.

<td><a href="http://maven-whatever:8081/nexus/content/repositories/snapshots/com/whatever/whatever/adv-svcs/something/0.0.1-SNAPSHOT/something-0.0.1-20150227.161014-81-sles11_64.rpm">something-0.0.1-20150227.161014-81-sles11_64.rpm</a></td>

Answer 1

Using GNU awk for the 3rd arg to match() and given this input file:

$ cat file
<td><a href="http://maven-whatever:8081/nexus/content/repositories/snapshots/com/whatever/whatever/adv-svcs/something/0.0.1-SNAPSHOT/something-0.0.1-20150227.161014-81-sles11_64.rpm">something-0.0.1-20150227.161014-81-sles11_64.rpm</a></td>

This might be what you want:

$ cat tst.awk
match($0,/<a href=.*>(.*\.rpm)<\/a>/,a) && !/sha1|md5/ {url=a[1]} END{print url}

$ gawk -f tst.awk file
something-0.0.1-20150227.161014-81-sles11_64.rpm

or this:

$ cat tst.awk
match($0,/<a href="([^"]+\.rpm)".*<\/a>/,a) && !/sha1|md5/ {url=a[1]} END{print url}

$ gawk -f tst.awk file
http://maven-whatever:8081/nexus/content/repositories/snapshots/com/whatever/whatever/adv-svcs/something/0.0.1-SNAPSHOT/something-0.0.1-20150227.161014-81-sles11_64.rpm

but without more sample input and the expected output it's a guess.

Answer 2

The -o option causes grep to print out only the matches, instead of the full line which matches. If there is more than one match in a line, all of them will be printed.
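As a small illustration of that behavior, the sketch below feeds grep a single line containing two quoted strings. The sample line and the pattern are made up for the demonstration. Note that -o is supported by GNU and BSD grep but is not part of POSIX.

```shell
# Hypothetical one-line input containing two quoted strings.
line='<a href="a.rpm">a</a> <a href="b.rpm">b</a>'

# -o prints each match on its own line instead of the whole input line.
matches=$(printf '%s\n' "$line" | grep -o '"[^"]*"')
printf '%s\n' "$matches"
```

Both quoted strings come out, one per line, even though they were on the same input line.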

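*.rpm? looks like a shell glob rather than a meaningful regular expression: in a regex, . matches any character and ? makes the preceding m optional, and POSIX leaves a leading * undefined in an extended regex. If you want to make the match meaningful, you'll need to be quite precise; possibly something like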

grep -o '"[^"]*\.rpm"'

will give you more or less what you are looking for (but it will output the quotes as well, and will not deal with %-escapes in the URL).
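One way the grep -o idea might be extended to drop the quotes and keep only the last URL is sketched below. It is only a sketch under assumptions: the sample HTML, the example.invalid host and the pkg-N file names are invented, and it relies on every link having the form href="..." on a single line. Anchoring the pattern on href= and requiring .rpm directly before the closing quote also skips the .sha1 and .md5 links without a separate grep -v.

```shell
# Sample input standing in for tmp.html (hypothetical host and file names).
html='<td><a href="http://example.invalid/repo/pkg-1.rpm">pkg-1.rpm</a></td>
<td><a href="http://example.invalid/repo/pkg-1.rpm.sha1">pkg-1.rpm.sha1</a></td>
<td><a href="http://example.invalid/repo/pkg-2.rpm">pkg-2.rpm</a></td>
<td><a href="http://example.invalid/repo/pkg-2.rpm.md5">pkg-2.rpm.md5</a></td>'

# 1. Print only the href attributes whose value ends in .rpm
#    (checksum links end in .sha1 or .md5, so they do not match).
# 2. Strip the leading href=" and the trailing quote.
# 3. Keep the last URL.
last_url=$(printf '%s\n' "$html" |
  grep -o 'href="[^"]*\.rpm"' |
  sed 's/^href="//; s/"$//' |
  tail -n 1)
printf '%s\n' "$last_url"
```

With a real page, the printf would be replaced by reading tmp.html directly, for example by passing the file name to grep.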

You could probably do better with awk, since you are using that anyway.
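For what it's worth, here is a rough sketch of an awk-only version that does not need GNU awk: it uses the two-argument match(), which sets RSTART and RLENGTH in any POSIX awk. The sample HTML, the example.invalid host and the pkg-N file names are invented for the example, and it assumes at most one .rpm link per line.

```shell
# Sample input standing in for tmp.html (hypothetical host and file names).
html='<td><a href="http://example.invalid/repo/pkg-1.rpm">pkg-1.rpm</a></td>
<td><a href="http://example.invalid/repo/pkg-1.rpm.sha1">pkg-1.rpm.sha1</a></td>
<td><a href="http://example.invalid/repo/pkg-2.rpm">pkg-2.rpm</a></td>
<td><a href="http://example.invalid/repo/pkg-2.rpm.md5">pkg-2.rpm.md5</a></td>'

last_url=$(printf '%s\n' "$html" | awk '
  # match() records where the attribute starts (RSTART) and how long it is (RLENGTH).
  match($0, /href="[^"]*\.rpm"/) {
    # Skip the 6-character href prefix (including its opening quote)
    # and drop the closing quote, leaving only the URL.
    url = substr($0, RSTART + 6, RLENGTH - 7)
  }
  # Each matching line overwrites url, so the last one is printed.
  END { print url }
')
printf '%s\n' "$last_url"
```

This replaces the whole awk | grep | grep | tail pipeline with one process; the checksum links are skipped because .rpm must be followed immediately by the closing quote.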

Parsing HTML with regular expressions is never going to be as robust or as easy as using a real HTML parser, as has been observed frequently here.

2 answers

solution1
2 ACCEPTED 2015-02-27 19:22:20

solution2
1 2015-02-27 18:12:02
