Can I get only the part of the string that matches with grep?

I have some html that I would like to pull a URL from using grep. Is there an elegant way to do this? So far, I'm using wget to dump the html into a tmp.html file. Then, this is what I'm doing:

awk '/<a href=/,/<\/a\>/' tmp.html | grep -v "sha1|md5" |grep -E "*.rpm?" | tail -1

Given a list of the following types of string, I'd like to pull out only the last .rpm URL on the list.

<td><a href="http://maven-whatever:8081/nexus/content/repositories/snapshots/com/whatever/whatever/adv-svcs/something/0.0.1-SNAPSHOT/something-0.0.1-20150227.161014-81-sles11_64.rpm">something-0.0.1-20150227.161014-81-sles11_64.rpm</a></td>

Using GNU awk for the 3rd arg to match() and given this input file:

$ cat file
<td><a href="http://maven-whatever:8081/nexus/content/repositories/snapshots/com/whatever/whatever/adv-svcs/something/0.0.1-SNAPSHOT/something-0.0.1-20150227.161014-81-sles11_64.rpm">something-0.0.1-20150227.161014-81-sles11_64.rpm</a></td>

This might be what you want:

$ cat tst.awk
match($0,/<a href=.*>(.*\.rpm)<\/a>/,a) && !/sha1|md5/ {url=a[1]} END{print url}

$ gawk -f tst.awk file
something-0.0.1-20150227.161014-81-sles11_64.rpm

or this:

$ cat tst.awk
match($0,/<a href="([^"]+\.rpm)".*<\/a>/,a) && !/sha1|md5/ {url=a[1]} END{print url}

$ gawk -f tst.awk file
http://maven-whatever:8081/nexus/content/repositories/snapshots/com/whatever/whatever/adv-svcs/something/0.0.1-SNAPSHOT/something-0.0.1-20150227.161014-81-sles11_64.rpm

but without more sample input and the expected output it's a guess.

The -o option causes grep to print out only the matches, instead of the full line which matches. If there is more than one match in a line, all of them will be printed.
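As a small illustration of that behavior, here is a sketch using a made-up one-line input (the file name sample.txt and its contents are assumptions for the demo, not from the question). It shows that several matches on one line each come out on their own line:

```shell
# Hypothetical sample input, invented for this demo.
printf '%s\n' 'a=1 b=22 c=333' > sample.txt

# Without -o, grep would print the whole line once.
# With -o, each run of digits is printed on its own line.
matches=$(grep -o '[0-9][0-9]*' sample.txt)
printf '%s\n' "$matches"
```

Here the single input line should produce three output lines, one per match.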

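*.rpm? looks like a shell glob rather than a meaningful regular expression: as a regex, the . matches any character and m? makes the m optional. If you want to make the match meaningful, you'll need to be quite precise; possibly something like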

grep -o '"[^"]*\.rpm"'

will give you more or less what you are looking for (but it will output the quotes as well, and will not deal with %-escapes in the URL).
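One way to turn that into the "last .rpm URL" the question asks for is sketched below. It assumes the dot is escaped so it matches a literal period, strips the quotes with tr, and keeps the last match with tail. The tmp.html contents and the example.invalid URLs are invented stand-ins for the wget output, not real data:

```shell
# Hypothetical stand-in for the wget dump; URLs are made up for the demo.
cat > tmp.html <<'EOF'
<td><a href="http://example.invalid/repo/something-0.0.1-80.rpm">something-0.0.1-80.rpm</a></td>
<td><a href="http://example.invalid/repo/something-0.0.1-81.rpm.sha1">something-0.0.1-81.rpm.sha1</a></td>
<td><a href="http://example.invalid/repo/something-0.0.1-81.rpm">something-0.0.1-81.rpm</a></td>
<td><a href="http://example.invalid/repo/something-0.0.1-81.rpm.md5">something-0.0.1-81.rpm.md5</a></td>
EOF

# -o prints only the quoted string ending in .rpm"; the closing quote
# right after .rpm keeps .rpm.sha1 and .rpm.md5 links out.
# tr removes the surrounding quotes; tail keeps the last match.
last_rpm=$(grep -o '"[^"]*\.rpm"' tmp.html | tr -d '"' | tail -n 1)
printf '%s\n' "$last_rpm"
```

Because the pattern requires the closing quote immediately after .rpm, a separate grep -v for the checksum files is not needed in this sketch. It still relies on the URL being double-quoted and on "last" meaning last in file order.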

You could probably do better with awk, since you are using that anyway.

Parsing HTML with regular expressions is never going to be as robust or as easy as using a real HTML parser, as has been observed frequently here.