
How to print only matches with sed?

Okay, this is an easy one, but I can't figure it out.

Basically I want to extract all links ( <a href="[^<>]*">[^<>]*</a> ) from a big html file.

I tried to do this with sed, but I get all kinds of results, just not what I want. I know that my regexp is correct, because I can replace all the links in a file:

sed 's_<a href="[^<>]*">[^<>]*</a>_TEST_g'

If I run that on something like

<div><a href="http://wwww.google.com">A google link</a></div>
<div><a href="http://wwww.google.com">A google link</a></div>

I get

<div>TEST</div>
<div>TEST</div>

How can I get rid of everything else and just print the matches instead? My preferred end result would be:

<a href="http://wwww.google.com">A google link</a>
<a href="http://wwww.google.com">A google link</a>

PS. I know that my regexp is not the most flexible one, but it's enough for my intentions.

Match the whole line, put the interesting part in a group, replace by the content of the group. Use the -n option to suppress non-matching lines, and add the p modifier to print the result of the s command.

sed -n -e 's!^.*\(<[Aa] [^<>]*>.*</[Aa]>\).*$!\1!p'
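To see the -n / p combination at work, here is a small sketch that runs the command above on made-up sample input (the `html` and `links` variable names and the example.com line are only for illustration):

```shell
# Hypothetical sample input: two lines with a link, one without.
html='<div><a href="http://wwww.google.com">A google link</a></div>
<p>no link here</p>
<div><a href="http://example.com">Another</a> trailing text</div>'

# -n turns off automatic printing; the p flag prints a line only when
# the substitution succeeded, so the link-free line disappears.
links=$(printf '%s\n' "$html" |
  sed -n -e 's!^.*\(<[Aa] [^<>]*>.*</[Aa]>\).*$!\1!p')

printf '%s\n' "$links"
```

This should print the two links, one per line, without the surrounding <div> markup or the trailing text.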

Note that if there are multiple links on the line, this only prints the last link. You can improve on that, but it goes beyond simple sed usage. The simplest method is to use two steps: first insert a newline after every closing </a> so that each link ends up on its own line, then extract the links. The g flag is needed so that every closing tag on the line is handled, and \n in the replacement requires GNU sed.

sed -n -e 's!</[Aa]>!&\n!gp' | sed -n -e 's!^.*\(<[Aa] [^<>]*>.*</[Aa]>\).*$!\1!p'
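A quick sketch of the two-step idea on a line with three links. It assumes GNU sed (for \n in the replacement) and a g flag on the first substitution so that every closing tag gets its own line; the sample `line` is made up.

```shell
# Hypothetical input: three links on one physical line.
line='<p><a href="/one">1</a> and <a href="/two">2</a>, then <A href="/three">3</A></p>'

links=$(printf '%s\n' "$line" |
  # Step 1: break the line after every closing tag (GNU sed: \n is a newline).
  sed -n -e 's!</[Aa]>!&\n!gp' |
  # Step 2: each resulting line now holds at most one link; extract it.
  sed -n -e 's!^.*\(<[Aa] [^<>]*>.*</[Aa]>\).*$!\1!p')

printf '%s\n' "$links"
```

Each of the three links comes out on its own line. The leftover </p> fragment matches nothing in step 2 and is dropped.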

This still doesn't handle HTML comments, <pre>, links that are spread over several lines, etc. When parsing HTML, use an HTML parser.

Assuming that there is only one hyperlink per line the following may work...

sed -e 's_.*<a href=_<a href=_' -e 's_>.*_>_'
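For what it's worth, here is a sketch of what this approach produces, assuming the command was meant to read `sed -e 's_.*<a href=_<a href=_' -e 's_>.*_>_'`. Cutting after the first `>` leaves only the opening tag, so the second command below is a hypothetical variant (not from the answer) that cuts after `</a>` instead to keep the whole link.

```shell
line='<div><a href="http://wwww.google.com">A google link</a></div>'

# As in the answer: drop everything before "<a href=" and after the first ">".
open_tag=$(printf '%s\n' "$line" |
  sed -e 's_.*<a href=_<a href=_' -e 's_>.*_>_')

# Variant (an assumption, not the answer's command): cut after "</a>" and,
# with -n plus the p flag, print only lines where that cut succeeded.
full_link=$(printf '%s\n' "$line" |
  sed -n -e 's_.*<a href=_<a href=_' -e 's_</a>.*_</a>_p')

printf '%s\n%s\n' "$open_tag" "$full_link"
```

The first result is just the opening tag; the second is the full link. Both still rely on there being one link per line.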

If you don't mind using perl like sed, it can cope with very diverse input:

perl -n -e 's+(<a href=.*?</a>)+ print $1, "\n" +eg;'

This might work for you (GNU sed):

sed '/<a href\>/!d;s//\n&/;s/[^\n]*\n//;:a;$!{/>/!{N;ba}};y/\n/ /;s//&\n/;P;D' file
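That one-liner is dense, so here is the same script spread over several lines with comments, as best I can trace it. This is a sketch assuming GNU sed; the `extract_open_tags` wrapper and the sample input are made up for illustration. As far as I can tell it prints the opening `<a href=...>` tag rather than the whole link, joining a tag that is split across lines.

```shell
# Hypothetical wrapper around the one-liner, with one command per line.
extract_open_tags() {
  sed '
    # Delete lines that contain no "<a href".
    /<a href\>/!d
    # Empty regex reuses the last one: put a newline before "<a href".
    s//\n&/
    # Throw away everything up to and including that newline.
    s/[^\n]*\n//
    # While there is no ">" yet and more input exists, append the next line.
    :a
    $!{
      />/!{
        N
        ba
      }
    }
    # Turn any appended newlines into spaces.
    y/\n/ /
    # Empty regex now means />/: break right after the first ">".
    s//&\n/
    # Print up to that newline, then delete it and rerun on the remainder.
    P
    D
  ' "$@"
}

html='<div><a href="http://wwww.google.com">A google link</a></div>
<div><a href="http://example.com"
class="x">split over two lines</a></div>
<p>no link on the final line</p>'

tags=$(printf '%s\n' "$html" | extract_open_tags)
printf '%s\n' "$tags"
```

The sample deliberately ends with a link-free line: on the last input line the `$!` guard skips the `/>/` test, so a link there is handled differently.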
