
Extract specific URLs from text

I want to extract URLs from this text:

<body>
<a href="http://domaine.com/t/text/text"> <img src="http://domaine.com/i/text/text"></a> <br>
<a href="http://domaine.com/text"></a> <br>
<a href="http://domaine.com"></a> <br>
<a href="http://domaine.com/text/text"></a> <br>
<a href="http://[GoTo]"></a> <br>
<a href="http://[NextURL]"></a> <br>
</body>

But I want to exclude URLs that match certain patterns from being extracted. Those patterns are:

http://***/i/***/***
http://***/t/***/***
http://[GoTo]
http://[NextURL]

which means I should get only these URLs as a result:

http://domaine.com/text
http://domaine.com
http://domaine.com/text/text

What I have done so far is use this regex:

$regex = '/https?\:\/\/[^\" ]+/i';
preg_match_all($regex, $string, $matches);
print_r($matches[0]);

But as you can see, this extracts all the URLs, and I don't know how to exclude some of them using my specific patterns.

What you are looking for is a negative lookahead:

$regex = '/https?:\/\/(?!\[GoTo\]|\[NextURL\]|[^\" ]*\/i\/[^\" ]+|[^\" ]*\/t\/[^\" ]*)[^\" ]+/i';

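The (?! ... ) group placed right after https?:\/\/ is a negative lookahead: the match is rejected whenever the text following the scheme matches one of the enclosed alternatives. Here the alternatives are the literal [GoTo], the literal [NextURL], and any URL that contains /i/ or /t/ before the closing quote or a space. Note that the /i/ and /t/ checks apply anywhere in the path rather than only to the first segment, and the trailing i flag makes every alternative case-insensitive. This might need tweaking for specific corner cases, but with the problem as stated, this should get you what you need.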
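
As a quick check, here is a minimal sketch that runs the regex above against the sample markup from the question. The variable names $string and $urls are only for illustration, and the input is assumed to be exactly the snippet shown in the question, so real pages may need a stricter pattern or an HTML parser.

```php
<?php
// Sample input copied from the question (nowdoc, so nothing is interpolated).
$string = <<<'HTML'
<body>
<a href="http://domaine.com/t/text/text"> <img src="http://domaine.com/i/text/text"></a> <br>
<a href="http://domaine.com/text"></a> <br>
<a href="http://domaine.com"></a> <br>
<a href="http://domaine.com/text/text"></a> <br>
<a href="http://[GoTo]"></a> <br>
<a href="http://[NextURL]"></a> <br>
</body>
HTML;

// Regex from the answer: the negative lookahead after the scheme rejects
// [GoTo], [NextURL], and any URL containing /i/ or /t/.
$regex = '/https?:\/\/(?!\[GoTo\]|\[NextURL\]|[^\" ]*\/i\/[^\" ]+|[^\" ]*\/t\/[^\" ]*)[^\" ]+/i';

preg_match_all($regex, $string, $matches);

// $matches[0] holds the full-pattern matches, i.e. the URLs that survived.
$urls = $matches[0];
print_r($urls);
```

For this input, $urls should contain http://domaine.com/text, http://domaine.com and http://domaine.com/text/text, in that order.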


 