
PHP regular expression to match a specific URL pattern

I'd like to "grab" a few hundred URLs from a few hundred HTML pages.

Pattern:

<h2><a href="http://www.the.url.might.be.long/urls.asp?urlid=1" target="_blank">The Website</a></h2>

Here is how to do it properly with the native DOM extension:

// Load the HTML page
$doc = new DOMDocument;
$doc->loadHTMLFile('http://example.com/');

// Run XPath to fetch the href attributes of all a elements
$xpath = new DOMXPath($doc);
$links = $xpath->query('//a/@href');

// Collect the value of each DOMAttr node in an array
$urls = array();
foreach ($links as $link) {
    $urls[] = $link->value;
}
print_r($urls);

Note that the above will also find relative links. If you don't want those, adjust the XPath to:

'//a/@href[starts-with(., "http")]'
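To tie this back to the pattern in the question, here is a minimal sketch that only collects absolute links sitting inside an h2 element. It parses an inline string with loadHTML() rather than fetching a page, and the second and third links in the sample markup are invented for illustration.

```php
<?php
// Sample markup modeled on the pattern in the question.
// The relative link and the link outside an h2 are made-up test cases.
$html = <<<'HTML'
<html><body>
<h2><a href="http://www.the.url.might.be.long/urls.asp?urlid=1" target="_blank">The Website</a></h2>
<h2><a href="/relative/page.asp?urlid=2">Relative link</a></h2>
<p><a href="http://example.com/not-in-a-heading">Other link</a></p>
</body></html>
HTML;

$doc = new DOMDocument;
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

// Only href attributes of a elements that are direct children of an h2
// and whose value starts with "http"
$xpath = new DOMXPath($doc);
$urls = array();
foreach ($xpath->query('//h2/a/@href[starts-with(., "http")]') as $attr) {
    $urls[] = $attr->value;
}
print_r($urls);
```

For a few hundred pages, the same query could be run in a loop over the page URLs or file names, calling loadHTMLFile() on a fresh DOMDocument each time.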

Note that using regex to match HTML is the road to madness. Regex matches string patterns and knows nothing about HTML elements and attributes. DOM does, which is why you should prefer it over regex for anything beyond matching a trivial string pattern in markup.

'/http:\/\/[^\/]+\/[^.]+\.asp\?urlid=\d+/'
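As a rough sketch of how such a pattern might be applied, the snippet below runs it through preg_match_all() on an inline sample string. It uses `#` as the delimiter so that the slashes in the URL need no escaping. The second and third links in the sample are invented for illustration.

```php
<?php
// Sample markup: the first link is from the question, the other two are made up.
$html = '<h2><a href="http://www.the.url.might.be.long/urls.asp?urlid=1" target="_blank">The Website</a></h2>'
      . '<h2><a href="http://another.example.org/urls.asp?urlid=27" target="_blank">Another</a></h2>'
      . '<h2><a href="http://example.com/about.html">No match</a></h2>';

// host, one path segment without dots, ".asp", then a numeric urlid
preg_match_all('#http://[^/]+/[^.]+\.asp\?urlid=\d+#', $html, $matches);

// $matches[0] holds the full-pattern matches
$urls = $matches[0];
print_r($urls);
```

The pattern is tied to this exact URL shape: it only matches http (not https) addresses with a numeric urlid, so any variation in the pages would need a matching change in the expression.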

But it is better to use an HTML parser. Here is an example with PHP Simple HTML DOM:

// file_get_html() is provided by the PHP Simple HTML DOM library
$html = file_get_html('http://www.google.com/');

// Find all links and print their href values
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}
