PHP regular expression to match a specific URL pattern

I'd like to "grab" a few hundred URLs from a few hundred HTML pages.

Pattern:

<h2><a href="http://www.the.url.might.be.long/urls.asp?urlid=1" target="_blank">The Website</a></h2>

Here is how to do it properly with the native DOM extension:

// GET file
$doc = new DOMDocument;
$doc->loadHTMLFile('http://example.com/');

// Run XPath to fetch the href attributes of all <a> elements
$xpath = new DOMXPath($doc);
$links = $xpath->query('//a/@href');

// collect href attribute values from all DomAttr in array
$urls = array();
foreach($links as $link) {
    $urls[] = $link->value;
}
print_r($urls);

Note that the above will also find relative links. If you don't want those, adjust the XPath to:

'//a/@href[starts-with(., "http")]'
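The same approach can be narrowed to the markup in the question, where each link sits inside an `<h2>`. The following is a minimal sketch, assuming the links are always direct children of an `<h2>`. The sample HTML string is made up for illustration, and `loadHTML()` is used on a string here instead of fetching a remote page.

```php
<?php
// Sample markup modelled on the pattern from the question (hypothetical data).
$html = <<<HTML
<html><body>
<h2><a href="http://www.the.url.might.be.long/urls.asp?urlid=1" target="_blank">The Website</a></h2>
<h2><a href="/relative/urls.asp?urlid=2">Relative link</a></h2>
<p><a href="http://example.com/other">Not inside an h2</a></p>
</body></html>
HTML;

$doc = new DOMDocument;
// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Only hrefs starting with "http" on <a> elements directly inside an <h2>.
$links = $xpath->query('//h2/a/@href[starts-with(., "http")]');

$urls = array();
foreach ($links as $link) {
    $urls[] = $link->value;
}
print_r($urls);
```

With this sample input, only the first link ends up in `$urls`: the second is relative and the third is not inside an `<h2>`.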

Note that using Regex to match HTML is the road to madness. Regex matches string patterns and knows nothing about HTML elements and attributes. DOM does, which is why you should prefer it over Regex for every situation that goes beyond matching a supertrivial string pattern from markup.

'/http:\/\/[^\/]+\/[^.]+\.asp\?urlid=\d+/'
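If you do go the regex route, the pattern can be applied with `preg_match_all()`. This is a rough sketch under a couple of assumptions: the sample HTML string is invented for illustration, and `#` is used as the delimiter so that the slashes in the URL need no escaping. Apart from the delimiter, the pattern is the one above.

```php
<?php
// Sample markup modelled on the pattern from the question (hypothetical data).
$html = '<h2><a href="http://www.the.url.might.be.long/urls.asp?urlid=1" target="_blank">The Website</a></h2>'
      . '<h2><a href="http://www.example.com/page.html">No match</a></h2>';

// "#" as the delimiter, so "/" can appear unescaped inside the pattern.
$pattern = '#http://[^/]+/[^.]+\.asp\?urlid=\d+#';

// $matches[0] holds every full match found in $html.
preg_match_all($pattern, $html, $matches);
print_r($matches[0]);
```

Only the first URL matches here, since the second does not end in `.asp?urlid=` followed by digits. Keep in mind that the pattern only covers `http://` URLs with a single path segment before the `.asp` file name.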

But it is better to use an HTML parser. Here is an example with PHP Simple HTML DOM:

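// Requires the PHP Simple HTML DOM library (third-party, not bundled with PHP)
include 'simple_html_dom.php';

$html = file_get_html('http://www.google.com/');

// Find all links and print their href attributes
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}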
