PHP regular expression to match a specific URL pattern

Question

I'd like to "grab" a few hundred URLs from a few hundred HTML pages.

Pattern:

<h2><a href="http://www.the.url.might.be.long/urls.asp?urlid=1" target="_blank">The Website</a></h2>

Answer 1

Here is how to do it properly with the native DOM extensions:

// Load the HTML file
$doc = new DOMDocument;
$doc->loadHTMLFile('http://example.com/');

// Run XPath to fetch all href attributes from a elements
$xpath = new DOMXPath($doc);
$links = $xpath->query('//a/@href');

// collect href attribute values from all DomAttr in array
$urls = array();
foreach($links as $link) {
    $urls[] = $link->value;
}
print_r($urls);

Note that the above will also find relative links. If you don't want those, adjust the XPath to:

'//a/@href[starts-with(., "http")]'
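To tie this back to the pattern in the question, here is a minimal sketch that restricts the query to links inside h2 elements. The sample markup, the second and third links, and the use of loadHTML on an inline string are assumptions for illustration; they are not part of the original answer.

```php
<?php
// Hypothetical sample markup modelled on the pattern in the question.
$html = <<<'HTML'
<html><body>
<h2><a href="http://www.the.url.might.be.long/urls.asp?urlid=1" target="_blank">The Website</a></h2>
<h2><a href="/relative/urls.asp?urlid=2">Relative link</a></h2>
<p><a href="http://example.com/other.html">Link outside an h2</a></p>
</body></html>
HTML;

$doc = new DOMDocument;
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// href attributes of <a> elements that are children of <h2>,
// limited to values starting with "http" (skips relative links).
$links = $xpath->query('//h2/a/@href[starts-with(., "http")]');

$urls = array();
foreach ($links as $link) {
    $urls[] = $link->value;
}
print_r($urls);
```

With this sample input only the first link should be collected: the relative link fails the starts-with test and the third link is not inside an h2.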

Note that using regex to match HTML is the road to madness. Regex matches string patterns and knows nothing about HTML elements and attributes. DOM does, which is why you should prefer it over regex for every situation that goes beyond matching a supertrivial string pattern from markup.
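One way to see the difference is a link that has been commented out. The sketch below is only an illustration: the markup, the URLs, and the simple href regex are all made up for the comparison.

```php
<?php
// Hypothetical markup: one link inside an HTML comment, one live link.
$html = '<html><body>'
      . '<!-- <a href="http://example.com/old.asp?urlid=0">old</a> -->'
      . '<h2><a href="http://example.com/urls.asp?urlid=1">live</a></h2>'
      . '</body></html>';

// Regex approach: sees only text, so it also matches the commented-out link.
preg_match_all('#href="([^"]+)"#', $html, $m);
$regexUrls = $m[1];

// DOM approach: the comment is a comment node, not an <a> element.
$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$domUrls = array();
foreach ($xpath->query('//a/@href') as $attr) {
    $domUrls[] = $attr->value;
}

echo count($regexUrls) . " URL(s) via regex, " . count($domUrls) . " via DOM\n";
```

Here the regex reports both URLs, while the DOM query reports only the link that is actually part of the document.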

Answer 2

'/http:\/\/[^\/]+\/[^.]+\.asp\?urlid=\d+/'
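As a rough sketch of how such a pattern might be applied, the snippet below runs it through preg_match_all on an inline string. The second link is a made-up sample, and the pattern uses # as the delimiter so the slashes in the URL need no escaping; otherwise it is meant to match the same thing.

```php
<?php
// Hypothetical input: the markup from the question plus one made-up link.
$html = '<h2><a href="http://www.the.url.might.be.long/urls.asp?urlid=1" target="_blank">The Website</a></h2>'
      . '<h2><a href="http://www.another.example/dir/urls.asp?urlid=42" target="_blank">Another</a></h2>';

// "#" as delimiter, so "/" inside the pattern is an ordinary character.
$pattern = '#http://[^/]+/[^.]+\.asp\?urlid=\d+#';

// $matches[0] holds every full match found in $html.
preg_match_all($pattern, $html, $matches);
print_r($matches[0]);
```

Note that [^.]+ stops at the first dot in the path, so a URL with a dot in a directory name before the .asp file would not match.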

But it is better to use an HTML parser; here is an example with PHP Simple HTML DOM:

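// Assumes the PHP Simple HTML DOM library file is available at this path
include 'simple_html_dom.php';

$html = file_get_html('http://www.google.com/');

// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}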

Answer 1
3 2010-03-28 09:20:07

Answer 2
1 2010-03-28 09:02:25
