How to extract URLs and text from HTML markup using regular expressions

Question

<!-- This Div repeated in HTML with different properties value -->

<div style="position:absolute; overflow:hidden; left:220px; top:785px; width:347px; height:18px; z-index:36">

<!-- Only Unique Thing is This in few pages -->
<a href="http://link.domain.com/?id=123" target="_parent">

<!-- OR in some pages Only Unique Thing is This, ending with mp3 extension -->
<a href="http://domain.com/song-title.mp3" target="_parent">

    <!-- This Div also repeated multiple in HTML -->

    <FONT style="font-size:10pt" color=#000000 face="Tahoma">
        <DIV><B>Harjaiyaan</B> - Nandini Srikar</DIV>
    </FONT>
</a>

</DIV>

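We have very dirty HTML markup, generated by some program or application. We want to extract the URLs as well as the text from this code.

The href contains one of two types of URL. URL pattern 1: 'http://link.domain.com/?id=123'. URL pattern 2: 'http://domain.com/song-title.mp3'.

The first URL follows a specific pattern, but the second has no pattern other than ending with the ".mp3" extension.

Is there a function that can extract the URLs matching these patterns, along with the text, from this code?

Note: without using the DOM, is there a way to match the href and the text in between with a regular expression? preg_match?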

Answer 1

Use the DOMDocument class and proceed like this.

$dom = new DOMDocument;
$dom->loadHTML($html); // pass your HTML source here
foreach ($dom->getElementsByTagName('a') as $tag) {

    echo $tag->getAttribute('href'); // the URL
    echo $tag->nodeValue; // the text content between the <a> tags

}
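
As a rough sketch of how the loop above could return each URL together with its text instead of echoing them, the version below collects the pairs into an array. The helper name extractLinks and the sample markup are assumptions made for illustration, not part of the original answer. It also silences libxml warnings, on the assumption that dirty markup will trigger them, and collapses the whitespace that the nested FONT and DIV tags leave in the text.

```php
<?php
// Sample markup modelled on the question's snippet (simplified, assumed).
$html = <<<HTML
<div style="position:absolute; overflow:hidden">
<a href="http://link.domain.com/?id=123" target="_parent">
    <FONT style="font-size:10pt" color=#000000 face="Tahoma">
        <DIV><B>Harjaiyaan</B> - Nandini Srikar</DIV>
    </FONT>
</a>
</DIV>
HTML;

// Hypothetical helper: returns a list of ['href' => ..., 'text' => ...] pairs.
function extractLinks(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $tag) {
        $links[] = [
            'href' => $tag->getAttribute('href'),
            // Collapse runs of whitespace left over from the nested markup.
            'text' => trim(preg_replace('/\s+/', ' ', $tag->textContent)),
        ];
    }
    return $links;
}

print_r(extractLinks($html));
```

Returning an array rather than echoing keeps the extraction separate from the output, so the same pairs can be filtered or stored afterwards.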

Answer 2

Extending @Shankar Damodaran's answer:

$html = file_get_contents('source.htm');

$dom = new DOMDocument;
$dom->loadHTML($html); 
foreach ($dom->getElementsByTagName('a') as $tag) {

    if (strstr($tag->getAttribute('href'),'?id=') !== false) {
        echo $tag->getAttribute('href') . "<br>\n";
    }

}

Then do the same for the MP3s:

$html = file_get_contents('source.htm');

$dom = new DOMDocument;
$dom->loadHTML($html); 
foreach ($dom->getElementsByTagName('a') as $tag) {

    if (strstr($tag->getAttribute('href'),'.mp3') !== false) {
        echo $tag->getAttribute('href') . "<br>\n";
    }

}
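
The two loops above could probably be merged into a single pass that also keeps the link text. The sketch below is one way to do that, under a few assumptions: the helper names isWantedHref and extractWantedLinks are hypothetical, pattern 1 is taken to mean "the query string has an id parameter", and pattern 2 is taken to mean "the URL path ends with .mp3". Note that the substring checks with strstr also match URLs that merely contain '.mp3' or '?id=' somewhere in the middle, whereas this version inspects the parsed path and query.

```php
<?php
// Hypothetical helper: true when the href looks like one of the two
// URL patterns described in the question.
function isWantedHref(string $href): bool
{
    // parse_url returns null or false when a component is missing or the
    // URL is malformed; the cast turns both into an empty string.
    $path  = (string) parse_url($href, PHP_URL_PATH);
    $query = (string) parse_url($href, PHP_URL_QUERY);

    // Pattern 1: a query string carrying an id parameter, e.g. ?id=123
    parse_str($query, $params);
    if (isset($params['id'])) {
        return true;
    }

    // Pattern 2: the path ends with the .mp3 extension (case-insensitive).
    return strtolower(substr($path, -4)) === '.mp3';
}

// Hypothetical helper: one pass over all <a> tags, keeping URL and text.
function extractWantedLinks(string $html): array
{
    libxml_use_internal_errors(true); // dirty markup may trigger warnings
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $tag) {
        $href = $tag->getAttribute('href');
        if (isWantedHref($href)) {
            $links[] = [
                'href' => $href,
                'text' => trim(preg_replace('/\s+/', ' ', $tag->textContent)),
            ];
        }
    }
    return $links;
}
```

Calling extractWantedLinks(file_get_contents('source.htm')) would then parse the file once instead of twice.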

Answer 1
2 2014-02-08 11:04:35

Answer 2
1 2014-02-08 11:38:28
