How do I parse this HTML with a regular expression?

Question

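I'm trying to write a regular expression that extracts the href and the anchor text for a list of URLs from an HTML source. The anchor text can be any value.

The relevant part of the HTML looks like this: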

<div class="links"><a rel="nofollow" target="_blank" href="http://url1.com" class="get-all">URL1</a><a rel="nofollow" target="_blank" href="http://url2.com" class="get-all">This is Url-2</a><a rel="nofollow" target="_blank" href="http://url3.com" class="get-all">This is Url-3</a><a rel="nofollow" target="_blank" href="http://url4.com" class="get-all">Sweet URL 4</a></div>

I tried the following regular expression, but it doesn't work: it matches everything up to the last closing </a> tag and so fails.

preg_match_all('/<a rel="nofollow" target="_blank" href="(.*)" class="get-all">(.*)<\/a>/', $source, $website_array);

What regular expression would correctly extract the data I need?

Answer 1

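If you must know: the expression is greedy, so it will most likely match from the start of the first anchor to the end of the last one. The /U modifier fixes that: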

preg_match_all('/<a rel="nofollow" target="_blank" href="(.*)" class="get-all">(.*)<\/a>/U', $source, $website_array);

Note that pcre.backtrack_limit applies to ungreedy patterns.

Using negated character classes may give better performance:

preg_match_all('/<a rel="nofollow" target="_blank" href="([^"]*)" class="get-all">([^<]*)<\/a>/', $source, $website_array);

This will have trouble with tags inside the anchor itself.
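To make the difference between the three patterns concrete, here is a minimal sketch that runs them side by side. It assumes the class is "get-all", as in the sample HTML, and uses a shortened two-link copy of that markup; the variable names ($prefix, $m1, $nestedSource and so on) are only illustrative.

```php
<?php
// Shortened copy of the sample markup from the question (two links, one line).
$source = '<div class="links">'
    . '<a rel="nofollow" target="_blank" href="http://url1.com" class="get-all">URL1</a>'
    . '<a rel="nofollow" target="_blank" href="http://url2.com" class="get-all">This is Url-2</a>'
    . '</div>';

// Fixed part shared by all three patterns (contains no regex metacharacters).
$prefix = '<a rel="nofollow" target="_blank" href="';

// Greedy: the first (.*) runs on to the LAST '" class="get-all">', so both
// anchors collapse into a single match.
$greedy = preg_match_all('/' . $prefix . '(.*)" class="get-all">(.*)<\/a>/', $source, $m1);

// Ungreedy (/U): each (.*) stops at the first possible position.
$lazy = preg_match_all('/' . $prefix . '(.*)" class="get-all">(.*)<\/a>/U', $source, $m2, PREG_SET_ORDER);

// Negated classes: the href cannot cross a quote, the text cannot cross a '<'.
$negated = preg_match_all('/' . $prefix . '([^"]*)" class="get-all">([^<]*)<\/a>/', $source, $m3, PREG_SET_ORDER);

// The limitation mentioned above: markup inside the anchor defeats ([^<]*).
$nestedSource = $prefix . 'http://url5.com" class="get-all"><b>Bold</b> link</a>';
$nested = preg_match_all('/' . $prefix . '([^"]*)" class="get-all">([^<]*)<\/a>/', $nestedSource, $m4);

echo "greedy: $greedy, ungreedy: $lazy, negated: $negated, nested: $nested\n";
foreach ($m3 as $match) {
    // $match[1] is the href, $match[2] is the anchor text.
    echo $match[1], ' => ', $match[2], "\n";
}
```

With this input the greedy pattern finds one oversized match, the other two find both links, and the anchor containing a nested <b> tag is not matched at all.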

Given the limitations above, I would seriously consider using an HTML parser:

$d = new DOMDocument;
$d->loadHTML($source);
$xp = new DOMXPath($d);
foreach ($xp->query('//a[@class="get-all"][@rel="nofollow"][@target="_blank"]') as $anchor) {
    $href = $anchor->getAttribute('href');
    $text = $anchor->nodeValue;
}


This will happily handle attributes in a different order, lets you run further queries inside each anchor, and so on.
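As a rough illustration of that point, the sketch below feeds the parser anchors whose attributes are in a different order and whose text contains nested markup. The input is made up for the example (the "skip-me" link and the other.example URL are hypothetical), and the class "get-all" is taken from the sample HTML.

```php
<?php
// Made-up variation of the sample: reordered attributes, a nested <b> tag,
// and one link with a different class that should be skipped.
$source = '<div class="links">'
    . '<a class="get-all" href="http://url1.com" rel="nofollow" target="_blank">URL1</a>'
    . '<a rel="nofollow" target="_blank" href="http://url2.com" class="get-all"><b>This is</b> Url-2</a>'
    . '<a href="http://other.example" class="skip-me">Ignored</a>'
    . '</div>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($source);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];
// Only anchors with class "get-all" that are direct children of div.links.
foreach ($xpath->query('//div[@class="links"]/a[@class="get-all"]') as $anchor) {
    $links[] = [
        'href' => $anchor->getAttribute('href'),
        'text' => $anchor->textContent, // text of all descendants, tags stripped
    ];
}

print_r($links);
```

Both "get-all" links are collected regardless of attribute order, and the text of the second one comes back as "This is Url-2" with the <b> tag stripped.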

Answer 2

Try:

preg_match_all('/<a[^>]+href="([^"]+)"[^>]*>([^>]+)<\/a>/is', $source , $website_array);

It will match all links and return an array containing the information. Note:

[^"] matches any character except "
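For reference, here is a small sketch of what the returned array looks like with this pattern. It uses a shortened two-link copy of the sample HTML, and pairing the results with array_combine is just one possible way to consume them, not part of the pattern itself.

```php
<?php
// Shortened copy of the sample markup from the question.
$source = '<div class="links">'
    . '<a rel="nofollow" target="_blank" href="http://url1.com" class="get-all">URL1</a>'
    . '<a rel="nofollow" target="_blank" href="http://url2.com" class="get-all">This is Url-2</a>'
    . '</div>';

$count = preg_match_all('/<a[^>]+href="([^"]+)"[^>]*>([^>]+)<\/a>/is', $source, $website_array);

// With the default PREG_PATTERN_ORDER:
//   $website_array[0] holds the full matches,
//   $website_array[1] holds the href values,
//   $website_array[2] holds the anchor texts.
$links = array_combine($website_array[1], $website_array[2]);

print_r($links);
```

Here $links maps each href to its anchor text, so a duplicate href would overwrite an earlier entry.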

Answer 3

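While using regex to parse HTML is generally not a good idea (I suggest you take a look at the DOMDocument class for a better solution), there are cases where it's fine to use it: when you have a very specific idea of what you want to extract, and you can be sure that the variable text will never actually break your regex.

In your case, you can try: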

$pattern = '#<a rel="nofollow" target="_blank" href="(.*)" class="get-all">(.*)</a>#U';
preg_match_all($pattern, $source, $website_array);

Note the ungreedy modifier (U) at the end. It is very important to match only the smallest possible match.

Answer 4

Alternatively, you can do it like this:

<?php
$html = <<<HTML
<div class="links"><a rel="nofollow" target="_blank" href="http://url1.com" class="get-all">URL1</a><a rel="nofollow" target="_blank" href="http://url2.com" class="get-all">This is Url-2</a><a rel="nofollow" target="_blank" href="http://url3.com" class="get-all">This is Url-3</a><a rel="nofollow" target="_blank" href="http://url4.com" class="get-all">Sweet URL 4</a></div>
HTML;


$xml = new DOMDocument();
@$xml->loadHTML($html);

$links=array();
$i=0;
//Get all divs
foreach($xml->getElementsByTagName('div') as $divs) {
    //if this div has a class="links"
    if($divs->getAttribute('class')=='links'){
        //loop through this div
        foreach($divs->getElementsByTagName('a') as $a){
            //if this a tag does not have a class="get-all" continue to next
            if($a->getAttribute('class')!='get-all')
            continue;

            //Assign values to the links array
            $links[$i]['href']=$a->getAttribute('href');
            $links[$i]['value']=$a->nodeValue;
            $i++;
        }

    }
}

print_r($links);
/*
Array
(
    [0] => Array
        (
            [href] => http://url1.com
            [value] => URL1
        )

    [1] => Array
        (
            [href] => http://url2.com
            [value] => This is Url-2
        )

    [2] => Array
        (
            [href] => http://url3.com
            [value] => This is Url-3
        )

    [3] => Array
        (
            [href] => http://url4.com
            [value] => Sweet URL 4
        )

)
*/
?>

4 answers

Answer 1: score 6, accepted, 2013-02-19 23:15:57
Answer 2: score 2, 2013-02-19 23:16:30
Answer 3: score 1, 2013-02-19 23:18:21
Answer 4: score 0, 2013-02-19 23:24:39
