(PHP) Regex for finding specific href tags

Question

I have an HTML document containing n "a href" tags, each with a different target URL and different text between the tags.

For example:

<a href="http://www.example.com/d?12345abc" name="example"><span ....>lorem ipsum</span></a>
<a href="http://www.example.com/d/d?abc1234" name="example2"><span ....>example</span></a>
<a href="http://www.example.com/d.1234" name="example3">example3</a>
<a href="http://www.example.com/d/d.1234" name="example4"><img ...>test</img></a>
<a href="http://www.example.com/without_d/1234" name="example3">without a d as target url</a>

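As you can see, the target URLs switch between "d?, d., d/d?, d/d.". Between the "a" tags there can be any kind of HTML that the W3C allows.

I need a regex that gives me all links that have one of these combinations in the target URL ("d?, d., d/d?, d/d.") and that have "Lorem" or "test" anywhere between the "a" tags, including inside child HTML tags.

My regex so far: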

href=[\"\']([^>]*?/[d]+[.|\?][^"]*?[\"\'][^>]*[/]?>.*?</a>)

I tried to include lorem / test like this:

href=[\"\']([^>]*?/[d]+[.|\?][^"]*?[\"\'][^>]*[/]?>(lorem|test)+</a>)

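But this only works if I put a ".*?" before and after (lorem|test), and that would be too greedy.

If there is an easier way with SimpleXML or any other DOM parser, please let me know. Otherwise, I would appreciate any help with the regex.

Thanks!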

Answer 1

Here you go:

$html = array
(
    '<a href="http://www.example.com/d?12345abc" name="example"><span ....>lorem ipsum</span></a>',
    '<a href="http://www.example.com/d/d?abc1234" name="example2"><span ....>example</span></a>',
    '<a href="http://www.example.com/d.1234" name="example3">example3</a>',
    '<a href="http://www.example.com/d/d.1234" name="example4"><img ...>test</img></a>',
    '<a href="http://www.example.com/without_d/1234" name="example3">without a d as target url</a>',
);

$html = implode("\n", $html);
$result = array();
$anchors = phXML($html, '//a[contains(., "lorem") or contains(., "test")]');

foreach ($anchors as $anchor)
{
    if (preg_match('~d[.?]~', strval($anchor['href'])) > 0)
    {
        $result[] = strval($anchor['href']);
    }
}

echo '<pre>';
print_r($result);
echo '</pre>';

Output:

Array
(
    [0] => http://www.example.com/d?12345abc
    [1] => http://www.example.com/d/d.1234
)

The phXML() function is my DOMDocument / SimpleXML wrapper, shown below:

function phXML($xml, $xpath = null)
{
    if (extension_loaded('libxml') === true)
    {
        libxml_use_internal_errors(true);

        if ((extension_loaded('dom') === true) && (extension_loaded('SimpleXML') === true))
        {
            if (is_string($xml) === true)
            {
                $dom = new DOMDocument();

                if (@$dom->loadHTML($xml) === true)
                {
                    return phXML(@simplexml_import_dom($dom), $xpath);
                }
            }

            else if ((is_object($xml) === true) && (strcmp('SimpleXMLElement', get_class($xml)) === 0))
            {
                if (isset($xpath) === true)
                {
                    $xml = $xml->xpath($xpath);
                }

                return $xml;
            }
        }
    }

    return false;
}

I'm too lazy right now to rewrite this without the function, but I'm sure you can get rid of it if you need to.
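For readers who would rather not carry the wrapper, the same idea can be sketched with DOMDocument and DOMXPath directly. This is only an illustration under a few assumptions: find_links() is a hypothetical helper name, the sample markup is a cleaned-up version of the question's anchors, and the word check uses stripos() so that "Lorem" and "lorem" both count.

```php
<?php
// Sketch only: find_links() is a hypothetical helper, not part of the answer above.
function find_links(string $html, array $words): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parse warnings instead of emitting them
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath  = new DOMXPath($dom);
    $result = [];

    foreach ($xpath->query('//a[@href]') as $anchor) {
        $href = $anchor->getAttribute('href');

        // The target URL must contain "/d?" or "/d." (this also covers "d/d?" and "d/d.").
        if (preg_match('~/d[.?]~', $href) !== 1) {
            continue;
        }

        // textContent includes the text of all child elements.
        foreach ($words as $word) {
            if (stripos($anchor->textContent, $word) !== false) {
                $result[] = $href;
                break;
            }
        }
    }

    return $result;
}

// Simplified sample markup (assumption: attributes the question elided are left out).
$html = implode("\n", [
    '<a href="http://www.example.com/d?12345abc" name="example"><span>lorem ipsum</span></a>',
    '<a href="http://www.example.com/d/d?abc1234" name="example2"><span>example</span></a>',
    '<a href="http://www.example.com/d.1234" name="example3">example3</a>',
    '<a href="http://www.example.com/d/d.1234" name="example4"><img src="x.png">test</a>',
    '<a href="http://www.example.com/without_d/1234" name="example3">without a d as target url</a>',
]);

print_r(find_links($html, ['lorem', 'test']));
```

With this sample it should return the same two URLs as the output above. The pattern '~/d[.?]~' is slightly stricter than '~d[.?]~' because it requires the "d" to follow a slash.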

Answer 2

Here is a regex that works:

$search = '/<a\s[^>]*href=["\'](?:http:\/\/)?(?:[a-z0-9-]+(?:\.[a-z0-9-]+)*)\/(?:d\/)?d[?.].*?>.*?(?:lorem|test)+.*?<\/a>/i';
$matches = array();
preg_match_all($search, $html, $matches);

The only problem is that it relies on there being a newline between each <a> tag. Otherwise it will match something like this:

<a href="http://www.example.com/d.1234" name="example3">example3</a><a href="http://www.example.com/d/d.1234" name="example4"><img ...>test</img></a>
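One possible way around the newline dependency, offered as a sketch rather than a fix to the regex above, is to keep the pattern from running past a closing </a> by using a negative lookahead before every character of the link body. The pattern below is a different, simplified pattern written for illustration; it also captures the href in group 1.

```php
<?php
// Two anchors on a single line, as in the problematic case above.
$html = '<a href="http://www.example.com/d.1234" name="example3">example3</a>'
      . '<a href="http://www.example.com/d/d.1234" name="example4"><img ...>test</img></a>';

// (?:(?!</a>).)* consumes any character as long as it does not start "</a>",
// so the link body can never spill over into the next anchor.
$search = '~<a\s[^>]*href=["\']([^"\']*/d[?.][^"\']*)["\'][^>]*>'
        . '(?:(?!</a>).)*?(?:lorem|test)(?:(?!</a>).)*</a>~is';

$matches = array();
preg_match_all($search, $html, $matches);

print_r($matches[1]); // captured href values
```

On this input only the second anchor should match, so $matches[1] holds just http://www.example.com/d/d.1234. It is still a regex working on HTML, so unusual markup (for example a ">" inside an attribute value) can defeat it.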

Answer 3

Use an HTML parser. There are many reasons why regex is absolutely not the right tool for parsing HTML.

There is a good list of parsers here: Robust and Mature HTML Parser for PHP

Answer 4

This will print only the first and fourth links, because they are the only ones that meet both conditions.

preg_match_all('#href="(.*?)"(.*?)>(.*?)</a>#is', $string, $matches);
$count = count($matches[0]);
unset($matches[0], $matches[2]);

for($i = 0; $i < $count; $i++){

    if(
        strpos($matches[1][$i], '/d') !== false 
        &&
        preg_match('#(lorem|test)#is', $matches[3][$i]) == true
    )
    {
        echo $matches[1][$i];    
    }

}

4 answers

Answer 1
2, accepted, 2011-07-18 01:18:56

Answer 2
1, 2011-07-18 01:21:34

Answer 3
0, 2011-07-18 01:02:40

Answer 4
0, 2011-07-18 01:19:42
