(PHP) Regex for finding specific href tag

I have an HTML document with n "a href" tags that have different target URLs and different content between the opening and closing tags.

For example:

<a href="http://www.example.com/d?12345abc" name="example"><span ....>lorem ipsum</span></a>
<a href="http://www.example.com/d/d?abc1234" name="example2"><span ....>example</span></a>
<a href="http://www.example.com/d.1234" name="example3">example3</a>
<a href="http://www.example.com/d/d.1234" name="example4"><img ...>test</img></a>
<a href="http://www.example.com/without_d/1234" name="example3">without a d as target url</a>

As you can see, the target URLs vary between "d?", "d.", "d/d?" and "d/d.", and between the opening and closing "a" tags there can be any HTML that the W3C allows.

I need a regex that returns all links whose target URL contains one of these combinations ("d?", "d.", "d/d?", "d/d.") and that have "Lorem" or "test" anywhere between the "a" tags, including inside nested HTML tags.

My regex so far:


I tried to include lorem / test as follows:


but this only works if I put a ".*?" before and after the (lorem|test), and that would be too greedy.

If there is an easier way with SimpleXML or any other DOM parser, please let me know. Otherwise I would appreciate any help with the regex.


Here you go:

$html = array(
    '<a href="http://www.example.com/d?12345abc" name="example"><span ....>lorem ipsum</span></a>',
    '<a href="http://www.example.com/d/d?abc1234" name="example2"><span ....>example</span></a>',
    '<a href="http://www.example.com/d.1234" name="example3">example3</a>',
    '<a href="http://www.example.com/d/d.1234" name="example4"><img ...>test</img></a>',
    '<a href="http://www.example.com/without_d/1234" name="example3">without a d as target url</a>',
);

$html = implode("\n", $html);
$result = array();

// Select every anchor whose text (including nested tags) contains "lorem" or "test".
$anchors = phXML($html, '//a[contains(., "lorem") or contains(., "test")]');

foreach ($anchors as $anchor) {
    // Keep only hrefs that contain "d." or "d?".
    if (preg_match('~d[.?]~', strval($anchor['href'])) > 0) {
        $result[] = strval($anchor['href']);
    }
}

echo '<pre>';
print_r($result);
echo '</pre>';

Output:

    [0] => http://www.example.com/d?12345abc
    [1] => http://www.example.com/d/d.1234

The phXML() function is based on my DOMDocument / SimpleXML wrapper, and goes as follows:

function phXML($xml, $xpath = null)
{
    if (extension_loaded('libxml') === true) {

        if ((extension_loaded('dom') === true) && (extension_loaded('SimpleXML') === true)) {
            if (is_string($xml) === true) {
                // Parse the HTML string, then recurse with the SimpleXML view of it.
                $dom = new DOMDocument();

                if (@$dom->loadHTML($xml) === true) {
                    return phXML(@simplexml_import_dom($dom), $xpath);
                }
            } else if ((is_object($xml) === true) && (strcmp('SimpleXMLElement', get_class($xml)) === 0)) {
                // Apply the XPath query, if one was given.
                if (isset($xpath) === true) {
                    $xml = $xml->xpath($xpath);
                }

                return $xml;
            }
        }
    }

    return false;
}

I'm too lazy not to use this function right now, but I'm sure you can get rid of it if you need to.

Here is a regular expression which works:
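For readers who would rather not carry the phXML() wrapper around, here is a minimal sketch of the same idea using DOMDocument and DOMXPath directly. The function name findLinks() and its $words parameter are made up for illustration and are not part of the answer above. Note one deliberate difference: the text check uses stripos(), so it is case-insensitive, whereas the XPath contains() call above is case-sensitive.

```php
<?php
// Hypothetical helper (not from the answer above): returns the href of every
// <a> whose text contains one of $words and whose href contains "d?" or "d.".
function findLinks(string $html, array $words = ['lorem', 'test']): array
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $result = [];

    foreach ($xpath->query('//a[@href]') as $anchor) {
        $href = $anchor->getAttribute('href');

        // Same URL test as in the answer above: "d." or "d?" somewhere in the href.
        if (preg_match('~d[.?]~', $href) !== 1) {
            continue;
        }

        // textContent covers text inside nested tags as well.
        foreach ($words as $word) {
            if (stripos($anchor->textContent, $word) !== false) {
                $result[] = $href;
                break;
            }
        }
    }

    return $result;
}
```

Calling findLinks($html) on the imploded sample string should give the same two URLs as the output shown above, assuming the parser recovers from the placeholder attributes in the sample markup.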

$search = '/<a\s[^>]*href=["\'](?:http:\/\/)?(?:[a-z0-9-]+(?:\.[a-z0-9-]+)*)\/(?:d\/)?d[?.].*?>.*?(?:lorem|test)+.*?<\/a>/i';
$matches = array();
preg_match_all($search, $html, $matches);

The only catch is that it relies on there being a newline character between each `<a>` tag, because "." does not match a newline unless the s modifier is set. Otherwise it will match something like:

<a href="http://www.example.com/d.1234" name="example3">example3</a><a href="http://www.example.com/d/d.1234" name="example4"><img ...>test</img></a>
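One possible way around that, sketched here as a variation and not as the original answerer's method, is to stop the pattern from running past a closing tag: use [^>]* for the rest of the opening tag, and a negative lookahead so that the "any character" parts can never step over a </a>. The function name matchAnchors() is hypothetical, and the pattern still assumes there is no ">" inside attribute values, so a parser remains the more robust choice.

```php
<?php
// Hypothetical helper: returns the full text of every matching <a>...</a>.
function matchAnchors(string $html): array
{
    $pattern = '~<a\s[^>]*href=["\']'                 // opening tag up to the href value
             . '(?:https?://)?[a-z0-9-]+(?:\.[a-z0-9-]+)*'  // optional scheme, then host
             . '/(?:d/)?d[?.]'                        // "/d?", "/d.", "/d/d?" or "/d/d."
             . '[^>]*>'                               // rest of the opening tag only
             . '(?:(?!</a>).)*?'                      // content, never crossing </a>
             . '(?:lorem|test)'
             . '(?:(?!</a>).)*'
             . '</a>~is';                             // i: ignore case, s: "." matches newlines

    preg_match_all($pattern, $html, $matches);

    return $matches[0];
}
```

With this version the two adjacent anchors shown above should no longer be merged into one match: only the second one, which actually contains "test", is expected to be returned.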

Use an HTML parser. There are lots of reasons why regex is absolutely not the solution for parsing HTML.

There's a good list of them here: Robust and Mature HTML Parser for PHP

This will print only the first and fourth links, because they are the only ones that meet both conditions.

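// Capture the href value (group 1) and the content up to </a> (group 3).
preg_match_all('#href="(.*?)"(.*?)>(.*?)</a>#is', $string, $matches);
$count = count($matches[0]);
unset($matches[0], $matches[2]);

for ($i = 0; $i < $count; $i++) {
    // Condition 1: the URL contains "/d". Condition 2: the content contains "lorem" or "test".
    if (
        strpos($matches[1][$i], '/d') !== false
        && preg_match('#(lorem|test)#is', $matches[3][$i]) == true
    ) {
        echo $matches[1][$i];
    }
}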


