PHP - Parse HTML to retrieve href from an “a” tag that is inside another “a” tag

Question

I've been searching for hours (there shouldn't be any duplicate) and have tried many different approaches using both regular expressions (regex) and DOMDocument, without success.

What the non-standard HTML looks like:

<a class="SOMECLASS" href="javascript:__FUNCTION(SOME_HREF_INSIDE)" onclick="SOME_JS_FUNCTION();" id="SOME_ID" style="SOME_STYLE">
    <a href="SOME_URL_3">SOME TEXT</a>
</a>

The problem is that I'm trying to get the URL "SOME_URL_3", but whether I parse with regex or DOMDocument, the parsing stops as soon as it encounters the first href. Since the second "a" tag is nested inside the first one, the parser only sees them as a single tag.

I observed that browsers seem to separate the tags automatically when parsing, as follows:

Before:

<a href="SOME_URL">
    <a href="SOME_URL_2">
    </a>
</a>

After:

<a href="SOME_URL">
</a>
<a href="SOME_URL_2">
</a>

I've not been able to replicate this browser behavior using PHP.

The closest I have come to a working solution:

$dom = new DOMDocument();
@$dom->loadHTML($result);

foreach($dom->getElementsByTagName('a') as $link) { 
    $href_count = 0;
    $attrs = array();

    for ($i = 0; $i < $link->attributes->length; ++$i) {
        $node = $link->attributes->item($i);
        if ($node->nodeName == "href") {
            $attrs[$node->nodeName][$href_count] = $node->nodeValue;
            $href_count++;
            if ($href_count >= 2) {
                echo "A second href has been found";
            }
        }
    }

    echo "<pre>";
    var_dump($attrs);
    echo "</pre>";
}

Unfortunately, as you might expect, it doesn't work; otherwise I wouldn't be here asking for help.

Please don't hesitate to share your knowledge, any help or suggestion will be greatly appreciated!

Update

I forgot to specify in my initial question that the answer should still capture standard hrefs. My goal is to extend my current HTML parser so that it retrieves the URL from every href. My initial code used only RegEx, and I wasn't able to capture the second href from nested "a" tags. A perfect answer would capture both nested and standard hrefs. Brandon White's solution is perfect for nested hrefs only, but it would be resource-consuming to use two different RegEx patterns (nested/standard) and parse the entire HTML content twice. An ideal solution would be a single RegEx that captures both at the same time, if this is possible.
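As a rough illustration of the single-pass goal described here, the sketch below captures the href of every opening "a" tag in one preg_match_all() call, whether the tag is nested or not. The pattern, the variable names ($html, $hrefs, $urls), the extra standalone link in the sample, and the step that drops javascript: pseudo-URLs are all assumptions made for this example. It does not tell nested anchors apart from standard ones, and it only handles double-quoted href values.

```php
<?php
// Sample markup based on the question, plus one standard link (added for illustration).
$html = <<<'HTML'
<a class="SOMECLASS" href="javascript:__FUNCTION(SOME_HREF_INSIDE)" onclick="SOME_JS_FUNCTION();" id="SOME_ID" style="SOME_STYLE">
    <a href="SOME_URL_3">SOME TEXT</a>
</a>
<a href="SOME_URL_4">standalone</a>
HTML;

// One pass: capture the double-quoted href value of every opening <a ...> tag.
// [^>]*? keeps the search inside the current tag; \s avoids matching attributes
// such as data-href.
preg_match_all('/<a\b[^>]*?\shref\s*=\s*"([^"]*)"/i', $html, $matches);
$hrefs = $matches[1];

// Assumption: javascript: pseudo-URLs on the outer tag are not wanted.
$urls = array_values(array_filter($hrefs, function ($href) {
    return stripos($href, 'javascript:') !== 0;
}));

print_r($urls);
```

Because every opening tag is matched independently, the nesting never has to be detected, so one pattern covers both cases.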

Answer 1

You can do what you're asking with some fairly fancy RegEx. Using a negative lookahead and a little logic, you can extract the nested href location directly.

Example

$result = <<<HTML
<a href="SOME_URL">
    <a href="SOME_URL_2">
    </a>
</a>

<a href="SOME_URL3">
    <a href="SOME_URL_4">
    </a>
</a>

<a href="SOME_URL5">
</a>
<a href="SOME_URL_6">
</a>

HTML;

preg_match_all('/<a.*>(?!<\/a>)\s*<a.*href\s*=\s*"(.+)"/', $result, $matches);

var_dump($matches);

Explanation

RegEx is VERY handy in these tricky situations. Thankfully, there is no need for all of the logic you were attempting above. All you need is some knowledge of RegEx. A site I always recommend is RegExr. It is very helpful for analyzing and building working RegEx. In fact, here is a RegEx "Fiddle" of the example.

<a.*> matches the first (outer) anchor tag.
(?!<\/a>) is a negative lookahead: it checks that a closing anchor tag does NOT follow immediately, which ensures the match is a nested anchor.
\s* matches any whitespace between the two anchors.
<a.*href\s*=\s*"(.+)" matches the second anchor tag, allowing optional spaces around the = of the href attribute. The (.+) places the URL into a capturing group. With preg_match_all(), the captured URLs end up in the second element of the $matches array ($matches[1]). See the example output below.
Also notice that it doesn't extract the non-nested URLs, like those shown in your code example above.
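The pieces described above can be tried out with a small script. The sketch below is a variation on the pattern, not the exact one from the example: the greedy .* and (.+) are replaced with [^>]* and [^"]+ so that a match cannot run past the end of a tag or past the closing quote, and the lookahead is left out because the \s*<a part already requires a second opening tag. The variable name $nested is an assumption for this illustration.

```php
<?php
$result = <<<'HTML'
<a href="SOME_URL">
    <a href="SOME_URL_2">
    </a>
</a>

<a href="SOME_URL3">
    <a href="SOME_URL_4">
    </a>
</a>

<a href="SOME_URL5">
</a>
<a href="SOME_URL_6">
</a>
HTML;

// Outer opening tag, optional whitespace, then an inner opening tag whose
// double-quoted href value is captured in group 1.
preg_match_all('/<a\b[^>]*>\s*<a\b[^>]*\shref\s*=\s*"([^"]+)"/i', $result, $matches);

// $matches[0] holds the full matches; $matches[1] holds the captured URLs.
$nested = $matches[1];

print_r($nested);
```

Only the anchors that directly follow another opening anchor are captured, so the two standalone links at the end of the sample are skipped.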

Output of Code

Answer 2

I've been able to achieve my goal using the solution below:

$result = <<<HTML
<a href="SOME_URL">
    <a href="SOME_URL_2">
    </a>
</a>

<a href="SOME_URL3">
    <a href="SOME_URL_4">
    </a>
</a>

<a href="SOME_URL_5">
</a>
<a href="SOME_URL_6">
</a>

HTML;

$dom = new DOMDocument();
@$dom->loadHTML($result);


$output = array(); //Collected URLs

foreach($dom->getElementsByTagName('a') as $link) {

    $tag_html = $dom->saveHTML($link); //Get the tag's outer HTML, including any nested tags

    if (substr_count($tag_html, "href") > 1) { //If the tag's HTML contains more than one href attribute
        preg_match_all('/href="([^"]*)"/is', $tag_html, $link_output, PREG_SET_ORDER);
        $output[] = $link_output[1][1]; //Output second href (second match, first capture group)
    } else { //Not nested tag
        $output[] = $link->getAttribute('href'); //Output first href
    }
}

echo "<pre>".print_r($output, true)."</pre>"; //Pass true so print_r() returns the string instead of printing it

Output:

array
(
    [0] => SOME_URL_2
    [1] => SOME_URL_4
    [2] => SOME_URL_5
    [3] => SOME_URL_6
)

This solution works on entire HTML pages with mixed and/or nested content. It captures as many nested hrefs as needed while still capturing the hrefs of standard "a" tags.
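For comparison, here is a minimal DOM-only sketch that reads the href attribute of every "a" element returned by getElementsByTagName(), without serializing each tag or running a regex on it. It assumes that the outer URLs are wanted too, and that the parser keeps both the outer and the inner anchor as elements. Whether it keeps them nested or splits them into siblings should not change the result. The shortened sample and the names $html and $hrefs are made up for this illustration.

```php
<?php
$html = <<<'HTML'
<a href="SOME_URL">
    <a href="SOME_URL_2">
    </a>
</a>
<a href="SOME_URL_5">
</a>
HTML;

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$hrefs = array();
foreach ($dom->getElementsByTagName('a') as $link) {
    // Every anchor element is visited in document order, nested or not.
    if ($link->hasAttribute('href')) {
        $hrefs[] = $link->getAttribute('href');
    }
}

print_r($hrefs);
```

If the outer URLs are not wanted, they would have to be filtered out afterwards, for example by skipping values that start with javascript: as in the markup from the question.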

Answer 1: 1, 2015-10-19 03:27:16
Answer 2: 1, ACCEPTED, 2015-10-19 04:53:23
