
PHP - Parse HTML to retrieve the href from an “a” tag that is nested inside another “a” tag

I've been searching for hours (I could not find a duplicate) and have tried many different approaches using both regex (regular expressions) and DOMDocument, without success.

Here is what the non-standard HTML looks like:

<a class="SOMECLASS" href="javascript:__FUNCTION(SOME_HREF_INSIDE)" onclick="SOME_JS_FUNCTION();" id="SOME_ID" style="SOME_STYLE">
    <a href="SOME_URL_3">SOME TEXT</a>
</a>

The problem is that I'm trying to get the URL "SOME_URL_3", but whether I parse with regex or with DOMDocument, parsing stops as soon as it encounters the first href. Because the second "a" tag is nested inside the first one, the parser only sees them as a single tag.

I observed that browsers seem to separate the tags automatically when parsing, as follows:

Before:

<a href="SOME_URL">
    <a href="SOME_URL_2">
    </a>
</a>

After:

<a href="SOME_URL">
</a>
<a href="SOME_URL_2">
</a>

I have not been able to replicate this browser behavior in PHP.

The attempt that came closest to working:

$dom = new DOMDocument();
@$dom->loadHTML($result);

foreach($dom->getElementsByTagName('a') as $link) { 
    $href_count = 0;
    $attrs = array();

    for ($i = 0; $i < $link->attributes->length; ++$i) {
        $node = $link->attributes->item($i);
        if ($node->nodeName == "href") {
            $attrs[$node->nodeName][$href_count] = $node->nodeValue;
            $href_count++;
            if ($href_count >= 2) {
                echo "A second href has been found";
            }
        }
    }

    echo "<pre>";
    var_dump($attrs);
    echo "</pre>";
}

As you might expect, it unfortunately doesn't work; otherwise I wouldn't be here asking for help.

Please don't hesitate to share your knowledge, any help or suggestion will be greatly appreciated!


Update

I forgot to specify in my initial question that the answer should still capture standard hrefs. My goal is to extend my current HTML parser so that it retrieves the URL from every href. My initial code used only RegEx, and it could not capture the second href from nested "a" tags. A perfect answer would capture both nested and standard hrefs. Brandon White's solution is perfect for nested hrefs only, but using two different RegEx patterns (nested and standard) would mean parsing the entire HTML content twice, which is resource-consuming. An ideal solution would be a single RegEx that captures both at the same time, if that is possible.

You can actually do what you're asking with some fairly fancy RegEx. Using a negative lookahead and a little logic, you can extract the nested href directly.

Example

$result = <<<HTML
<a href="SOME_URL">
    <a href="SOME_URL_2">
    </a>
</a>

<a href="SOME_URL3">
    <a href="SOME_URL_4">
    </a>
</a>

<a href="SOME_URL5">
</a>
<a href="SOME_URL_6">
</a>

HTML;

preg_match_all('/<a.*>(?!<\/a>)\s*<a.*href\s*=\s*"(.+)"/', $result, $matches);

var_dump($matches);

Explanation

RegEx is VERY handy in these tricky situations. Thankfully, there is no need for all of the logic you were attempting above; all you need is a little logic and some knowledge of RegEx. A site I always recommend is RegExr. It is very helpful for analyzing and building working RegEx.

  • <a.*> This matches any first anchor tag.
  • (?!<\/a>) This is a negative lookahead, which checks that there is NOT a closing anchor tag following. This ensures it is a nested anchor match.
  • \s* Matches any possible white-space between the two anchors.
  • <a.*href\s*=\s*"(.+)" This matches the second anchor tag, allowing any spaces between the href attribute, the = sign, and the value. The (.+) also places the URL into a capturing group. With the preg_match_all() function, it will be the second row (index 1) of the $matches array. See the example output below.
  • Also notice that it doesn't extract the non-nested URLs like those shown in your code example above.
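To make the capturing-group point concrete, here is a minimal sketch of reading the nested URLs out of `$matches`. It assumes the same sample input as the example above. One change is an assumption on my part and not part of the original pattern: `[^"]+` is swapped in for `(.+)` so the capture stops at the closing quote even when other attributes follow on the same line. The variable name `$nested` is only illustrative.

```php
<?php
$result = <<<HTML
<a href="SOME_URL">
    <a href="SOME_URL_2">
    </a>
</a>

<a href="SOME_URL3">
    <a href="SOME_URL_4">
    </a>
</a>

<a href="SOME_URL5">
</a>
<a href="SOME_URL_6">
</a>
HTML;

// Same structure as the pattern above; only the capture group is tightened
// from (.+) to ([^"]+) (an assumption, see the note before this block).
preg_match_all('/<a.*>(?!<\/a>)\s*<a.*href\s*=\s*"([^"]+)"/', $result, $matches);

// $matches[0] holds the full matches, $matches[1] holds capture group 1.
$nested = $matches[1];

print_r($nested); // SOME_URL_2 and SOME_URL_4 for this sample input
```

Only the two nested URLs are returned here; the standalone anchors at the end of the sample are skipped because no second opening "a" tag follows them.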

Output of Code

[Image: output of the code example above]

I was able to achieve my goal using the solution below:

$result = <<<HTML
<a href="SOME_URL">
    <a href="SOME_URL_2">
    </a>
</a>

<a href="SOME_URL3">
    <a href="SOME_URL_4">
    </a>
</a>

<a href="SOME_URL_5">
</a>
<a href="SOME_URL_6">
</a>

HTML;

$dom = new DOMDocument();
@$dom->loadHTML($result); // @ suppresses warnings about the malformed markup

$output = array(); // Collected href values

foreach($dom->getElementsByTagName('a') as $link) {

    $tag_html = $dom->saveHTML($link); //Get the tag's HTML, including its children

    if (substr_count($tag_html, "href") > 1) { //If tag contains more than one href attribute
        preg_match_all('/href="([^"]*)"/is', $tag_html, $link_output, PREG_SET_ORDER);
        $output[] = $link_output[1][1]; //Output second href (second match, capture group 1)
    } else { //Not nested tag
        $output[] = $link->getAttribute('href'); //Output first href
    }
}

// Pass true so print_r() returns the string instead of printing it immediately
echo "<pre>".print_r($output, true)."</pre>";

Output:

Array
(
    [0] => SOME_URL_2
    [1] => SOME_URL_4
    [2] => SOME_URL_5
    [3] => SOME_URL_6
)

This solution works on entire HTML pages with mixed and/or nested content. It captures as many nested hrefs as needed while still capturing the hrefs of standard "a" tags.
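For the single-pass goal mentioned in the update, one possible alternative is to match each opening "a" tag on its own, so that nesting no longer matters. The following is only a sketch under stated assumptions, not the method used above: the helper name `extractAnchorHrefs` is hypothetical, the pattern assumes quoted attribute values with no ">" inside them, and unlike the solution above it returns the href of the outer tag as well as the inner one.

```php
<?php
// Hypothetical helper, not from the original posts.
function extractAnchorHrefs(string $html): array
{
    // Match every opening <a ...> tag independently of what it contains.
    // Group 1 is the quote character, group 2 is the href value.
    $pattern = '/<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1/is';
    preg_match_all($pattern, $html, $matches);

    return $matches[2];
}

$html = <<<HTML
<a class="SOMECLASS" href="javascript:__FUNCTION(SOME_HREF_INSIDE)" id="SOME_ID">
    <a href="SOME_URL_3">SOME TEXT</a>
</a>
<a href='SOME_URL_4'>plain link</a>
HTML;

$hrefs = extractAnchorHrefs($html);

print_r($hrefs);
```

If the outer `javascript:` values are unwanted, they can be filtered out of the returned array afterwards, which still keeps the HTML itself to a single pass.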

