
PHP - Parse HTML to retrieve the href from an "a" tag that is inside another "a" tag

I've been searching for hours (there shouldn't be any duplicate) and have tried many different approaches using both regex (regular expressions) and DOMDocument, without success.

This is what the non-standard HTML looks like:

<a class="SOMECLASS" href="javascript:__FUNCTION(SOME_HREF_INSIDE)" onclick="SOME_JS_FUNCTION();" id="SOME_ID" style="SOME_STYLE">
    <a href="SOME_URL_3">SOME TEXT</a>
</a>

Now the problem is that I'm trying to get the URL "SOME_URL_3", and whether I parse with regex or DOMDocument, the parsing stops as soon as it encounters the first href. Of course, as the second "a" tag is part of the first one, the parser only sees them as one tag.

I observed that browsers seem to separate the tags automatically when parsing, as follows:

Before:

<a href="SOME_URL">
    <a href="SOME_URL_2">
    </a>
</a>

After:

<a href="SOME_URL">
</a>
<a href="SOME_URL_2">
</a>

I haven't been able to replicate this browser behavior using PHP.

This is the attempt that came closest to working:

$dom = new DOMDocument();
@$dom->loadHTML($result);

foreach($dom->getElementsByTagName('a') as $link) { 
    $href_count = 0;
    $attrs = array();

    for ($i = 0; $i < $link->attributes->length; ++$i) {
        $node = $link->attributes->item($i);
        if ($node->nodeName == "href") {
            $attrs[$node->nodeName][$href_count] = $node->nodeValue;
            $href_count++;
            if ($href_count >= 2) {
                echo "A second href has been found";
            }
        }
    }

    echo "<pre>";
    var_dump($attrs);
    echo "</pre>";
}

As you may expect, it unfortunately doesn't work; otherwise I wouldn't be here asking for help...

Please don't hesitate to share your knowledge. Any help or suggestion will be greatly appreciated!


Update

I forgot to specify in my initial question that the answer should still capture standard hrefs. My goal is to extend my existing HTML parser so that it retrieves the URL from every href. My initial code used only regex and could not capture the second href from nested "a" tags. A perfect answer would capture both nested and standard hrefs. Brandon White's solution works well for nested hrefs only, but running two different regexes (nested and standard) would mean parsing the entire HTML content twice, which wastes resources. An ideal solution would be a single regex that captures both at once, if that is possible.

You can actually do what you're asking with a fairly fancy regex. Using a negative lookahead and some logic, you can extract the nested href location directly.

Example

$result = <<<HTML
<a href="SOME_URL">
    <a href="SOME_URL_2">
    </a>
</a>

<a href="SOME_URL3">
    <a href="SOME_URL_4">
    </a>
</a>

<a href="SOME_URL5">
</a>
<a href="SOME_URL_6">
</a>

HTML;

preg_match_all('/<a.*>(?!<\/a>)\s*<a.*href\s*=\s*"(.+)"/', $result, $matches);

var_dump($matches);

Explanation

Regex is VERY handy in these tricky situations. Thankfully, there is no need for all of the logic you were attempting above. All you need is some logic and knowledge of regex. A site I always recommend is RegExr. It is very helpful for analyzing and building working regexes. In fact, here is a RegEx "Fiddle" of the example.

  • <a.*> This matches any first anchor tag.
  • (?!<\/a>) This is a negative lookahead, which checks that a closing anchor tag does NOT immediately follow the first tag. Together with the rest of the pattern, this restricts matches to nested anchors.
  • \s* Matches any possible white-space between the two anchors.
  • <a.*href\s*=\s*"(.+)" This matches the second anchor tag, allowing any spaces between the href attribute, the = and the value. Also, the (.+) places the URL into a capturing group. With the preg_match_all() function, the captured URLs end up in the second row of the $matches array, $matches[1]. See the example output below.
  • Also note that it does not extract the non-nested URLs, such as those in the last two anchors of the example above.
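One caveat with the pattern above: `.*` and `.+` are greedy, so if several anchors share a single line, the match can run past the first tag and skip URLs. The following is only a sketch of a tighter variant, not part of the original answer. The function name `nestedAnchorHrefs` is hypothetical, and it assumes href values are double-quoted and that no attribute value contains a `>` character.

```php
<?php
// Hypothetical helper (not from the original answer): returns the href of every
// <a> tag that opens directly inside another <a> tag.
function nestedAnchorHrefs(string $html): array
{
    // [^>]* cannot leave the current tag and [^"]+ cannot leave the current
    // attribute value, so the match stays local even when the HTML is on one line.
    $pattern = '/<a\b[^>]*>\s*<a\b[^>]*?\bhref\s*=\s*"([^"]+)"/i';
    preg_match_all($pattern, $html, $matches);

    return $matches[1]; // capture group 1 holds the inner href values
}

// The same kind of markup as in the example, but written on a single line.
$oneLine = '<a href="SOME_URL"><a href="SOME_URL_2"></a></a> '
         . '<a href="SOME_URL3"><a href="SOME_URL_4"></a></a> '
         . '<a href="SOME_URL5"></a>';

print_r(nestedAnchorHrefs($oneLine));
```

Like the original pattern, this only returns the nested URLs; the standalone SOME_URL5 is ignored.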

Output of Code

[Image: output of the code example above]

I've been able to achieve my goal using the solution below:

$result = <<<HTML
<a href="SOME_URL">
    <a href="SOME_URL_2">
    </a>
</a>

<a href="SOME_URL3">
    <a href="SOME_URL_4">
    </a>
</a>

<a href="SOME_URL_5">
</a>
<a href="SOME_URL_6">
</a>

HTML;

$dom = new DOMDocument();
@$dom->loadHTML($result); // @ suppresses the warnings raised by the invalid markup

$output = array(); // Initialise the result list before appending to it

foreach($dom->getElementsByTagName('a') as $link) {

    $tag_html = $dom->saveHTML($link); //Get the tag's outer HTML, including any nested tags

    if (substr_count($tag_html, "href") > 1) { //If the tag's HTML contains more than one href
        preg_match_all('/href="([^"]*)"/is', $tag_html, $link_output, PREG_SET_ORDER);
        $output[] = $link_output[1][1]; //Output second href
    } else { //Not nested tag
        $output[] = $link->getAttribute('href'); //Output first href
    }
}

echo "<pre>".print_r($output, true)."</pre>"; //Pass true so print_r() returns the string instead of printing it

Output:

array
(
    [0] => SOME_URL_2
    [1] => SOME_URL_4
    [2] => SOME_URL_5
    [3] => SOME_URL_6
)

This solution works on entire HTML pages with mixed and/or nested content. It captures as many nested hrefs as needed while still capturing the hrefs of standard "a" tags.
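For comparison, here is a rough sketch of a way to collect both nested and standard hrefs in a single pass without DOMDocument. It is not taken from either post above. The function name `innermostAnchorHrefs` is hypothetical, and it assumes double-quoted href values and no `>` inside attribute values. It tokenises the opening and closing "a" tags with one regex, tracks the open ones on a stack, and keeps the href of every anchor that does not contain another anchor.

```php
<?php
// Hypothetical helper (not from the original posts): one pass over the markup,
// returning the href of every <a> that does not itself contain another <a>.
function innermostAnchorHrefs(string $html): array
{
    // Tokenise into opening <a ...> and closing </a> tags, in document order.
    // The lookahead keeps tags such as <abbr> from being mistaken for anchors.
    preg_match_all('/<(\/?)a(?=[\s>\/])([^>]*)>/i', $html, $tags, PREG_SET_ORDER);

    $stack = []; // anchors that are currently open
    $hrefs = [];

    foreach ($tags as $tag) {
        if ($tag[1] === '') { // opening tag
            if ($stack) {
                // The enclosing anchor now has a nested anchor; skip it later.
                $stack[count($stack) - 1]['hasChild'] = true;
            }
            $found = preg_match('/(?<![\w-])href\s*=\s*"([^"]*)"/i', $tag[2], $m);
            $stack[] = ['href' => $found ? $m[1] : null, 'hasChild' => false];
        } elseif ($stack) { // closing tag that matches an open anchor
            $anchor = array_pop($stack);
            if (!$anchor['hasChild'] && $anchor['href'] !== null) {
                $hrefs[] = $anchor['href'];
            }
        }
    }

    // Anchors that were never closed still count if nothing was nested in them.
    foreach ($stack as $anchor) {
        if (!$anchor['hasChild'] && $anchor['href'] !== null) {
            $hrefs[] = $anchor['href'];
        }
    }

    return $hrefs;
}

$sample = <<<HTML
<a href="SOME_URL">
    <a href="SOME_URL_2">
    </a>
</a>
<a href="SOME_URL_5">
</a>
HTML;

print_r(innermostAnchorHrefs($sample));
```

The returned values are the raw attribute text, so HTML entities such as `&amp;` are not decoded; html_entity_decode() could be applied afterwards if that matters.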
