How to get the href and text content from HTML content

Question

I want to get the text content and the URL of each row, along with all the other td data.

My code:

$context = stream_context_create(
    array(
        "http" => array(
            "header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
        )
    )
);

$htmlContent = file_get_contents("https://www.iana.org/domains/root/db", false, $context);
    
$DOM = new DOMDocument();
$DOM->loadHTML($htmlContent);

$FirstdTable = $DOM->getElementsByTagName('table')->item(0);


$Header = $FirstdTable->getElementsByTagName('th');
$Detail = $FirstdTable->getElementsByTagName('td');

//#Get header name of the table
foreach($Header as $NodeHeader) 
{
    $aDataTableHeaderHTML[] = trim($NodeHeader->textContent);
}

//#Get row data/detail table without header name as key
$i = 0;
$j = 0;
foreach($Detail as $sNodeDetail)
{
   
    $aDataTableDetailHTML[$j][] = trim($sNodeDetail->textContent);
    $i = $i + 1;
    $j = $i % count($aDataTableHeaderHTML) == 0 ? $j + 1 : $j;
}

Current output:

Array
(
    [0] => Array
        (
            [0] => .aaa
            [1] => generic
            [2] => American Automobile Association, Inc.
        )

    [1] => Array
        (
            [0] => .aarp
            [1] => generic
            [2] => AARP
        )

    [2] => Array
        (
            [0] => .abarth
            [1] => generic
            [2] => Fiat Chrysler Automobiles N.V.
        )
)

What I want instead:

Array
(
    [0] => Array
        (
            [0] => .aaa
            [1] => generic
            [2] => American Automobile Association, Inc.
            [3] => https://www.iana.org/domains/root/db/aaa.html
        )

    [1] => Array
        (
            [0] => .aarp
            [1] => generic
            [2] => AARP
            [3] => https://www.iana.org/domains/root/db/aarp.html
        )

    [2] => Array
        (
            [0] => .abarth
            [1] => generic
            [2] => Fiat Chrysler Automobiles N.V.
            [3] => https://www.iana.org/domains/root/db/abarth.html
        )
)

Answer 1

Currently, you're only reading the text content of each <td>. That doesn't include the link, because the URL is stored in the href attribute of the anchor tag, not in its text. To get it, you'll need to dig deeper into the <td>.
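To see the difference on a small scale, here is a minimal sketch. The inline markup is a made-up stand-in for one row of the IANA table, not the real page, and the variable names are only illustrative. It shows that textContent yields the visible text, while the href has to be read from the nested <a> element.

```php
<?php
// Hypothetical one-row table standing in for the real IANA page.
$html = '<table><tr>'
      . '<td><span class="domain tld"><a href="/domains/root/db/aaa.html">.aaa</a></span></td>'
      . '<td>generic</td>'
      . '</tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$firstCell = $dom->getElementsByTagName('td')->item(0);

// textContent concatenates text nodes only; attributes are not part of it.
$text = trim($firstCell->textContent);

// The href must be read from the <a> element inside the cell.
$anchor = $firstCell->getElementsByTagName('a')->item(0);
$href = $anchor !== null ? $anchor->getAttribute('href') : '';

echo $text, "\n"; // the cell text
echo $href, "\n"; // the link target
```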

Here's one way to do it using XPath:

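$xpath = new DOMXPath($DOM);
$base = 'https://www.iana.org';
foreach ($Detail as $sNodeDetail)
{
    $aDataTableDetailHTML[$j][] = trim($sNodeDetail->textContent);
    // Evaluated relative to the current <td>; yields '' when the cell has no linked span.
    // Note: the URL is appended right after the domain cell, so it lands at index 1 of the row.
    if ($link = $xpath->evaluate("string(./span[contains(@class, 'domain')]/a/@href)", $sNodeDetail)) {
        // Join with exactly one slash, whether or not the href starts with "/"
        $aDataTableDetailHTML[$j][] = $base . '/' . ltrim($link, '/');
    }
    $i = $i + 1;
    $j = $i % count($aDataTableHeaderHTML) == 0 ? $j + 1 : $j;
}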

Basically, when the current <td> in the iteration contains <span class="domain tld"><a href="xxxx">xxx</a></span>, the query extracts the href value. For any other <td> it returns an empty string, so nothing is appended.
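The behavior of that query can be checked in isolation. The sketch below runs it against a made-up two-cell row (an assumption for illustration, not the live IANA markup) and collects one result per <td>.

```php
<?php
// Hypothetical row: only the first cell contains the linked span.
$html = '<table><tr>'
      . '<td><span class="domain tld"><a href="/domains/root/db/aarp.html">.aarp</a></span></td>'
      . '<td>generic</td>'
      . '</tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$query = "string(./span[contains(@class, 'domain')]/a/@href)";
$results = [];
foreach ($dom->getElementsByTagName('td') as $cell) {
    // Passing $cell as the context node makes "." refer to the current <td>.
    $results[] = $xpath->evaluate($query, $cell);
}

// First entry holds the href; second is '' because that cell has no link.
var_dump($results);
```

Because string() of an empty node set is an empty string, the if ($link = ...) check in the loop is enough to skip cells without a link.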

Another way is to iterate over each <tr> instead of each <td>:

$aDataTableDetailHTML = [];
$DOM = new DOMDocument();
$DOM->loadHTML($htmlContent);
$xpath = new DOMXPath($DOM);
$base = 'https://www.iana.org';
// One iteration per table row; each <td> is addressed by its position
foreach ($xpath->query('//table[@id="tld-table"]/tbody/tr') as $row) {
    $domain = trim($xpath->evaluate("string(./td[1])", $row));
    $type = trim($xpath->evaluate("string(./td[2])", $row));
    $tld_manager = trim($xpath->evaluate("string(./td[3])", $row));
    $url = $xpath->evaluate("string(./td[1]/span/a/@href)", $row);
    // Join with exactly one slash, whether or not the href starts with "/"
    $aDataTableDetailHTML[] = [$domain, $type, $tld_manager, $base . '/' . ltrim($url, '/')];
}
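The row-based approach can also be wrapped in a function and tried without a network request. This is only a sketch: extractTldRows is a hypothetical helper name, the sample markup is invented to mirror the structure the query expects, and the URL join assumes the hrefs on the page are root-relative. libxml_use_internal_errors() is used so that real-world HTML that libxml considers invalid does not flood the output with warnings.

```php
<?php
/**
 * Hypothetical helper: returns one [domain, type, manager, url] array per row.
 */
function extractTldRows(string $html, string $base): array
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $rows = [];
    foreach ($xpath->query('//table[@id="tld-table"]/tbody/tr') as $row) {
        $href = $xpath->evaluate('string(./td[1]//a/@href)', $row);
        $rows[] = [
            trim($xpath->evaluate('string(./td[1])', $row)),
            trim($xpath->evaluate('string(./td[2])', $row)),
            trim($xpath->evaluate('string(./td[3])', $row)),
            // Exactly one slash between base and path; empty when the row has no link.
            $href === '' ? '' : rtrim($base, '/') . '/' . ltrim($href, '/'),
        ];
    }
    return $rows;
}

// Made-up sample markup with the same table id and cell layout.
$sample = '<table id="tld-table"><tbody>'
        . '<tr><td><span class="domain tld"><a href="/domains/root/db/aaa.html">.aaa</a></span></td>'
        . '<td>generic</td><td>American Automobile Association, Inc.</td></tr>'
        . '</tbody></table>';

print_r(extractTldRows($sample, 'https://www.iana.org/'));
```

With this layout the URL always lands in the fourth slot of each row, which matches the array shape asked for in the question.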

Answer 1 was accepted on 2020-07-16 08:05:06.
