
Pull data from second HTML table using DOM, ignore first table

I have the PHP script below, which I run from the command prompt. It works fine if there is only one table on a page, but if there are two tables it only pulls out the first one. Is there a way, in certain instances, to ignore the first table and process only the second?

I have no control over the HTML, so I can't target the table by an ID.

HTML

<html>
<head>
...
</head>
<body>
    <table>
        <tr>
            <th>Problem Table</th>
        </tr>
        <tr>
            <td>Annoying table in the way!</td>
        </tr>
    </table>
    <hr/>
    <table>
        <tr>
            <th>ID</th>
            <th>Asset</th>
        </tr>
        <tr>
            <td>34234234</td>
            <td>Website3</td>
        </tr>
        <tr>
            <td>34234234</td>
            <td>Website4</td>
        </tr>
    </table>
</body>
</html>

PHP

$dom = new DOMDocument();
$html = $dom->loadHTMLFile($url);

$dom->preserveWhiteSpace = false;

$tables = $dom->getElementsByTagName('table');
$rows = $tables->item(0)->getElementsByTagName('tr');
$cols = $rows->item(0)->getElementsByTagName('th');
$row_headers = null;

foreach($cols AS $node) {
    $row_headers[] = $node->nodeValue;
}

$table = array();
$rows = $tables->item(0)->getElementsByTagName('tr');
foreach($rows AS $row) {
    $cols = $row->getElementsByTagName('td');
    $row = array();
    $i = 0;
    foreach($cols AS $node) {
        if ($row_headers != null) {
            $row[$row_headers[$i]] = $node->nodeValue;
        }
        $i++;
    }
    if (!empty($row)) {
        $table[] = $row;
    }
}

I agree with @GCC404 that you should target your elements more precisely using an ID or class, since relying on position can easily lead to mistakes.

However, if you specifically want to target the last table, replace the 0 with the number of items found minus 1:

$rows = $tables->item( $tables->length - 1 )->getElementsByTagName('tr');
// etc.
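
A minimal sketch of that idea, using an inline HTML string in place of `$url` (an assumption made only to keep the example self-contained). It also checks `$tables->length` first, because `item()` returns `null` for an index that does not exist:

```php
<?php
// Inline markup stands in for the remote page (assumption for illustration).
$html = '<html><body>'
      . '<table><tr><th>Problem Table</th></tr></table>'
      . '<table><tr><th>ID</th><th>Asset</th></tr>'
      . '<tr><td>34234234</td><td>Website3</td></tr></table>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$tables = $dom->getElementsByTagName('table');

$firstHeader = null;
if ($tables->length > 0) {
    // The last table sits at index length - 1.
    $last = $tables->item($tables->length - 1);
    $rows = $last->getElementsByTagName('tr');
    // First <th> of the first row of the last table.
    $firstHeader = $rows->item(0)->getElementsByTagName('th')->item(0)->nodeValue;
}

echo $firstHeader, PHP_EOL;
```

With the two-table markup above, `$firstHeader` holds the text of the first header cell of the second (last) table rather than "Problem Table".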

With getElementsByTagName(), you can pick a specific element by index using DOMNodeList::item(). Indexes start at 0, so the second table is item(1).

Use this only when you have no control over the source HTML, or when you are sure the page will always contain two tables. If you do control the HTML, I'd recommend giving each table an id or class instead.
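
When there is no id or class to rely on, one alternative (not part of this answer, mentioned only as an option) is to select the table by its content with `DOMXPath`. The sketch below assumes the wanted table is the one that has a header cell reading `ID`; that assumption and the inline markup are illustrative only.

```php
<?php
// Inline markup stands in for the remote page (assumption for illustration).
$html = '<html><body>'
      . '<table><tr><th>Problem Table</th></tr>'
      . '<tr><td>Annoying table in the way!</td></tr></table>'
      . '<table><tr><th>ID</th><th>Asset</th></tr>'
      . '<tr><td>34234234</td><td>Website3</td></tr>'
      . '<tr><td>34234234</td><td>Website4</td></tr></table>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
// Select every table that contains a <th> whose text is exactly "ID".
$matches = $xpath->query('//table[.//th[normalize-space(.) = "ID"]]');

$assets = array();
if ($matches !== false && $matches->length > 0) {
    $wanted = $matches->item(0);
    // Collect the second cell of each data row, relative to the matched table.
    foreach ($xpath->query('.//tr[td]/td[2]', $wanted) as $cell) {
        $assets[] = $cell->nodeValue;
    }
}

print_r($assets);
```

This keeps working if another table is later added before the one you want, which an index such as `item(1)` would not.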

$dom = new DOMDocument();
$html = $dom->loadHTMLFile($url);

$dom->preserveWhiteSpace = false;

$tables = $dom->getElementsByTagName('table');
$rows = $tables->item(1)->getElementsByTagName('tr');
// The header cells are in the first row of that table, so use row 0.
$cols = $rows->item(0)->getElementsByTagName('th');
$row_headers = null;

foreach($cols AS $node) {
    $row_headers[] = $node->nodeValue;
}

$table = array();
$rows = $tables->item(1)->getElementsByTagName('tr');
foreach($rows AS $row) {
    $cols = $row->getElementsByTagName('td');
    $row = array();
    $i = 0;
    foreach($cols AS $node) {
        if ($row_headers != null) {
            $row[$row_headers[$i]] = $node->nodeValue;
        }
        $i++;
    }
    if (!empty($row)) {
        $table[] = $row;
    }
}
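
The same steps can be wrapped in a function that takes the table index as a parameter. The name `tableToArray` and the inline markup are made up for this sketch. It returns an empty array when the requested table does not exist, instead of failing on a `null` node.

```php
<?php
/**
 * Convert the table at $index (zero-based) into a list of rows keyed by header text.
 * Hypothetical helper, not part of the DOM extension.
 */
function tableToArray(DOMDocument $dom, $index)
{
    $table = $dom->getElementsByTagName('table')->item($index);
    if ($table === null) {
        return array(); // no table at that position
    }

    $rows = $table->getElementsByTagName('tr');
    if ($rows->length === 0) {
        return array();
    }

    // Header names come from the <th> cells of the first row.
    $headers = array();
    foreach ($rows->item(0)->getElementsByTagName('th') as $th) {
        $headers[] = trim($th->nodeValue);
    }

    $result = array();
    foreach ($rows as $tr) {
        $row = array();
        $i = 0;
        foreach ($tr->getElementsByTagName('td') as $td) {
            // Fall back to the numeric position if there is no matching header.
            $key = isset($headers[$i]) ? $headers[$i] : $i;
            $row[$key] = $td->nodeValue;
            $i++;
        }
        if (!empty($row)) {
            $result[] = $row; // rows without <td> cells (the header row) are skipped
        }
    }
    return $result;
}

// Inline markup stands in for the remote page (assumption for illustration).
$html = '<html><body>'
      . '<table><tr><th>Problem Table</th></tr>'
      . '<tr><td>Annoying table in the way!</td></tr></table>'
      . '<table><tr><th>ID</th><th>Asset</th></tr>'
      . '<tr><td>34234234</td><td>Website3</td></tr>'
      . '<tr><td>34234234</td><td>Website4</td></tr></table>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$data = tableToArray($dom, 1); // second table
print_r($data);
```

Passing `0` instead of `1` processes the first table, so the caller can decide per page which table to read.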
