
PHP's DOMXPath fails to return the complete set of child nodes

I have nine table rows, but only three are returned when I query the table with DOMXPath.

<table class="something">
    <tbody>
        <tr>
            <td class="label">One</td>
            <td>111111</td>
        </tr>
        <tr>
            <td class="label">Two</td>
            <td>1454</td>
        </tr>    
        <tr>
            <td class="label">Three</td>
            <td></td>
        </tr>
        <tr>
            <td class="label">Four</td>
            <td>0</td>
        </tr>
        <tr>
            <td class="label">Five</td>
            <td>45</td>
        </tr>
        <tr>
            <td class="label">Six</td>
            <td>45</td>
        </tr>
        <tr>
            <td class="label">Seven</td>
            <td>5</td>
        </tr>
        <tr>
            <td class="label">Eight</td>
            <td>0</td>
        </tr>
        <tr>
            <td class="label">Nine</td>
            <td>0</td>
        </tr>
    </tbody>
</table>

I use DOMDocument to load the HTML.

$doc = new DOMDocument;
@$doc->loadHTML($htmlData);
$xpath = new DOMXpath($doc);
$tableRows = $xpath->query('//table[@class="something"]//tr');

Unfortunately, the complete set of table rows is not being returned, only the first three. I'm guessing that the empty element <td></td> is somehow throwing off the XPath parser. Is there a solution to this?

EDIT:

I'm trying another approach without using DOMXpath.

    $request = drupal_http_request($url);

    $data = $request->data;

    $doc = new DOMDocument;
    @$doc->loadHTML($data);
    $tables = $doc->getElementsByTagName('table');
    $rows = $tables->item(2)->getElementsByTagName('tr');
    $output = '';
    foreach($rows as $row) {
        $cols = $row->getElementsByTagName('td');
        foreach($cols as $col){
            $output .= $col->nodeValue . '<br/>';
        }
    }
    return $output;

Both approaches output this HTML:

<div class="content">
    One<br>111111<br>Two<br>1454<br>Three<br><br>
</div>

In the first example, $tableRows->length is 3, which is consistent with the output but not with the markup, which has nine rows.
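One way to check the guess about the empty <td></td> is to run the same query against a small, clean copy of the markup. The sketch below assumes a hypothetical reduced table (the $htmlData string is made up for illustration, not the scraped page) and only shows that an empty cell in otherwise well-formed markup does not stop the query:

```php
<?php
// Hypothetical reduced markup: four rows, the second with an empty <td></td>.
$htmlData = '<table class="something"><tbody>'
    . '<tr><td class="label">Two</td><td>1454</td></tr>'
    . '<tr><td class="label">Three</td><td></td></tr>'
    . '<tr><td class="label">Four</td><td>0</td></tr>'
    . '<tr><td class="label">Five</td><td>45</td></tr>'
    . '</tbody></table>';

$doc = new DOMDocument;
$doc->loadHTML($htmlData);
$xpath = new DOMXPath($doc);
$tableRows = $xpath->query('//table[@class="something"]//tr');

// Rows after the empty cell are still matched.
echo $tableRows->length, "\n"; // prints 4
```

If this holds for the clean copy but not for the real page, the cause is more likely elsewhere in the scraped HTML than in the empty cell itself.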

The page I'm scraping has invalid, malformed HTML, and DOMDocument apparently expects clean, well-formed markup (I guess). I switched to the simple_html_dom.php script to parse the HTML, and it works fine.
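For anyone who wants to see what DOMDocument objects to before switching parsers, libxml's errors can be collected instead of being silenced with @. This is a minimal sketch, assuming a hypothetical dirty input string in $data (not the actual page), and it only reports what the parser complains about; it does not repair anything:

```php
<?php
// Hypothetical dirty input: a stray closing tag inside the table.
$data = '<table><tr><td>One</td></div></tr><tr><td>Two</td></tr></table>';

// Collect libxml's parse errors instead of hiding them with @.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadHTML($data);

$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the earlier setting

foreach ($errors as $error) {
    // Each entry is a LibXMLError with line, column and message properties.
    printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
}

// Compare the number of rows the parser kept with what the markup contains.
echo $doc->getElementsByTagName('tr')->length, " rows parsed\n";
```

The reported line and column numbers can then be compared against the raw response to find the spot where the parsed tree stops matching the markup.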
