I have nine table rows, but only three are returned when I query the top-level node with DOMXPath.
<table class="something">
<tbody>
<tr>
<td class="label">One</td>
<td>111111</td>
</tr>
<tr>
<td class="label">Two</td>
<td>1454</td>
</tr>
<tr>
<td class="label">Three</td>
<td></td>
</tr>
<tr>
<td class="label">Four</td>
<td>0</td>
</tr>
<tr>
<td class="label">Five</td>
<td>45</td>
</tr>
<tr>
<td class="label">Six</td>
<td>45</td>
</tr>
<tr>
<td class="label">Seven</td>
<td>5</td>
</tr>
<tr>
<td class="label">Eight</td>
<td>0</td>
</tr>
<tr>
<td class="label">Nine</td>
<td>0</td>
</tr>
</tbody>
</table>
I use DOMDocument to load the HTML.
$doc = new DOMDocument;
@$doc->loadHTML($htmlData);
$xpath = new DOMXpath($doc);
$tableRows = $xpath->query('//table[@class="something"]//tr');
Unfortunately, the complete set of table rows is not returned, only the first three. I'm guessing that the empty element <td></td> is somehow throwing off the XPath parser. Is there a solution to this?
EDIT:
I'm trying another approach without using DOMXpath.
$request = drupal_http_request($url);
$data = $request->data;
$doc = new DOMDocument;
@$doc->loadHTML($data);
$tables = $doc->getElementsByTagName('table');
$rows = $tables->item(2)->getElementsByTagName('tr');
$output = '';
foreach ($rows as $row) {
    $cols = $row->getElementsByTagName('td');
    foreach ($cols as $col) {
        $output .= $col->nodeValue . '<br/>';
    }
}
return $output;
Both approaches output this HTML:
<div class="content">
One<br>111111<br>Two<br>1454<br>Three<br><br>
</div>
In the first example, $tableRows->length is 3, which is consistent with the output but not with the markup, which has 9 rows.
I'm scraping a webpage that has invalid, corrupted HTML, and DOMDocument apparently expects clean, well-formed markup. I'm now using the simple_html_dom.php script to parse the HTML instead, and it works fine.
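For anyone who wants to see what DOMDocument is objecting to before switching parsers, here is a minimal sketch. The `@` in front of `loadHTML` in the snippets above hides libxml's warnings; `libxml_use_internal_errors()` and `libxml_get_errors()` collect them instead. The `$htmlData` string below is a made-up sample (the real page's markup is not shown here), with a stray closing tag standing in for the dirty input and an empty `<td></td>` to check that an empty cell alone does not drop rows.

```php
<?php
// Hypothetical sample input: the stray </span> only stands in for the
// "dirty" markup of the real page, which is not shown in the question.
$htmlData = '<table class="something"><tbody>'
    . '<tr><td class="label">One</td><td>111111</td></tr>'
    . '</span>'
    . '<tr><td class="label">Two</td><td></td></tr>'
    . '</tbody></table>';

// Collect libxml warnings in memory instead of silencing them with @.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadHTML($htmlData);

// Fetch whatever the parser complained about, then restore the old setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    // Each entry is a LibXMLError carrying the position and the message.
    printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
}

// Same query as in the question, so the row count can be compared
// against the number of <tr> elements in the source.
$xpath = new DOMXPath($doc);
$tableRows = $xpath->query('//table[@class="something"]//tr');
echo $tableRows->length, " rows found\n";
```

Which messages appear, if any, depends on the installed libxml2 version, so treat the list as a hint about where the markup breaks rather than a complete report.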