I am using XPath to scrape a web page.
The structure is:
<table>
<tbody>
<tr>
<th>
<td>
But some of the tr elements contain just one th or one td:
<table>
<tbody>
<tr>
<th>
So I only want to scrape a tr if it contains two tags. I am using the path
$route = $path->query("//table[count(tr) > 1]//tr/th");
or
$route = $path->query("//table[count(tr) > 1]//tr/td");
But it's not working.
Here is the link to the original tables. The last two tr elements of the first table have just one td each, and that is causing the problem. The 2nd and 3rd tables have the same issue.
https://www.daiwahouse.co.jp/mansion/kanto/tokyo/y35/gaiyo.html
$route = $path->query("//tr[count(*) >= 2]/th");
foreach ($route as $th){
$property[] = trim($th->nodeValue);
}
$route = $path->query("//tr[count(*) >= 2]/td");
foreach ($route as $td){
$value[] = trim($td->nodeValue);
}
I am trying to select th and td at the same time, but a tr that contains only one td causes a problem: in the end the td count and the th count are not the same, because I am scraping more td elements than th elements.
This XPath,
//table[count(.//tr) > 1]//th
will select all th elements within all table elements that have more than one tr descendant (regardless of whether tbody is present).
This XPath,
//tr[count(*) > 1]/*
will select all children of tr elements with more than one child.
This XPath,
//tr[count(th) = count(td)]/*
will select all children of tr elements where the number of th children equals the number of td children.
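To keep the th and td arrays the same length, one option is to select the matching rows first and then read both cells relative to each row. The following is only a minimal sketch: the markup is a made-up stand-in for the real page, and it assumes each kept row has exactly one th and one td.

```php
<?php
// Hypothetical sample markup standing in for the real page.
$html = '<table><tbody>'
      . '<tr><th>Name</th><td>Example Tower</td></tr>'
      . '<tr><th>Address</th><td>Tokyo</td></tr>'
      . '<tr><td>Footnote spanning the whole row</td></tr>'
      . '</tbody></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$property = [];
$value = [];

// Keep only rows whose th count equals their td count, then read
// both cells relative to the same row so the arrays stay aligned.
foreach ($xpath->query('//tr[count(th) = count(td)]') as $tr) {
    $th = $xpath->query('th', $tr)->item(0);
    $td = $xpath->query('td', $tr)->item(0);
    if ($th === null || $td === null) {
        continue; // a row with neither th nor td also satisfies 0 = 0
    }
    $property[] = trim($th->nodeValue);
    $value[] = trim($td->nodeValue);
}

print_r(array_combine($property, $value));
```

Because both cells come from the same tr, a row with a lone td can never add a value without a matching property.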
The OP posted a link to the site. Its root element declares xmlns="http://www.w3.org/1999/xhtml", so the elements are in the XHTML namespace.
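Whether the namespace matters depends on how the page is parsed. As far as I know, DOMDocument::loadHTML() ignores it, so unprefixed paths work there; if the page is parsed as XML instead, the names need a registered prefix. A small sketch of the XML case, with a made-up snippet and an arbitrary prefix "x":

```php
<?php
// Hypothetical XHTML snippet; the real page is much larger.
$xml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><table>'
     . '<tr><th>Name</th><td>Example Tower</td></tr>'
     . '<tr><td>Footnote</td></tr>'
     . '</table></body></html>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Unprefixed names in XPath 1.0 match only elements in no namespace,
// so this query finds nothing in a namespaced document.
$unprefixed = $xpath->query('//tr[count(*) >= 2]/th')->length;

// "x" is an arbitrary prefix chosen for this sketch.
$xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');
$prefixed = $xpath->query('//x:tr[count(*) >= 2]/x:th')->length;

echo $unprefixed, ' ', $prefixed, PHP_EOL;
```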
If I understand correctly, you want th elements in tr elements that contain two elements? I think that this is what you need:
//th[count(../*) = 2]
I've included a more explicit path in my answer, using the union operator (|) to count th and td elements together:
$html = '
<html>
<body>
<table>
<tbody>
<tr>
<th>I am Included</th>
<td>I am a column</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<th>I am ignored</th>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<th>I am also Included</th>
<td>I am a column</td>
</tr>
</tbody>
</table>
</body>
</html>
';
$doc = new DOMDocument();
$doc->loadHTML( $html );
$xpath = new DOMXPath( $doc );
$result = $xpath->query("//table[ count( tbody/tr/td | tbody/tr/th ) > 1 ]/tbody/tr");
foreach( $result as $node )
{
var_dump( $doc->saveHTML( $node ) );
}
// string(88) "<tr><th>I am Included</th><td>I am a column</td></tr>"
// string(93) "<tr><th>I am also Included</th><td>I am a column</td></tr>"
You can also use this for descendants at any depth:
//table[ count( descendant::td | descendant::th ) > 1]//tr
Change the XPath after the condition (the square-bracketed part) to change what is returned.
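One thing to watch: a condition placed on table counts cells across the whole table, so a single-cell row inside an otherwise normal table still passes. Moving the condition onto tr filters row by row. A minimal comparison, using made-up markup where the last row has only one td:

```php
<?php
// Hypothetical markup: one table whose last row has a single td.
$html = '<table><tbody>'
      . '<tr><th>Name</th><td>Example Tower</td></tr>'
      . '<tr><td>Footnote</td></tr>'
      . '</tbody></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Table-level condition: the table holds three cells in total,
// so every td in it is returned, including the lone one.
$perTable = $xpath->query('//table[count(descendant::td | descendant::th) > 1]//tr/td');

// Row-level condition: only td cells from rows with at least two cells.
$perRow = $xpath->query('//tr[count(td | th) > 1]/td');

echo $perTable->length, ' ', $perRow->length, PHP_EOL;
```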