
XPath for td/th based on tr count

I am using XPath to scrape a web page.

The structure is:

<table>
  <tbody>
     <tr>
        <th>
        <td>

but one of those tr elements contains just one th or one td:

<table>
      <tbody>
         <tr>
            <th>

So I only want to scrape a tr if it contains two elements. I am using this path:

 $route = $path->query("//table[count(tr) > 1]//tr/th");

or

 $route = $path->query("//table[count(tr) > 1]//tr/td");

But it's not working.

Here is the link to the original tables. In the first table, the last two tr elements have just one td each, which is causing the problem. The 2nd and 3rd tables have the same issue.

https://www.daiwahouse.co.jp/mansion/kanto/tokyo/y35/gaiyo.html

      $route = $path->query("//tr[count(*) >= 2]/th");
      foreach ($route as $th){
          $property[] = trim($th->nodeValue);
      }

      $route = $path->query("//tr[count(*) >= 2]/td");
      foreach ($route as $td){
          $value[] = trim($td->nodeValue);
      }

I am trying to select th and td at the same time, but a tr that contains only one td causes a problem: in the end the td count and the th count are not the same, because I am scraping more td elements than th elements.

This XPath,

//table[count(.//tr) > 1]//th

will select all th elements within all table elements that have more than one tr descendant (regardless of whether tbody is present).
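A minimal sketch of why counting descendants matters, assuming made-up sample markup (the cell texts are hypothetical, not taken from the linked page). It compares count(tr), which only sees direct children of table, with count(.//tr), using the descendant form //th to reach the header cells:

```php
<?php
// Hypothetical markup: the first table has an explicit tbody and two rows,
// the second table has a single row.
$html = '<table><tbody>'
      . '<tr><th>Name</th><td>Sample</td></tr>'
      . '<tr><th>Floors</th><td>12</td></tr>'
      . '</tbody></table>'
      . '<table><tbody><tr><th>Single row</th></tr></tbody></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// count(tr) only counts direct children of table, so the tbody hides the rows.
$directCount = $xpath->query('//table[count(tr) > 1]')->length;

// count(.//tr) counts rows at any depth below the table.
$headers = [];
foreach ($xpath->query('//table[count(.//tr) > 1]//th') as $th) {
    $headers[] = trim($th->nodeValue);
}

var_dump($directCount); // int(0): no table has tr as a direct child here
print_r($headers);      // Name, Floors
```

This is likely why the paths in the question matched nothing: the rows there sit inside tbody, so count(tr) on the table is 0.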


This XPath,

//tr[count(*) > 1]/*

will select all children of tr elements with more than one child.
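As an illustration, here is a small sketch of that expression with PHP's DOMXPath. The markup is a hypothetical stand-in with two complete rows and one row holding a lone td:

```php
<?php
// Hypothetical sample markup: two complete rows and one row with a lone td.
$html = '<table><tbody>'
      . '<tr><th>Name</th><td>Sample A</td></tr>'
      . '<tr><th>Floors</th><td>12</td></tr>'
      . '<tr><td>Footnote only</td></tr>'
      . '</tbody></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Collect every child element of rows that have more than one child.
$cells = [];
foreach ($xpath->query('//tr[count(*) > 1]/*') as $cell) {
    $cells[] = $cell->nodeName . ': ' . trim($cell->nodeValue);
}

print_r($cells); // th: Name, td: Sample A, th: Floors, td: 12
```

The lone "Footnote only" cell is skipped because its row has only one child element.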


This XPath,

//tr[count(th) = count(td)]/*

will select all children of tr elements where the number of th children equals the number of td children.
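One way to keep headers and values aligned is to select the matching rows first and then read th and td relative to each row. The following is only a sketch under assumptions: the markup and the $pairs array are invented for illustration, and each row is assumed to hold at most one th and one td.

```php
<?php
// Hypothetical sample markup: two th/td rows and one row with a lone td.
$html = '<table><tbody>'
      . '<tr><th>Address</th><td>Tokyo</td></tr>'
      . '<tr><th>Units</th><td>35</td></tr>'
      . '<tr><td>Note without a header</td></tr>'
      . '</tbody></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Select whole rows, then query th and td relative to each row so that
// a header can never be paired with a value from a different row.
$pairs = [];
foreach ($xpath->query('//tr[count(th) = count(td)]') as $row) {
    $th = $xpath->query('th', $row)->item(0);
    $td = $xpath->query('td', $row)->item(0);
    if ($th !== null && $td !== null) {
        $pairs[trim($th->nodeValue)] = trim($td->nodeValue);
    }
}

print_r($pairs); // Address => Tokyo, Units => 35
```

Reading both cells from the same row avoids the mismatch described in the question, where two separate queries returned lists of different lengths.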


The OP has posted a link to the site. Its root element is in the XHTML namespace, xmlns="http://www.w3.org/1999/xhtml".

See How does XPath deal with XML namespaces?
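A hedged sketch of what the namespace changes when such a page is parsed as XML, for example with DOMDocument::loadXML. The markup is a reduced, made-up stand-in for the real page, and the prefix x is an arbitrary choice:

```php
<?php
// Hypothetical XHTML fragment with the namespace declared on the root element.
$xml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><table>'
     . '<tr><th>Name</th><td>Sample</td></tr>'
     . '<tr><td>Lone cell</td></tr>'
     . '</table></body></html>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// An unprefixed name test such as tr only matches elements in no namespace.
$plain = $xpath->query('//tr[count(*) > 1]/*')->length;

// Bind a prefix (here "x", an arbitrary choice) to the XHTML namespace URI.
$xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');
$prefixed = $xpath->query('//x:tr[count(*) > 1]/*')->length;

var_dump($plain);    // int(0)
var_dump($prefixed); // int(2)
```

The count(*) part needs no prefix because * matches element children in any namespace.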

If I understand correctly, you want the th elements in tr elements that contain exactly two elements? I think that this is what you need:

//th[count(../*) = 2]
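A short sketch of how this expression behaves, using invented rows with one, two, and three cells. Note that = 2 is stricter than > 1, so a row with three cells is excluded as well:

```php
<?php
// Hypothetical rows: two cells, one cell, and three cells.
$html = '<table><tbody>'
      . '<tr><th>A</th><td>1</td></tr>'
      . '<tr><th>B</th></tr>'
      . '<tr><th>C</th><td>2</td><td>3</td></tr>'
      . '</tbody></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// ../* is the set of element children of the th's parent row.
$headers = [];
foreach ($xpath->query('//th[count(../*) = 2]') as $th) {
    $headers[] = trim($th->nodeValue);
}

print_r($headers); // A
```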

I've included a more explicit path in my answer, using a union (|) to count the th and td elements:

$html = '
  <html>
    <body>
      <table>
        <tbody>
          <tr>
            <th>I am Included</th>
            <td>I am a column</td>
          </tr>
        </tbody>
      </table>
      <table>
        <tbody>
          <tr>
            <th>I am ignored</th>
          </tr>
        </tbody>
      </table>
      <table>
        <tbody>
          <tr>
            <th>I am also Included</th>
            <td>I am a column</td>
          </tr>
        </tbody>
      </table>
    </body>
  </html>
';

$doc = new DOMDocument();
$doc->loadHTML( $html );

$xpath = new DOMXPath( $doc );
$result = $xpath->query("//table[ count( tbody/tr/td | tbody/tr/th ) > 1 ]/tbody/tr");

foreach( $result as $node )
{
  var_dump( $doc->saveHTML( $node ) );
}

// string(88) "<tr><th>I am Included</th><td>I am a column</td></tr>"
// string(93) "<tr><th>I am also Included</th><td>I am a column</td></tr>"

You can also use this form to match descendants at any depth:

//table[ count( descendant::td | descendant::th ) > 1]//tr

Change the XPath after the condition (the square-bracketed part) to change what is returned.
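For example, the same condition can be reused with different trailing paths. This is a sketch with made-up markup; $predicate and $texts are hypothetical helper names introduced only for this illustration:

```php
<?php
// Hypothetical markup: the first table has a complete row plus a lone td row,
// the second table has only a single th.
$html = '<table><tbody>'
      . '<tr><th>Kept</th><td>Value</td></tr>'
      . '<tr><td>Lone cell</td></tr>'
      . '</tbody></table>'
      . '<table><tbody><tr><th>Skipped</th></tr></tbody></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// The condition stays the same; only the path after it changes.
$predicate = '//table[count(descendant::td | descendant::th) > 1]';

// Helper that returns the trimmed text of every node an expression matches.
$texts = function (string $expr) use ($xpath): array {
    $out = [];
    foreach ($xpath->query($expr) as $node) {
        $out[] = trim($node->nodeValue);
    }
    return $out;
};

$th = $texts($predicate . '//th');
$td = $texts($predicate . '//td');

print_r($th); // Kept
print_r($td); // Value, Lone cell
```

Because the condition is evaluated per table, a row with a single cell is still returned when its table contains other cells, as "Lone cell" shows. Filtering per row needs a predicate on tr instead.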
