XPath for td/th based on tr count

I am using XPath to scrape a web page.

The structure is:

<table>
  <tbody>
     <tr>
        <th>
        <td>

But one of the tr elements contains just one th or one td:

<table>
      <tbody>
         <tr>
            <th>

So I only want to scrape a TR if it contains two tags. I am using the path

 $route = $path->query("//table[count(tr) > 1]//tr/th");

or

 $route = $path->query("//table[count(tr) > 1]//tr/td");

But it's not working.

Here is a link to the original tables. The last two TRs of the first table have just one TD each, and that is causing the problem. The 2nd and 3rd tables have the same issue.

https://www.daiwahouse.co.jp/mansion/kanto/tokyo/y35/gaiyo.html

      $route = $path->query("//tr[count(*) >= 2]/th");
      foreach ($route as $th){
          $property[] = trim($th->nodeValue);
      }

      $route = $path->query("//tr[count(*) >= 2]/td");
      foreach ($route as $td){
          $value[] = trim($td->nodeValue);
      }

I am trying to select TH and TD at the same time, but if a TR contains only one TD it causes a problem: the TD count and the TH count no longer match, so I end up scraping more TDs than THs.

This XPath,

//table[count(.//tr) > 1]//th

will select all th elements within all table elements that have more than one tr descendant (regardless of whether tbody is present).


This XPath,

//tr[count(*) > 1]/*

will select all children of tr elements with more than one child.


This XPath,

//tr[count(th) = count(td)]/*

will select all children of tr elements where the number of th children equals the number of td children.
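
As a rough illustration of how the last expression could be used to keep headers and values aligned, the sketch below selects the qualifying rows first and then reads th and td relative to each row. The sample markup and the $pairs array are made up for this example, and it assumes each kept row has exactly one th and one td.

```php
<?php
// Hypothetical sample markup; the real page has more rows and attributes.
$html = '
<table>
  <tbody>
    <tr><th>Name</th><td>Sample Residence</td></tr>
    <tr><th>Floors</th><td>12</td></tr>
    <tr><td>Footnote row without a header cell</td></tr>
  </tbody>
</table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$pairs = [];
// Keep only rows whose th count equals their td count.
foreach ($xpath->query('//tr[count(th) = count(td)]') as $row) {
    // Query relative to the current row so each th is paired with its own td.
    $th = $xpath->query('th', $row)->item(0);
    $td = $xpath->query('td', $row)->item(0);
    if ($th !== null && $td !== null) {
        $pairs[trim($th->nodeValue)] = trim($td->nodeValue);
    }
}

print_r($pairs);
```

Reading both cells from the same row means a stray one-cell row cannot shift the values against the headers, which is the mismatch described in the question.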


OP posted a link to the site. The root element is in the xmlns="http://www.w3.org/1999/xhtml" namespace.

See How does XPath deal with XML namespaces?
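
A minimal sketch of what the namespace means for PHP, assuming the page is parsed as XML with loadXML. The markup is invented for this example and the prefix x is an arbitrary choice. As far as I know, DOMDocument::loadHTML uses the HTML parser and does not put elements in a namespace, so unprefixed names keep working there.

```php
<?php
// Hypothetical XHTML snippet whose root declares the XHTML namespace.
$xhtml = '
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table>
      <tr><th>Name</th><td>Sample Residence</td></tr>
      <tr><td>Footnote only</td></tr>
    </table>
  </body>
</html>';

$doc = new DOMDocument();
$doc->loadXML($xhtml); // XML parser: elements keep their namespace
$xpath = new DOMXPath($doc);

// Unprefixed names match elements in no namespace, so nothing is found here.
$plain = $xpath->query('//tr[count(*) > 1]/th');

// "x" is an arbitrary prefix bound to the XHTML namespace for this sketch.
$xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');
$prefixed = $xpath->query('//x:tr[count(*) > 1]/x:th');

echo $plain->length, "\n";
echo $prefixed->length, "\n";
```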

If I understand correctly, you want the th elements in tr elements that contain exactly two elements? I think that this is what you need:

//th[count(../*) = 2]
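
One way this could be plugged into the question's loop is sketched below. The sample markup is made up, and the following-sibling step assumes the value cell is the td that follows the th in the same row.

```php
<?php
// Hypothetical sample markup for illustration only.
$html = '
<table>
  <tr><th>Name</th><td>Sample Residence</td></tr>
  <tr><th>Lonely header</th></tr>
</table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$property = [];
$value = [];
// Only th elements whose parent row has exactly two child elements.
foreach ($xpath->query('//th[count(../*) = 2]') as $th) {
    $property[] = trim($th->nodeValue);
    // Assumption: the matching value is the first td after this th in the row.
    $td = $xpath->query('following-sibling::td[1]', $th)->item(0);
    $value[] = $td !== null ? trim($td->nodeValue) : null;
}

print_r($property);
print_r($value);
```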

I've included a more explicit path in my answer, using a union (|) to count the TH and TD elements together:

$html = '
  <html>
    <body>
      <table>
        <tbody>
          <tr>
            <th>I am Included</th>
            <td>I am a column</td>
          </tr>
        </tbody>
      </table>
      <table>
        <tbody>
          <tr>
            <th>I am ignored</th>
          </tr>
        </tbody>
      </table>
      <table>
        <tbody>
          <tr>
            <th>I am also Included</th>
            <td>I am a column</td>
          </tr>
        </tbody>
      </table>
    </body>
  </html>
';

$doc = new DOMDocument();
$doc->loadHTML( $html );

$xpath = new DOMXPath( $doc );
$result = $xpath->query("//table[ count( tbody/tr/td | tbody/tr/th ) > 1 ]/tbody/tr");

foreach( $result as $node )
{
  var_dump( $doc->saveHTML( $node ) );
}

// Output (whitespace inside each row is not shown here):
// string(88) "<tr><th>I am Included</th><td>I am a column</td></tr>"
// string(93) "<tr><th>I am also Included</th><td>I am a column</td></tr>"

You can also use this for descendants at any depth:

//table[ count( descendant::td | descendant::th ) > 1]//tr

Change the part of the XPath after the condition (the square-bracketed part) to change what is returned.
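
A small sketch of swapping the trailing path while keeping the same condition. The sample table and the $condition variable are invented for this example. Note that the condition is tested on the table as a whole, so a one-cell row inside a qualifying table is still returned. If that matters, a per-row condition such as //tr[count(*) > 1] may be a better fit.

```php
<?php
// Hypothetical table mixing a two-cell row and a one-cell row.
$html = '
<table>
  <tbody>
    <tr><th>Name</th><td>Sample Residence</td></tr>
    <tr><td>Footnote only</td></tr>
  </tbody>
</table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Same table-level condition, reused with different trailing paths.
$condition = '//table[ count( descendant::td | descendant::th ) > 1 ]';

$rows = $xpath->query($condition . '//tr');    // whole rows
$ths  = $xpath->query($condition . '//tr/th'); // header cells only
$tds  = $xpath->query($condition . '//tr/td'); // data cells only

echo $rows->length, ' ', $ths->length, ' ', $tds->length, "\n";
```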
