Scraping data using XPath, PHP and DOMDocument: getting the inner content of a certain table

There is an external page that I need data from. It is some kind of list of restaurant orders. The page has tables, and each table has a class telling which kind of table it is, for example "delivered orders".

Inside these tables there are rows and tds. I need the td values of each row for my data array.

So what I do is run an XPath query to get the contents of the table with the class status-kitchen. This works, but now I need all the rows and tds inside this table, separated by class. For example, <td class="code order-code">0000</td> should end up as 'ordercode' => val in my array. So I added another loop inside the loop with another XPath query.

But now I see all order codes, not only those of the kitchen, because the inner query searches the whole HTML again. I just want to run the query on the parent foreach result. How can I do this?

$result = array();
$html = $sc->login(); //curl result
$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);

$classname = "order-link wide status-kitchen";
$td = $xPath->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");

foreach($td as $val){

    $classname = "code order-code";
    $td2 = $xPath->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
    foreach($td2 as $v){

        $result[] = $v->nodeValue;
    }
}

print_r($result);

Example of how the HTML looks:

/* Order list of kitchen */

<table class="order-list">
      <tbody class="order-link wide status-kitchen" rel="#oQOP3PRN511"> // REPEAT
        <tr>
          <td class="time">17:43</td>
          <td class="time-delivery ">
            18:45           </td>
          <td class="code order-code">00000</td>
          <td>address data</td>
          <td class="distance">
                        </td>
          <td class="amount">€ 29,75</td>
        </tr>
      </tbody>
      <tbody class="order-link wide status-kitchen" rel="#oQOP3PRN511"> //REPEAT
        <tr>
          <td class="time">17:43</td>
          <td class="time-delivery ">
            18:45           </td>
          <td class="code order-code">00000</td>
          <td>address data</td>
          <td class="distance">
                        </td>
          <td class="amount">€ 29,75</td>
        </tr>
      </tbody>
</table>

/* Order list delivered */
<table class="order-list">
      <tbody class="order-link wide status-kitchen" rel="#oQOP3PRN511"> //REPEAT
        <tr>
          <td class="time">17:43</td>
          <td class="time-delivery ">
            18:45           </td>
          <td class="code order-code">00000</td>
          <td>address data</td>
          <td class="distance">
                        </td>
          <td class="amount">€ 29,75</td>
        </tr>
      </tbody>
      <tbody class="order-link wide status-kitchen" rel="#oQOP3PRN511"> //REPEAT
        <tr>
          <td class="time">17:43</td>
          <td class="time-delivery ">
            18:45           </td>
          <td class="code order-code">00000</td>
          <td>address data</td>
          <td class="distance">
                        </td>
          <td class="amount">€ 29,75</td>
        </tr>
      </tbody>

To run your second XPath query starting from a given node in the DOM, begin the query with . (a dot) and pass the context node as the second parameter to query().

Example:

$td2 = $xPath->query(".//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]", $val);
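For context, here is a minimal, self-contained sketch of how the context-node approach could fit into the full loop. The sample markup, the status-delivered class, the hasClass() helper and the choice of the last class name as the array key are all assumptions made for illustration; they are not part of the original page or code.

```php
<?php
// Sample markup modeled on the question. The second table's
// "status-delivered" class is an assumption for this sketch.
$html = <<<'HTML'
<table class="order-list">
  <tbody class="order-link wide status-kitchen">
    <tr>
      <td class="time">17:43</td>
      <td class="code order-code">00001</td>
      <td class="amount">29,75</td>
    </tr>
  </tbody>
</table>
<table class="order-list">
  <tbody class="order-link wide status-delivered">
    <tr>
      <td class="time">16:10</td>
      <td class="code order-code">00002</td>
      <td class="amount">12,50</td>
    </tr>
  </tbody>
</table>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();
$xPath = new DOMXPath($dom);

// Hypothetical helper: builds the usual "has this class token" XPath test.
function hasClass(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

$result = array();
foreach ($xPath->query('//tbody[' . hasClass('status-kitchen') . ']') as $tbody) {
    // The leading dot plus the second argument restrict the search to $tbody.
    foreach ($xPath->query('.//tr', $tbody) as $tr) {
        $row = array();
        foreach ($xPath->query('./td[@class]', $tr) as $td) {
            // Assumption: use the last class name of the td as the array key.
            $classes = preg_split('/\s+/', trim($td->getAttribute('class')));
            $row[end($classes)] = trim($td->nodeValue);
        }
        $result[] = $row;
    }
}

print_r($result);
```

Only the kitchen row ends up in $result, because every inner query is evaluated relative to the node passed as the second argument instead of the document root.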

You may want to avoid using HTML DOM and similar tools for HTML scraping, as they will not parse certain types of invalid HTML, and in particular have problems with tables.

To get all trs:

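preg_match_all( '~<tr.*?>(.*?)<\/tr>~is', $page, $trs );
// $trs[0] holds the full matches, $trs[1] the captured inner HTML of each row
foreach( $trs[1] as $tr )
{
    preg_match_all( '~<td.*?>(.*?)<\/td>~is', $tr, $tds );
    print_r( $tds[1] ); // inner HTML of each td in this row
}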

This gets all TR elements, with or without attributes and with or without inner HTML. The i flag means case-insensitive and the s flag means that . also matches newlines (\n). Then the same for TD.
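To illustrate what the two flags do, and how preg_match_all() lays out its results, here is a small sketch. The sample row and the variable names are made up for this example, and the patterns use a slightly stricter <tr\b[^>]*> form than the one above so that tags such as <track> are not matched by accident.

```php
<?php
// Sample row spanning several lines, with an upper-case tag name.
$page = <<<'HTML'
<TR class="a">
  <td>17:43</td>
  <td class="code order-code">00001</td>
</TR>
HTML;

// Without the s flag, "." stops at newlines, so a multi-line row is not matched.
$without = preg_match_all('~<tr\b[^>]*>(.*?)</tr>~i', $page, $m1);

// With i and s: case-insensitive, and "." also matches newlines.
$with = preg_match_all('~<tr\b[^>]*>(.*?)</tr>~is', $page, $m2);

// In the default PREG_PATTERN_ORDER, $m2[0] holds the full matches and
// $m2[1] holds the first capture group (the inner HTML of each row).
preg_match_all('~<td\b[^>]*>(.*?)</td>~is', $m2[1][0], $tds);

var_dump($without, $with);
print_r($tds[1]);
```

The return value of preg_match_all() is the number of full matches, which makes it easy to see that the row is only found once the s flag is present.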

See a class I posted here that does the same thing:

Get Inner HTML - PHP

I have not used it for years, though, so I am not sure about the function names. I just use regex on its own.

UPDATE: Using the above class:

$c = new HTMLQuery( $html );
$tbs = $c->getElements( 'tbody', 'class', 'order-link wide status-kitchen' );
print_r( $tbs );
// you could then call a new HTMLQuery and query trs, etc., or:
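// assuming getElements() returns each tbody as an HTML string
foreach( $tbs as $tb )
{
    preg_match_all( '~<tr.*?>(.*?)<\/tr>~is', $tb, $trs );
    // $trs[1] holds the captured inner HTML of each row
    foreach( $trs[1] as $tr )
    {
        preg_match_all( '~<td.*?>(.*?)<\/td>~is', $tr, $tds );
        print_r( $tds[1] ); // inner HTML of each td in this row
    }
}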
