Scraping data using XPath, PHP and DOMDocument: getting the inner content of a certain table

There is an external page that I need data from. It is a kind of list of restaurant orders. The page has several tables, and each table has a class telling which kind of table it is, for example "delivered orders".

Inside these tables there are rows and tds. I need the td values of each row for my data array.

So what I do is run an XPath query to get the contents of the table with the class status-kitchen. This works, but now I need all the rows and tds inside this table, separated by class. For example, <td class="ordercode">0000</td> should end up as 'ordercode' => val in my array. So I added another loop inside the loop with another XPath query.

But now I see all order codes, not only those of the kitchen, because the inner query searches the whole HTML again. I just want to run the query on the result of the parent foreach, or something like that. How can I do this?

$result = array();
$html = $sc->login(); //curl result
$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);

$classname = "order-link wide status-kitchen";
$td = $xPath->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");

foreach($td as $val){

    $classname = "code order-code";
    $td2 = $xPath->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
    foreach($td2 as $v){

        $result[] = $v->nodeValue;
    }
}

print_r($result);

Example of how the HTML looks:

/* Order list of kitchen */

<table class="order-list">
      <tbody class="order-link wide status-kitchen" rel="#oQOP3PRN511"> // REPEAT
        <tr>
          <td class="time">17:43</td>
          <td class="time-delivery ">
            18:45           </td>
          <td class="code order-code">00000</td>
          <td>address data</td>
          <td class="distance">
                        </td>
          <td class="amount">€ 29,75</td>
        </tr>
      </tbody>
      <tbody class="order-link wide status-kitchen" rel="#oQOP3PRN511"> //REPEAT
        <tr>
          <td class="time">17:43</td>
          <td class="time-delivery ">
            18:45           </td>
          <td class="code order-code">00000</td>
          <td>address data</td>
          <td class="distance">
                        </td>
          <td class="amount">€ 29,75</td>
        </tr>
      </tbody>
</table>

/* Order list delivered */
<table class="order-list">
      <tbody class="order-link wide status-kitchen" rel="#oQOP3PRN511"> //REPEAT
        <tr>
          <td class="time">17:43</td>
          <td class="time-delivery ">
            18:45           </td>
          <td class="code order-code">00000</td>
          <td>address data</td>
          <td class="distance">
                        </td>
          <td class="amount">€ 29,75</td>
        </tr>
      </tbody>
      <tbody class="order-link wide status-kitchen" rel="#oQOP3PRN511"> //REPEAT
        <tr>
          <td class="time">17:43</td>
          <td class="time-delivery ">
            18:45           </td>
          <td class="code order-code">00000</td>
          <td>address data</td>
          <td class="distance">
                        </td>
          <td class="amount">€ 29,75</td>
        </tr>
      </tbody>
</table>

To run your second XPath query starting from a given node in the DOM, begin the query with . and pass the context node as the second parameter to query().

Example:

$td2 = $xPath->query(".//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]", $val);
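Put together, the context-node approach might look like the sketch below. The sample markup is a shortened, hypothetical version of the HTML in the question (the order codes and the status-delivered class are made up for illustration), and using the last class name of each td as the array key is just one possible choice, not something the question prescribes.

```php
<?php
// Hypothetical sample markup modelled on the question's HTML.
$html = <<<HTML
<table class="order-list">
  <tbody class="order-link wide status-kitchen">
    <tr>
      <td class="time">17:43</td>
      <td class="code order-code">00001</td>
      <td>address data</td>
    </tr>
  </tbody>
  <tbody class="order-link wide status-delivered">
    <tr>
      <td class="time">16:10</td>
      <td class="code order-code">00002</td>
      <td>address data</td>
    </tr>
  </tbody>
</table>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);

$result = array();

// Outer query: every kitchen <tbody> in the whole document.
$kitchen = $xPath->query("//tbody[contains(concat(' ', normalize-space(@class), ' '), ' status-kitchen ')]");

foreach ($kitchen as $tbody) {
    // Inner queries: the leading "." plus the second argument restrict
    // the search to nodes inside the current context node only.
    foreach ($xPath->query(".//tr", $tbody) as $tr) {
        $row = array();
        foreach ($xPath->query("./td[@class]", $tr) as $td) {
            // Use the last class name as the array key (an arbitrary choice).
            $classes = preg_split('/\s+/', trim($td->getAttribute('class')));
            $row[end($classes)] = trim($td->nodeValue);
        }
        $result[] = $row;
    }
}

print_r($result);
```

Only the row from the kitchen tbody ends up in $result; tds without a class attribute (such as the address cell) are skipped by the [@class] predicate.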

You want to avoid using HTML DOM and similar tools for HTML scraping, as they will not parse certain types of invalid HTML, and they particularly have problems with tables.

To get all trs:

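preg_match_all( '~<tr.*?>(.*?)<\/tr>~is', $page, $trs );
// $trs[0] holds the full matches, $trs[1] the captured inner HTML of each row
foreach( $trs[1] as $tr )
{
    preg_match_all( '~<td.*?>(.*?)<\/td>~is', $tr, $tds );
    // $tds[1] holds the captured inner HTML of each cell
    print_r( $tds );
}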

This gets all TR elements, with any or no attributes and any or no inner HTML. The i flag makes the match case-insensitive, and the s flag makes . match newlines (\n) as well. Then the same is done for TD.
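To see what the two flags do, here is a small sketch run on a made-up input string (the $page value is hypothetical, not taken from the real page): the tags are upper-case, and one cell contains a line break.

```php
<?php
// Hypothetical input: upper-case tags and a cell that spans two lines.
$page = "<TR><TD class=\"time\">17:43</TD><td>\n18:45</td></TR>";

preg_match_all('~<tr.*?>(.*?)</tr>~is', $page, $trs);

// $trs[0] holds the full matches, $trs[1] the captured inner HTML.
$cells = array();
foreach ($trs[1] as $tr) {
    preg_match_all('~<td.*?>(.*?)</td>~is', $tr, $tds);
    // Trim the whitespace around each cell's content.
    $cells[] = array_map('trim', $tds[1]);
}

print_r($cells);
```

Without the i flag the upper-case TR and TD tags would not match, and without the s flag the second cell would be missed because its content starts with a newline.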

See a class I posted here that does the same thing:

Get Inner HTML - PHP

I have not used it for years, though, so I am not sure about the function. These days I just use regex on its own.

UPDATE: Using the above class:

$c = new HTMLQuery( $html );
$tbs = $c->getElements( 'tbody', 'class', 'order-link wide status-kitchen' );
print_r( $tbs );
// you could then call a new HTMLQuery and query trs, etc., or:
foreach( $tbs as $tb )
{
    preg_match_all( '~<tr.*?>(.*?)<\/tr>~is', $tb, $trs );
    // iterate over the captured row contents, not the whole match array
    foreach( $trs[1] as $tr )
    {
        preg_match_all( '~<td.*?>(.*?)<\/td>~is', $tr, $tds );
        print_r( $tds );
    }
}

