Get Paginated Links with PHP and Simple HTML DOM

I have this code that tries to get the pagination links using PHP, but the result is not quite right. Could anyone help me?

What I get back is just a recurring instance of the first link.

<?php
include_once('simple_html_dom.php');
function dlPage($href) {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($curl, CURLOPT_HEADER, false);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_URL, $href);
    curl_setopt($curl, CURLOPT_REFERER, $href);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4");
    $str = curl_exec($curl);
    curl_close($curl);

    // Create a DOM object
    $dom = new simple_html_dom();
    // Load HTML from a string
    $dom->load($str);


    $Next_Link = array();
    foreach($dom->find('a[title=Next]') as $element){
        $Next_Link[] = $element->href; 
    }

    print_r($Next_Link);

    $next_page_url = $Next_Link[0];
    if($next_page_url !='') {
        echo '<br>' . $next_page_url;
        $dom->clear();
        unset($dom);

        //load the next page from the pagination to collect the next link
        dlPage($next_page_url);
    }

}

$url = 'https://www.jumia.com.gh/phones/';
$data = dlPage($url);
//print_r($data)
?>

What I want to get is:
mySiteUrl/?facet_is_mpg_child=0&viewType=gridView&page=2
mySiteUrl/?facet_is_mpg_child=0&viewType=gridView&page=3

... and so on, up to the last link in the pagination. Please help.

Here it is. Note that I run the link through htmlspecialchars_decode. The href in the page source is entity-encoded, as in XML, so it contains &amp; where the URL passed to cURL needs a plain &. I also assumed that the return value of dlPage should be the last link in the pagination.

<?php
include_once('simple_html_dom.php');

function dlPage($href, $already_loaded = array()) {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($curl, CURLOPT_HEADER, false);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_URL, $href);
    curl_setopt($curl, CURLOPT_REFERER, $href);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4");
    $htmlPage = curl_exec($curl);
    curl_close($curl);

    echo "Loading From URL:" . $href . "<br/>\n";
    $already_loaded[$href] = true;

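    // Create an empty DOM object (the page was already fetched with cURL above,
    // so there is no need to download it a second time with file_get_html)
    $dom = new simple_html_dom();
    // Load HTML from the string returned by cURL
    $dom->load($htmlPage);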

    $next_page_url = null;
    $items = $dom->find('ul[class="osh-pagination"] li[class="item"] a[title="Next"]');

    foreach ($items as $item) {
        $link = htmlspecialchars_decode($item->href);
        if (!isset($already_loaded[$link])) {
            $next_page_url = $link;
            break;
        }
    }

    if ($next_page_url !== null) {
        $dom->clear();
        unset($dom);

        //load the next page from the pagination to collect the next link
        return dlPage($next_page_url, $already_loaded);
    }

    return $href;
}

$url = 'https://www.jumia.com.gh/phones/';
$data = dlPage($url);
echo "DATA:" . $data . "\n";

And the output is:

Loading From URL:https://www.jumia.com.gh/phones/<br/>
Loading From URL:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=2<br/>
Loading From URL:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=3<br/>
Loading From URL:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=4<br/>
Loading From URL:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=5<br/>
DATA:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=5
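
If you want every page URL rather than only the last one, the same idea can probably be written as a loop instead of recursion. This is only a minimal sketch under assumptions: collectPageUrls, $findNextHref and $maxPages are hypothetical names that are not part of simple_html_dom, and the canned example.com hrefs stand in for the cURL request and the find('a[title="Next"]') call so that the snippet runs without network access.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): follows "Next" links
// in a loop and returns every page URL visited, in order.
// $findNextHref is an assumed callable: given a page URL, it returns the raw
// href of that page's "Next" link (entities still encoded), or null if none.
function collectPageUrls(string $startUrl, callable $findNextHref, int $maxPages = 100): array
{
    $visited = [];
    $url = $startUrl;

    // Stop when there is no next link, when a URL repeats, or at the page cap.
    while ($url !== null && !isset($visited[$url]) && count($visited) < $maxPages) {
        $visited[$url] = true;

        $rawHref = $findNextHref($url);
        // Decode &amp; to & before using the href as a request URL.
        $url = ($rawHref === null) ? null : htmlspecialchars_decode($rawHref);
    }

    return array_keys($visited);
}

// Stand-in for the cURL + find() step, using canned, entity-encoded hrefs.
// On the last page the "Next" link is assumed to point back to the same page.
$fakePages = [
    'https://example.com/phones/'
        => 'https://example.com/phones/?viewType=gridView&amp;page=2',
    'https://example.com/phones/?viewType=gridView&page=2'
        => 'https://example.com/phones/?viewType=gridView&amp;page=3',
    'https://example.com/phones/?viewType=gridView&page=3'
        => 'https://example.com/phones/?viewType=gridView&amp;page=3',
];
$findNext = function (string $url) use ($fakePages) {
    return $fakePages[$url] ?? null;
};

$pageUrls = collectPageUrls('https://example.com/phones/', $findNext);
print_r($pageUrls);
```

In real use, $findNext would wrap the cURL download and the find() selector from the answer above. The $maxPages cap is just a safety net in case the site's pagination never terminates.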
