[英]Get Paginated Links With php and simple html dom
I have this code to try and get the pagination links using php but the result is not quiet right. 我有这段代码尝试使用php获取分页链接,但结果并不安静。 could any one help me.
谁能帮我。
what I get back is just a recurring instance of the first link. 我得到的只是第一个链接的重复实例。
<?php
include_once('simple_html_dom.php');
function dlPage($href) {
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $href);
curl_setopt($curl, CURLOPT_REFERER, $href);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4");
$str = curl_exec($curl);
curl_close($curl);
// Create a DOM object
$dom = new simple_html_dom();
// Load HTML from a string
$dom->load($str);
$Next_Link = array();
foreach($dom->find('a[title=Next]') as $element){
$Next_Link[] = $element->href;
}
print_r($Next_Link);
$next_page_url = $Next_Link[0];
if($next_page_url !='') {
echo '<br>' . $next_page_url;
$dom->clear();
unset($dom);
//load the next page from the pagination to collect the next link
dlPage($next_page_url);
}
}
$url = 'https://www.jumia.com.gh/phones/';
$data = dlPage($url);
//print_r($data)
?>
what i want to get is 我想要得到的是
mySiteUrl/?facet_is_mpg_child=0&viewType=gridView&page=2
mySiteUrl//?facet_is_mpg_child=0&viewType=gridView&page=3
. 。 .
。 .
。 to the last link in the pagination.
到分页中的最后一个链接。 Please help
请帮忙
Here it is. 这里是。 Look that I htmlspecialchars_decode the link.
看看我htmlspecialchars_decode链接。 Cause the href in curl there shouldn't be an & like in xml.
导致curl中的href不应在xml中出现&。 Should the return value of dlPage the last link in Pagination.
返回值dlPage应该是分页中的最后一个链接。 I understood so.
我明白了
<?php
include_once('simple_html_dom.php');
function dlPage($href, $already_loaded = array()) {
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $href);
curl_setopt($curl, CURLOPT_REFERER, $href);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4");
$htmlPage = curl_exec($curl);
curl_close($curl);
echo "Loading From URL:" . $href . "<br/>\n";
$already_loaded[$href] = true;
// Create a DOM object
$dom = file_get_html($href);
// Load HTML from a string
$dom->load($htmlPage);
$next_page_url = null;
$items = $dom->find('ul[class="osh-pagination"] li[class="item"] a[title="Next"]');
foreach ($items as $item) {
$link = htmlspecialchars_decode($item->href);
if (!isset($already_loaded[$link])) {
$next_page_url = $link;
break;
}
}
if ($next_page_url !== null) {
$dom->clear();
unset($dom);
//load the next page from the pagination to collect the next link
return dlPage($next_page_url, $already_loaded);
}
return $href;
}
$url = 'https://www.jumia.com.gh/phones/';
$data = dlPage($url);
echo "DATA:" . $data . "\n";
And the output is: 输出为:
Loading From URL:https://www.jumia.com.gh/phones/<br/>
Loading From URL:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=2<br/>
Loading From URL:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=3<br/>
Loading From URL:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=4<br/>
Loading From URL:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=5<br/>
DATA:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=5
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.