
Scraping 'next' page issue

I am trying to scrape product data, section by section, from a Zen Cart store using Simple HTML DOM. I can scrape data from the first page fine, but when I try to load the 'next' page of products, the site returns the index.php landing page.

If I call the function directly with *http://URLxxxxxxxxxx.com/index.php?main_page=index&cPath=36&sort=20a&page=2*, it scrapes the product information from page 2 fine.

The same thing occurs if I use cURL.

    require_once 'simple_html_dom.php'; // assumed path to the Simple HTML DOM library

    getPrices('http://URLxxxxxxxxxx.com/index.php?main_page=index&cPath=36');

    function getPrices($sectionURL) {
        // Send browser-like headers plus the Zen Cart session cookie.
        $opts = array('http' => array(
            'method' => "GET",
            'header' => "Accept-language: en\r\n" .
                        "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6\r\n" .
                        "Cookie: zenid=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\r\n"
        ));
        $context = stream_context_create($opts);

        $html = file_get_contents($sectionURL, false, $context);
        $dom = new simple_html_dom();
        $dom->load($html);

        // Do cool stuff here with information from page.. product name, image, price and more info URL

        // Follow the "Next Page" link, if any, and scrape that page recursively.
        if ($nextPage = $dom->find('a[title=" Next Page "]', 0)) {
            $nextPageURL = $nextPage->href;
            echo $nextPageURL;
            $dom->clear();
            unset($dom);
            getPrices($nextPageURL);
        } else {
            echo "\nNo more pages to scrape!!";
            $dom->clear();
            unset($dom);
        }
    }

Any ideas on how to fix this problem?

I see lots of potential culprits. You're not keeping track of cookies or setting a Referer header, and there's a good chance simple_html_dom is letting you down.

My recommendation is to proxy your requests through Fiddler or Charles and make sure they look the way they do when coming from a browser.
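As a rough sketch of that suggestion, the stream context used with file_get_contents can carry a Referer header and route traffic through a local debugging proxy. The helper name buildContextOptions and the proxy address tcp://127.0.0.1:8888 are assumptions for illustration (use whatever address your Fiddler or Charles instance actually listens on); 'proxy' and 'request_fulluri' are standard PHP HTTP context options.

```php
<?php
// Hypothetical helper: builds HTTP stream-context options for file_get_contents().
// The proxy address passed in below is an assumption; adjust it to your setup.
function buildContextOptions(?string $referer = null, ?string $proxy = null): array
{
    $headers = [
        'Accept-Language: en',
        'User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
    ];
    if ($referer !== null) {
        // Browsers normally send the previous page as Referer when following a link.
        $headers[] = 'Referer: ' . $referer;
    }

    $http = [
        'method' => 'GET',
        'header' => implode("\r\n", $headers) . "\r\n",
    ];
    if ($proxy !== null) {
        // Route the request through the debugging proxy so it can be inspected.
        $http['proxy'] = $proxy;
        $http['request_fulluri'] = true;
    }

    return ['http' => $http];
}

$opts = buildContextOptions(
    'http://URLxxxxxxxxxx.com/index.php?main_page=index&cPath=36',
    'tcp://127.0.0.1:8888'
);
$context = stream_context_create($opts);
// $html = file_get_contents($nextPageURL, false, $context);
```

With the proxy running, each request the script makes shows up next to the browser's own requests, so differences in URL, cookies, or headers are easy to spot.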

It turned out that the next-page URL being passed to the function in the loop contained &amp; instead of &, and file_get_contents didn't like it.

    $sectionURL = str_replace("&amp;", "&", urldecode(trim($sectionURL)));
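Links read out of an HTML attribute can still carry HTML entities such as &amp;, which is why the scraped href differs from the URL typed by hand. As a possible alternative to replacing that one entity, PHP's built-in html_entity_decode handles entities in general. The helper name normalizeHref below is hypothetical, and this is only a minimal sketch, not necessarily what the original fix used.

```php
<?php
// Hypothetical helper name; html_entity_decode() and trim() are PHP built-ins.
function normalizeHref(string $href): string
{
    // Decode entities such as &amp; back to their literal characters.
    return html_entity_decode(trim($href), ENT_QUOTES | ENT_HTML401, 'UTF-8');
}

// Example href as it might come back from the parsed HTML (entities intact).
$raw = ' http://URLxxxxxxxxxx.com/index.php?main_page=index&amp;cPath=36&amp;sort=20a&amp;page=2 ';
$clean = normalizeHref($raw);
echo $clean, "\n";
```

Inside getPrices, the same call could be applied to $nextPage->href before the recursive call, so every request uses a plain & between query parameters.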
