
PHP cURL crawler doesn't fetch all data

I'm trying to write my first crawler using PHP with the cURL library. My aim is to fetch data from one site systematically, which means the code doesn't follow all hyperlinks on the given site but only specific links.

The logic of my code is to go to the main page, get the links for several categories, and store them in an array. Once that's done, the crawler goes to each category page and checks whether the category has more than one page. If so, it stores the subpage links in another array. Finally, I merge the arrays to get the links of all the pages that need to be crawled, and start fetching the required data. A sketch of that flow is shown below.
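To illustrate, here is a minimal sketch of that flow. The XPath expressions and class names are hypothetical placeholders for the target site's markup, and get_url() is the fetch function shown further down:

// Sketch of the crawl flow described above. The XPath queries are
// placeholders -- the real selectors depend on the target site's markup.
$categoryLinks = array();
$subpageLinks  = array();

$dom = new DOMDocument();
@$dom->loadHTML(get_url('http://www.site.com/'));
$xpath = new DOMXPath($dom);

// 1. Collect the category links from the main page.
foreach ($xpath->query('//a[@class="category"]/@href') as $href) {
    $categoryLinks[] = $href->nodeValue;
}

// 2. Visit each category and collect its pagination links, if any.
foreach ($categoryLinks as $link) {
    $dom = new DOMDocument();
    @$dom->loadHTML(get_url($link));
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//a[@class="subpage"]/@href') as $href) {
        $subpageLinks[] = $href->nodeValue;
    }
}

// 3. Merge both arrays to get the full list of pages to crawl.
$allLinks = array_merge($categoryLinks, $subpageLinks);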

I call the function below to start a cURL session and fetch the data into a variable, which I later pass to a DOM object and parse with XPath. I store cURL's total_time and http_code in a log file.

The problem is that the crawler runs for 5-6 minutes, then stops and doesn't fetch all the required subpage links. I print the contents of the arrays to check the result. I can't see any HTTP errors in my log; all pages return an HTTP 200 status code. I also can't see any PHP-related errors, even with PHP debugging turned on on my localhost.

I assume the site blocks my crawler after a few minutes because of too many requests, but I'm not sure. Is there any way to get more detailed debugging output? Do you think PHP is adequate for this type of task? I want to use the same mechanism to fetch content from more than 100 other sites later on.

My cURL code is as follows:

function get_url($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    $info = curl_getinfo($ch);

    // Log the fetch time and HTTP status code for every request.
    $logfile = fopen("crawler.log", "a");
    fwrite($logfile, 'Page ' . $info['url'] . ' fetched in ' . $info['total_time'] . ' seconds. Http status code: ' . $info['http_code'] . "\n");
    fclose($logfile);
    curl_close($ch);

    return $data;
}

// Start to crawl the main page.

$site2crawl = 'http://www.site.com/';

$dom = new DOMDocument();
@$dom->loadHTML(get_url($site2crawl));
$xpath = new DOMXPath($dom);
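One way to get the more detailed diagnostics asked about above is to enable cURL's verbose trace and check for transport-level errors after each transfer. A sketch of how get_url could be extended (the trace file name is arbitrary; CURLOPT_VERBOSE, CURLOPT_STDERR, curl_errno and curl_error are standard cURL facilities):

function get_url_verbose($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($ch, CURLOPT_URL, $url);

    // Write a full protocol trace (DNS lookup, connect, request and
    // response headers) to a separate file.
    $trace = fopen('curl_verbose.log', 'a');
    curl_setopt($ch, CURLOPT_VERBOSE, 1);
    curl_setopt($ch, CURLOPT_STDERR, $trace);

    $data = curl_exec($ch);

    // curl_exec() returns false on transport-level failures that never
    // produce an HTTP status code, so log those explicitly.
    if ($data === false) {
        error_log('cURL error ' . curl_errno($ch) . ': ' . curl_error($ch));
    }

    fclose($trace);
    curl_close($ch);

    return $data;
}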

Use set_time_limit to extend the amount of time your script can run. That is why you are getting Fatal error: Maximum execution time of 30 seconds exceeded in your error log.
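For example (set_time_limit is a standard PHP function; an argument of 0 removes the limit entirely):

// Allow the script to run indefinitely instead of the default 30 seconds.
set_time_limit(0);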

Do you need to run this on a web server? If not, you should try the CLI version of PHP - it is exempt from common restrictions such as the execution time limit.
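For reference, a small sketch combining both suggestions: the CLI SAPI has max_execution_time set to 0 by default, and the built-in PHP_SAPI constant tells you which environment the script is running in.

// Only raise the time limit when running under a web server;
// the CLI SAPI already has no execution time limit by default.
if (PHP_SAPI !== 'cli') {
    set_time_limit(0);
}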
