

PHP stalls on loop of curl function with bad url

I have a database of a few thousand URLs, and I am checking the pages for links (I end up looking for specific links). I run the function below in a loop, and every once in a while one of the URLs is bad. When that happens the entire program stalls, stops making progress, and its memory usage starts climbing. I thought adding CURLOPT_TIMEOUT would fix this, but it didn't. Any ideas?

$options = array(
    CURLOPT_RETURNTRANSFER => true,         // return web page
    CURLOPT_HEADER         => false,        // don't return headers
    CURLOPT_FOLLOWLOCATION => true,         // follow redirects
    CURLOPT_ENCODING       => "",           // handle all encodings
    CURLOPT_USERAGENT      =>  "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20080311 Firefox/'",     // who am i
    CURLOPT_AUTOREFERER    => true,         // set referer on redirect
    CURLOPT_TIMEOUT        => 2,          // timeout on response
    CURLOPT_MAXREDIRS      => 10,           // stop after 10 redirects
    CURLOPT_POST           => 0,            // i am sending post data
    CURLOPT_POSTFIELDS     => $curl_data,   // these are my post vars
    CURLOPT_SSL_VERIFYHOST => 0,            // don't verify ssl
    CURLOPT_SSL_VERIFYPEER => false,        //
    CURLOPT_VERBOSE        => 1             //
);

$ch      = curl_init($url);
curl_setopt_array($ch, $options);           // apply the options to the handle
$content = curl_exec($ch);
$err     = curl_errno($ch);
$errmsg  = curl_error($ch) ;
$header  = curl_getinfo($ch);

//  $header['errno']   = $err;
//  $header['errmsg']  = $errmsg;
$header['content'] = $content;

#Extract the raw URL from the current one
$scheme = parse_url($url, PHP_URL_SCHEME); //Ex: http
$host = parse_url($url, PHP_URL_HOST); //Ex: www.google.com
$raw_url = $scheme . '://' . $host; //Ex: http://www.google.com

#Replace the relative link by an absolute one
$relative = array();
$absolute = array();

#String to search
$relative[0] = '/src="\//';
$relative[1] = '/href="\//';

#String to replace by
$absolute[0] = 'src="' . $raw_url . '/';
$absolute[1] = 'href="' . $raw_url . '/';

$source = preg_replace($relative, $absolute, $content); //Ex: src="/image/google.png" to src="http://www.google.com/image/google.png"

return $source;

curl_exec will return false if it cannot find the URL, and the HTTP status code will be zero. Check the result of curl_exec and check the HTTP status code too.

$content = curl_exec($ch);
$httpStatus = curl_getinfo($ch, CURLINFO_HTTP_CODE);
if ($content === false) {
    if ($httpStatus == 0) {
        $content = "link was not found";
    }
}

The way you have it currently, the line of code

$header['content'] = $content;

will get the value of false. This is not what you want.
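
As a rough sketch of how that check might fit into the loop, the code below records each failed URL and moves on instead of passing false further down. The names fetch_page and check_urls are hypothetical, and the short option list is an assumption kept small for illustration; it is not the full option set from the question.

```php
<?php
// Hypothetical helper: fetch one URL and report failure explicitly
// instead of handing false to the code that parses the page.
function fetch_page(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 2,    // same value as in the question
    ]);

    $content    = curl_exec($ch);
    $httpStatus = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $errno      = curl_errno($ch);
    $errmsg     = curl_error($ch);

    return [
        'ok'      => $content !== false,
        'status'  => $httpStatus,
        'errno'   => $errno,
        'errmsg'  => $errmsg,
        'content' => $content === false ? '' : $content,
    ];
}

// Hypothetical loop: collect the bad URLs and keep going.
function check_urls(array $urls): array
{
    $failed = [];
    foreach ($urls as $url) {
        $result = fetch_page($url);
        if (!$result['ok']) {
            $failed[$url] = $result['errmsg']; // remember why it failed
            continue;                          // skip to the next URL
        }
        // ... search $result['content'] for the links of interest here ...
    }
    return $failed;
}
```

Keeping the curl error number and message next to the status code makes it easier to tell afterwards whether a URL failed to resolve, refused the connection, or timed out.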

I am using curl_exec and my code does not stall if it cannot find the URL. The code keeps running. You may end up with nothing in your browser, though, and a message in the Firebug Console like "500 Internal Server Error". Maybe that's what you mean by stall.

So basically you don't know, and are just guessing, that the curl request is stalling.

For this answer I can only guess as well, then. You might need to set the following curl option as well: CURLOPT_CONNECTTIMEOUT

If the connect already stalls, the other timeout setting might not be taken into account. I'm not entirely sure, but please see the question "Why would CURL time out in 1000ms when I have set up timeout upto 3000ms?".
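
A minimal sketch of setting both limits together, in case it helps. The function name fetch_with_timeouts and the default values of 5 and 10 seconds are assumptions chosen for illustration, not recommendations; whether this cures the stall in the question is untested.

```php
<?php
// Hypothetical helper: apply a connect-phase limit and an overall limit.
function fetch_with_timeouts(string $url, int $connectSeconds = 5, int $totalSeconds = 10): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_CONNECTTIMEOUT => $connectSeconds, // seconds to wait while trying to connect
        CURLOPT_TIMEOUT        => $totalSeconds,   // seconds allowed for the whole request
    ]);

    $content = curl_exec($ch);
    $errno   = curl_errno($ch);

    return [
        'content'   => $content, // false on failure
        'errno'     => $errno,
        // curl reports both connect and transfer timeouts with this error code
        'timed_out' => $errno === CURLE_OPERATION_TIMEDOUT,
    ];
}
```

Inside the loop, a call such as fetch_with_timeouts($url, 2, 5) would give up on a host that never answers after about two seconds, and the timed_out flag lets the caller log those URLs separately from other failures.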
