I have a database of a few thousand URL's that I am checking for links on pages (end up looking for specific links) and so I am throwing the below function through a loop and every once and awhile one of the URL's is bad and then the entire program just stalls and stops running and starts building up memory used. I thought adding the CURLOPT_TIMEOUT would fix this but it didn't. Any ideas?
$options = array(
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => false, // don't return headers
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_USERAGENT => "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13'", // who am i
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_TIMEOUT => 2, // timeout on response
CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
CURLOPT_POST => 0, // i am sending post data
CURLOPT_POSTFIELDS => $curl_data, // this are my post vars
CURLOPT_SSL_VERIFYHOST => 0, // don't verify ssl
CURLOPT_SSL_VERIFYPEER => false, //
CURLOPT_VERBOSE => 1 //
);
$ch = curl_init($url);
curl_setopt_array($ch,$options);
$content = curl_exec($ch);
$err = curl_errno($ch);
$errmsg = curl_error($ch) ;
$header = curl_getinfo($ch);
curl_close($ch);
// $header['errno'] = $err;
// $header['errmsg'] = $errmsg;
$header['content'] = $content;
#Extract the raw URl from the current one
$scheme = parse_url($url, PHP_URL_SCHEME); //Ex: http
$host = parse_url($url, PHP_URL_HOST); //Ex: www.google.com
$raw_url = $scheme . '://' . $host; //Ex: http://www.google.com
#Replace the relative link by an absolute one
$relative = array();
$absolute = array();
#String to search
$relative[0] = '/src="\//';
$relative[1] = '/href="\//';
#String to remplace by
$absolute[0] = 'src="' . $raw_url . '/';
$absolute[1] = 'href="' . $raw_url . '/';
$source = preg_replace($relative, $absolute, $content); //Ex: src="/image/google.png" to src="http://www.google.com/image/google.png"
return $source;
curl_exec will return false if it cannot find the URL. The HTTP status code will be zero. Check the results of curl_exec and check the HTTP status code too.
$content = curl_exec($ch);
$httpStatus = curl_getinfo($ch, CURLINFO_HTTP_CODE);
if ( $content === false) {
if ($httpStatus == 0) {
$content = "link was not found";
}
}
....
The way you have it currently, the line of code
header['content'] = $content;
will get the value of false. This is not what you want.
I am using curl_exec and my code does not stall if it cannot find the url. The code keeps running. You may end up with nothing in your browser though and a message in the Firebug Console like "500 Internal Server Error". Maybe that's what you mean by stall.
So basically you don't know and just guess that the curl request is stalling.
For this answer I can only guess as well then. You might need to set one of the following curl option as well: CURLOPT_CONNECTTIMEOUT
If the connect already stalls, the other timeout setting might not be taken into account. I'm not entirely sure, but please see Why would CURL time out in 1000ms when I have set up timeout upto 3000ms? .
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.