There are already similar questions on Stack Overflow, but none of their solutions have worked for me. I'm trying to fetch a page on LoveIt.com with cURL, but it returns a 404 error, while the URL works fine in the browser:
$url = 'http://loveit.com/loves/P0D1jlFaIOzzZfZqj_bY3KV';
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
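curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_REFERER, 'http://loveit.com/');
// Run the request and dump the transfer info
$result = curl_exec($curl);
print_r(curl_getinfo($curl));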
Here's the transfer info I receive:
Array
(
    [url] => http://loveit.com/loves/P0D1jlFaIOzzZfZqj_bY3KV
    [content_type] => text/html; charset=utf-8
    [http_code] => 404
    [header_size] => 667
    [request_size] => 172
    [filetime] => -1
    [ssl_verify_result] => 0
    [redirect_count] => 0
    [total_time] => 0.320466
    [namelookup_time] => 0.000326
    [connect_time] => 0.119046
    [pretransfer_time] => 0.119089
    [size_upload] => 0
    [size_download] => 499
    [speed_download] => 1557
    [speed_upload] => 0
    [download_content_length] => 499
    [upload_content_length] => 0
    [starttransfer_time] => 0.320438
    [redirect_time] => 0
    [certinfo] => Array
        (
        )

    [primary_ip] => ---
    [primary_port] => 80
    [local_ip] => ---
    [local_port] => 53837
    [redirect_url] =>
)
I read that some websites have protections against this kind of script, and I tried some of the proposed solutions, but none worked for me (CURLOPT_USERAGENT, CURLOPT_REFERER...).
Any idea what's happening here?
I would like to back up my LoveIt account, which is why I'm doing this (there is no export function, and LoveIt.com has not replied about the health of the website).
I quickly checked that page with LiveHeaders enabled and noticed a bunch of cookies being set. I suspect that, since it's not a "normal" URL, you need to hand those cookies back while being redirected, otherwise you end up being kicked out with a 404. Use CURLOPT_COOKIEJAR with your cURL instance from the start. See: http://php.net/manual/pl/function.curl-setopt.php
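A minimal sketch of that suggestion, assuming the site only needs its own cookies sent back. The helper names `cookie_options` and `fetch_with_cookie_jar` are hypothetical, not part of cURL or of the original post. The CURLOPT_USERAGENT and CURLOPT_REFERER settings from the question can be added to the same array.

```php
<?php
// Hypothetical helper (not from the original post): options for a GET request
// that stores and replays cookies through one file.
function cookie_options(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        // Load cookies from this file if it exists; a missing file is not an error.
        CURLOPT_COOKIEFILE     => $cookieFile,
        // Save the cookies received during the transfer when the handle is released.
        CURLOPT_COOKIEJAR      => $cookieFile,
    ];
}

// Hypothetical wrapper: performs the request and returns [HTTP status, body].
function fetch_with_cookie_jar(string $url, string $cookieFile): array
{
    $curl = curl_init();
    curl_setopt_array($curl, cookie_options($url, $cookieFile));
    $body = curl_exec($curl);
    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    return [$status, $body];
}
```

Calling `fetch_with_cookie_jar()` with the URL from the question and a writable file path would show whether the status still comes back as 404 once cookies are handled.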
I just had a similar issue with a site. In my case they were expecting a User-Agent to be set, so anyone with this issue in the future should also check that.
You don't need to save the cookie file via Chrome.
You can create a function that fetches the cookies, and then reuse them.
Like:
<?php
error_reporting(E_ALL);
class Crawler {
public $cookie;
public $http_response;
public $user_agent;
function __construct($cookie){
$this->cookie = (string) $cookie;
$this->user_agent = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0';
}
function get($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_NOBODY, 1);
curl_setopt($ch, CURLOPT_USERAGENT, $this->user_agent);
// Here we create the file with cookies
curl_setopt($ch, CURLOPT_COOKIEJAR, $this->cookie);
$this->http_response = curl_exec($ch);
}
function get_with_cookies($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
// Return the page body instead of skipping it, so http_response holds the content
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $this->user_agent);
curl_setopt($ch, CURLOPT_COOKIEJAR, $this->cookie);
// Here we re-use the cookie file and keep saving new cookies to it
curl_setopt($ch, CURLOPT_COOKIEFILE, $this->cookie);
$this->http_response = curl_exec($ch);
}
}
$crawler = new Crawler('cookie_file_name');
// Creating cookie file
$crawler->get('uri');
// Request with the cookies
$crawler->get_with_cookies('uri');
Regards.
Thanks for your answer. I visited the page and saved the cookies in a cookies.txt file (with the Chrome extension cookie.txt export), which I pass not to CURLOPT_COOKIEJAR but to CURLOPT_COOKIEFILE:
$cookiefile = './cookie.txt';
curl_setopt($curl, CURLOPT_COOKIEFILE, $cookiefile);
And now it works! Thanks for your feedback, it was really useful.
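If a request still fails with an exported file, it can help to check that the file actually contains cookies for the site. The sketch below assumes the export uses the Netscape cookie file format (seven tab-separated fields per line), which is one of the formats CURLOPT_COOKIEFILE reads. The function name `cookie_names_for_domain` is hypothetical.

```php
<?php
// Hypothetical helper: lists the cookie names in Netscape-format cookie file
// contents whose domain field matches $domain (with or without a leading dot).
function cookie_names_for_domain(string $contents, string $domain): array
{
    $names = [];
    foreach (preg_split('/\R/', $contents) as $line) {
        // HttpOnly cookies are stored with this prefix; strip it before parsing.
        if (strpos($line, '#HttpOnly_') === 0) {
            $line = substr($line, strlen('#HttpOnly_'));
        } elseif ($line === '' || $line[0] === '#') {
            continue; // blank line or comment
        }
        // Fields: domain, subdomain flag, path, secure flag, expiry, name, value.
        $fields = explode("\t", $line);
        if (count($fields) !== 7) {
            continue; // not a cookie line
        }
        if (ltrim($fields[0], '.') === ltrim($domain, '.')) {
            $names[] = $fields[5];
        }
    }
    return $names;
}
```

For example, `print_r(cookie_names_for_domain(file_get_contents('./cookie.txt'), 'loveit.com'));` lists the cookies cURL would have available for that domain. An empty list suggests the export did not capture them.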