cURL returns 404 while the page is found in browser

There are already similar questions on Stack Overflow, but none of their solutions have worked for me. I'm trying to fetch a page on LoveIt.com with cURL, but it returns a 404 error, while the URL works fine in the browser:

        $url = 'http://loveit.com/loves/P0D1jlFaIOzzZfZqj_bY3KV';

        $curl = curl_init();
        curl_setopt($curl, CURLOPT_URL, $url);
        curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
        curl_setopt($curl, CURLOPT_HEADER, false);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($curl, CURLOPT_REFERER, 'http://loveit.com/');

        // Run the request, then dump the transfer info
        $result = curl_exec($curl);
        print_r(curl_getinfo($curl));

Here's the transfer info I receive:

        Array
        (
            [url] => http://loveit.com/loves/P0D1jlFaIOzzZfZqj_bY3KV
            [content_type] => text/html; charset=utf-8
            [http_code] => 404
            [header_size] => 667
            [request_size] => 172
            [filetime] => -1
            [ssl_verify_result] => 0
            [redirect_count] => 0
            [total_time] => 0.320466
            [namelookup_time] => 0.000326
            [connect_time] => 0.119046
            [pretransfer_time] => 0.119089
            [size_upload] => 0
            [size_download] => 499
            [speed_download] => 1557
            [speed_upload] => 0
            [download_content_length] => 499
            [upload_content_length] => 0
            [starttransfer_time] => 0.320438
            [redirect_time] => 0
            [certinfo] => Array ( )
            [primary_ip] => ---
            [primary_port] => 80
            [local_ip] => ---
            [local_port] => 53837
            [redirect_url] =>
        )

I read that some websites have protections against this kind of script, and I tested some of the proposed solutions (CURLOPT_USERAGENT, CURLOPT_REFERER, ...), but none worked for me.

Any idea what's happening here?

I would like to back up my LoveIt account, which is why I'm doing this (there is no export function, and LoveIt.com has not replied about the health of the website).

I quickly checked that page with LiveHeaders enabled and noticed a bunch of cookies being set. I suspect that, since it's not a "normal" URL, you need to send those cookies back while being redirected; otherwise you end up being kicked out with a 404. Use CURLOPT_COOKIEJAR on your cURL handle from the start. See: http://php.net/manual/pl/function.curl-setopt.php
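As a rough illustration of the cookie suggestion, here is a minimal sketch that enables cURL's cookie engine for a request. The function names `build_cookie_options` and `fetch_with_cookies`, the cookie file path, and the user-agent string are assumptions made for this example, not something confirmed to work against LoveIt.com.

```php
<?php
// Hypothetical helper: builds the options for a cookie-aware request.
function build_cookie_options(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects, carrying cookies along
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (X11; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0',
        CURLOPT_COOKIEFILE     => $cookieFile, // read cookies from here (also turns the cookie engine on)
        CURLOPT_COOKIEJAR      => $cookieFile, // write cookies back here when the handle is released
    ];
}

// Hypothetical helper: performs the request and reports the status code and body.
function fetch_with_cookies(string $url, string $cookieFile): array
{
    $ch = curl_init();
    curl_setopt_array($ch, build_cookie_options($url, $cookieFile));
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    unset($ch); // releasing the handle flushes the cookies to the jar file
    return ['status' => $status, 'body' => $body];
}
```

Calling `fetch_with_cookies($url, './cookie.txt')` twice would let the second request send back whatever cookies the first one received, which is the behaviour the answer is hinting at.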

I just had a similar issue with a site. In my case, they were expecting a USER_AGENT to be set, so anyone who runs into this issue in the future should check that as well.

You don't need to save the cookie file via Chrome.

You can create a function to get this cookie, and then reuse it.

Like:

<?php

error_reporting(E_ALL);

Class Crawler{

   var $cookie;
   var $http_response;
   var $user_agent;

   function __construct($cookie){
       $this->cookie     = (string) $cookie;
       $this->user_agent = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0'; 
   }

   function get($url){
       $ch = curl_init();
       curl_setopt($ch, CURLOPT_URL, $url);
       // HEAD request: only the response cookies are needed here
       curl_setopt($ch, CURLOPT_NOBODY, 1);
       curl_setopt($ch, CURLOPT_USERAGENT, $this->user_agent);
       // Here we create the file with cookies
       curl_setopt($ch, CURLOPT_COOKIEJAR, $this->cookie);
       $this->http_response = curl_exec($ch);
   }

   function get_with_cookies($url){
       $ch = curl_init();
       curl_setopt($ch, CURLOPT_URL, $url);
       // Return the response as a string instead of printing it
       curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
       curl_setopt($ch, CURLOPT_USERAGENT, $this->user_agent);
       // Keep saving any new cookies to the same file
       curl_setopt($ch, CURLOPT_COOKIEJAR, $this->cookie);

       // Re-use the cookies saved by the earlier request
       curl_setopt($ch, CURLOPT_COOKIEFILE, $this->cookie);
       $this->http_response = curl_exec($ch);
   }
}

$crawler = new Crawler('cookie_file_name');
// Creating cookie file
$crawler->get('uri');
// Request with the cookies
$crawler->get_with_cookies('uri');

Regards.

Thanks for your answer. I visited the page and saved the cookies to a cookies.txt file (with the Chrome extension cookie.txt export), which I pass NOT to CURLOPT_COOKIEJAR but to the CURLOPT_COOKIEFILE option.

$cookiefile = './cookie.txt';

curl_setopt($curl, CURLOPT_COOKIEFILE, $cookiefile);

And now it works! Thanks for your feedback, it was really useful.
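If an exported cookie file does not seem to have any effect, it can help to check that it is in the tab-separated Netscape format that CURLOPT_COOKIEFILE reads (domain, subdomain flag, path, secure flag, expiry, name, value). The sketch below parses such content into an array for inspection. The function name `parse_cookie_file_contents` and the sample domain are assumptions for illustration only.

```php
<?php
// Hypothetical helper: parses Netscape-format cookie file content into an array.
function parse_cookie_file_contents(string $contents): array
{
    $cookies = [];
    foreach (preg_split('/\R/', $contents) as $line) {
        $httpOnly = false;
        if (strncmp($line, '#HttpOnly_', 10) === 0) {
            // A "#HttpOnly_" prefix marks an HttpOnly cookie, not a comment
            $httpOnly = true;
            $line = substr($line, 10);
        } elseif ($line === '' || $line[0] === '#') {
            continue; // skip blank lines and comments
        }
        $fields = explode("\t", $line);
        if (count($fields) !== 7) {
            continue; // not a valid cookie line
        }
        [$domain, $subdomains, $path, $secure, $expires, $name, $value] = $fields;
        $cookies[] = [
            'domain'   => $domain,
            'path'     => $path,
            'secure'   => $secure === 'TRUE',
            'expires'  => (int) $expires,
            'name'     => $name,
            'value'    => $value,
            'httpOnly' => $httpOnly,
        ];
    }
    return $cookies;
}
```

For example, `print_r(parse_cookie_file_contents(file_get_contents('./cookie.txt')));` would list the cookies cURL should pick up. An empty result suggests the export is not in the expected format.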
