
curl returns 404 on valid page

I've got a PHP function that checks a URL to make sure that (a.) there's some kind of server response, and (b.) it's not a 404.

It works just fine on every domain/URL I've tested, with the exception of bostonglobe.com, where it's returning a 404 for valid URLs. I'm guessing it has something to do with their paywall, but my function works fine on nytimes.com and other newspaper sites.

Here's an example URL that returns a 404:

https://www.bostonglobe.com/news/politics/2016/11/17/tom-brady-was-hoping-get-into-politics-might-time/j2X1onOLYc4ff2LpmM5k9I/story.html

What am I doing wrong?

function check_url($url) {
  $userAgent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)';
  $curl = curl_init($url);
  curl_setopt($curl, CURLOPT_NOBODY, true);
  curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
  curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
  // These two calls originally used an undefined $ch instead of $curl
  curl_setopt($curl, CURLOPT_COOKIEJAR, 'cookies.txt');
  curl_setopt($curl, CURLOPT_COOKIEFILE, 'cookies.txt');
  $result = curl_exec($curl);
  if ($result === false) {
    // There was no response
    $message = "No information found for that URL";
  } else {
    // What was the response?
    $statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    if ($statusCode == 404) {
      $message = "No information found for that URL";
    } else {
      $message = "Good";
    }
  }
  return $message;
}

The problem seems to come from your CURLOPT_NOBODY option.

I've tested your code both with and without this line: the HTTP status code is 404 when CURLOPT_NOBODY is set, and 200 when it's not.

The PHP manual says that setting the CURLOPT_NOBODY option changes the request method to HEAD. My guess is that the server hosting bostonglobe.com doesn't support that method.
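One way to work around this, as a minimal sketch rather than a definitive fix: try the cheap HEAD request first, and only repeat the check with a full GET when the HEAD result looks unreliable. The function names `fetch_status`, `head_result_is_suspect`, `message_for_status` and `check_url_with_fallback` are hypothetical, and the choice to also retry on 405 and 501 is an assumption, not something the tests above established.

```php
<?php
// Hypothetical helper: performs one request and returns the HTTP status code,
// or 0 if there was no response at all.
function fetch_status(string $url, bool $headOnly): int
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_NOBODY, $headOnly);     // true => HEAD, false => GET
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);  // keep the body out of the output
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
    curl_setopt($curl, CURLOPT_TIMEOUT, 15);
    $ok = curl_exec($curl);
    return ($ok === false) ? 0 : (int) curl_getinfo($curl, CURLINFO_HTTP_CODE);
}

// Assumption: a 404 from HEAD (as seen on bostonglobe.com), a 405 (Method Not
// Allowed) or a 501 (Not Implemented) is worth re-checking with GET.
function head_result_is_suspect(int $status): bool
{
    return in_array($status, [404, 405, 501], true);
}

// Maps a status code to the same messages the original check_url() returns.
function message_for_status(int $status): string
{
    if ($status === 0 || $status === 404) {
        return "No information found for that URL";
    }
    return "Good";
}

function check_url_with_fallback(string $url): string
{
    $status = fetch_status($url, true);        // HEAD first
    if (head_result_is_suspect($status)) {
        $status = fetch_status($url, false);   // fall back to GET
    }
    return message_for_status($status);
}
```

Calling `check_url_with_fallback($url)` in place of `check_url($url)` keeps the fast path for servers that answer HEAD correctly, at the cost of downloading the page body for the ones that do not.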

I checked this URL with the curl command:

curl --head https://www.bostonglobe.com/news/politics/2016/11/17/tom-brady-was-hoping-get-into-politics-might-time/j2X1onOLYc4ff2LpmM5k9I/story.html

It returned an error (HTTP/1.1 404 Not Found).

I also tried another command, wget. The result was the same.

wget --server-response --spider https://www.bostonglobe.com/news/politics/2016/11/17/tom-brady-was-hoping-get-into-politics-might-time/j2X1onOLYc4ff2LpmM5k9I/story.html

I also checked this case with a web service (an HTTP request generator: http://web-sniffer.net/). The result was the same.
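The same comparison can be scripted in PHP. The following is only a sketch: `status_for_method`, `compare_head_and_get` and `describe_comparison` are hypothetical names, and a status of 0 is used here to mean "no response".

```php
<?php
// Sends one request with the given method ('HEAD' or 'GET') and returns the
// HTTP status code, or 0 if curl got no response.
function status_for_method(string $url, string $method): int
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_NOBODY, $method === 'HEAD'); // NOBODY => HEAD request
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);       // do not echo the body
    curl_setopt($curl, CURLOPT_TIMEOUT, 15);
    $ok = curl_exec($curl);
    return ($ok === false) ? 0 : (int) curl_getinfo($curl, CURLINFO_HTTP_CODE);
}

// Requests the same URL with both methods and returns both status codes.
function compare_head_and_get(string $url): array
{
    return [
        'HEAD' => status_for_method($url, 'HEAD'),
        'GET'  => status_for_method($url, 'GET'),
    ];
}

// Turns the two status codes into a one-line summary.
function describe_comparison(array $codes): string
{
    if ($codes['HEAD'] === $codes['GET']) {
        return "HEAD and GET agree ({$codes['GET']})";
    }
    return "HEAD returned {$codes['HEAD']} but GET returned {$codes['GET']}";
}
```

For example, `echo describe_comparison(compare_head_and_get($url));` prints a one-line summary that makes it easy to spot pages that treat the two methods differently.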

Other URLs on https://www.bostonglobe.com/ do respond to HEAD requests, but I think the article pages (the ones with the .html extension) do not support HEAD requests.

Perhaps the server administrator or a programmer has disabled HEAD requests?

In PHP, for example:

if ($_SERVER["REQUEST_METHOD"] == "HEAD") {
    // respond with a 404, or use header() to redirect
    exit;
}

Or the server software (Apache and others) may be configured to restrict which HTTP request methods are accepted.

One possible purpose, for example, is to reduce server load.
