
Tell search engines that page does not exist

I have checked the logs and found that search engines visit a lot of bogus URLs on my website. Most of them likely date from before a lot of the links were changed, and even though I have set up 301 redirects, some links have been altered in very strange ways and aren't recognized by my .htaccess file.

All requests are handled by index.php. If a response can't be created due to a bad URL, a custom error page is presented instead. In simplified form, index.php looks like this:

try {
  $Request = new Request();
  $Request->respond();
} catch(NoresponseException $e) {
  $Request->presentErrorPage();
}

I just realized that this page returns status 200, telling the bot that the page is valid even though it isn't.

Is it enough to add a 404 header in the catch statement to tell the bots to stop visiting that page?

Like this:

header("HTTP/1.0 404 Not Found");

It looks OK when I test it, but I'm worried that SE bots (and maybe user agents) will get confused.

You're getting there. The idea is correct: you want to give them a 404. However, one small correction: if the client makes the request using HTTP/1.1 and you answer using HTTP/1.0, some clients will get confused.

The way around this is as follows:

header($_SERVER['SERVER_PROTOCOL']." 404 Not Found");
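If you want the fallback to be explicit, you can wrap this in a small helper; `notFoundStatusLine` is a hypothetical name used only for this sketch, not part of any API:

```php
<?php
// Hypothetical helper: builds the 404 status line from the protocol the
// client actually used, falling back to HTTP/1.0 when SERVER_PROTOCOL
// is unavailable (e.g. when running from the CLI).
function notFoundStatusLine(?string $protocol): string
{
    return ($protocol ?: 'HTTP/1.0') . ' 404 Not Found';
}

// In a web context you would send it like this:
// header(notFoundStatusLine($_SERVER['SERVER_PROTOCOL'] ?? null));
```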

The SE bots DO get confused when they see this:

HTTP/1.1 200 OK

<h1>The page you requested does not exist</h1>

Or this:

HTTP/1.1 302 Object moved
Location: /fancy-404-error-page.html

It is explained here:

Returning a code other than 404 or 410 for a non-existent page (or redirecting users to another page, such as the homepage, instead of returning a 404) can be problematic. Firstly, it tells search engines that there's a real page at that URL. As a result, that URL may be crawled and its content indexed. Because of the time Googlebot spends on non-existent pages, your unique URLs may not be discovered as quickly or visited as frequently and your site's crawl coverage may be impacted (also, you probably don't want your site to rank well for the search query File not found ).

Your idea about programmatically sending the 404 header is correct; it tells the search engine that the URL it requested does not exist and that it should not attempt to crawl and index it. Ways to set the response status:

header($_SERVER["SERVER_PROTOCOL"] . " 404 Not Found");

header(":", true, 404);  // this is used to set a header AND modify the http response code
                         // ":" is used as a hack to avoid specifying a real header

http_response_code(404); // PHP >= 5.4
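Putting one of these back into the asker's index.php, the catch block might look like the sketch below. Request and NoresponseException are the asker's own classes; they are stubbed here, under assumed behavior, only so the snippet runs on its own:

```php
<?php
// Stub of the asker's exception type (assumed to extend Exception).
class NoresponseException extends Exception {}

// Stub of the asker's Request class; respond() simulates a bad URL.
class Request
{
    public function respond(): void
    {
        throw new NoresponseException('No response for this URL');
    }

    public function presentErrorPage(): void
    {
        echo '<h1>Page not found</h1>';
    }
}

try {
    $Request = new Request();
    $Request->respond();
} catch (NoresponseException $e) {
    http_response_code(404);       // status line becomes "404 Not Found"
    $Request->presentErrorPage();  // the custom error page is still shown
}
```

The body can still be your friendly error page; what matters to the bots is that the status line now says 404 instead of 200.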

A well-behaved crawler respects the robots.txt at the top level of your site. If you want to exclude crawlers, @SalmanA's response will work. A sample robots.txt file follows:

User-agent: *
Disallow: /foo/*
Disallow: /bar/*
Disallow: /hd1/*

It needs to be readable by everyone. Note that this will not keep users off the pages, only bots that respect robots.txt, which most of them do.
