
How to limit non-google search engine bots crawl rate so they don't push me over an external API request limit?

I'm building an Amazon affiliate website for a client; it uses the Amazon Product API to fetch data from their catalogue.

Amazon has a limit of 1 request per second.

Google allows configuring Googlebot's crawl rate via Webmaster Tools, so there is no issue there.

I need advice on how to handle other search engines' crawl bots. What would be a good way to avoid, as much as possible, exceeding Amazon's API rate limit due to bot crawling?
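For context, a minimal sketch (not part of the original question) of what enforcing that 1-request-per-second ceiling can look like on the application side, assuming a single web server and a lock file shared between PHP processes; the callable you pass in would wrap the actual Product API call:

<?php
// Minimal throttle sketch, assuming a single server: all PHP processes share
// a lock file that stores the timestamp of the last Amazon API call, and each
// caller sleeps until a full second has passed before issuing its own request.
function throttled_amazon_call(callable $apiCall)
{
    $lockFile = sys_get_temp_dir() . '/amazon_api.lock';
    $fp = fopen($lockFile, 'c+');
    flock($fp, LOCK_EX);                        // serialize concurrent callers

    $last = (float) stream_get_contents($fp);   // timestamp of the previous call
    $wait = 1.0 - (microtime(true) - $last);
    if ($wait > 0) {
        usleep((int) round($wait * 1000000));   // wait out the rest of the second
    }

    $result = $apiCall();                       // the real Product API request

    ftruncate($fp, 0);
    rewind($fp);
    fwrite($fp, (string) microtime(true));      // record when this call happened
    flock($fp, LOCK_UN);
    fclose($fp);

    return $result;
}

Combined with caching of the API responses, something like this keeps a burst of bot traffic from translating directly into API calls.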

PHP

If you want to follow the PHP approach, see my answer on php redirect url with og metatag (open graph).
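For illustration only (this is a hedged sketch, not the code from that linked answer; render_cached_page() and render_live_page() are hypothetical helpers), the idea is to inspect the User-Agent header and serve crawlers a pre-rendered page with the Open Graph meta tags, so their visits never reach the Amazon API:

<?php
// Sketch of user-agent based handling: known crawler signatures get a cached,
// pre-rendered page (with og: meta tags) instead of a live Amazon API lookup.
// render_cached_page() and render_live_page() are hypothetical placeholders.
$botSignatures = ['bot', 'crawl', 'spider', 'slurp', 'facebookexternalhit'];
$userAgent = strtolower($_SERVER['HTTP_USER_AGENT'] ?? '');

$isBot = false;
foreach ($botSignatures as $signature) {
    if (strpos($userAgent, $signature) !== false) {
        $isBot = true;
        break;
    }
}

if ($isBot) {
    render_cached_page($_GET['asin'] ?? '');    // static HTML, no API request
} else {
    render_live_page($_GET['asin'] ?? '');      // normal flow, may hit the API
}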

robots.txt

I would go with the robots.txt file as it's fairly simple and saves time. Generally, all bots respect and abide by the rules in this file. Create a file named robots.txt (type: text/plain) with the following rules:

User-agent: * 
Disallow: /path/to/dir/

The asterisk * is a wildcard matching every user agent.

Disallow: /path/to/dir/

The Disallow rule defines the paths you don't want bots to crawl. You can have separate blocks for different user agents:

User-agent: Googlebot
Disallow: /path1/

User-agent: Facebookhit
Disallow: /path2/

The above allows Googlebot to access /path2/ but not /path1/, and Facebookhit the reverse. You can read more here.
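Applied to this case, a hypothetical robots.txt could let Googlebot (whose crawl rate you already cap in Webmaster Tools) reach the product pages while keeping other crawlers off them; /products/ below is a placeholder for whatever paths trigger Amazon Product API calls:

# Hypothetical example: /products/ stands in for the API-backed paths.
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /products/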
