
How to limit non-google search engine bots crawl rate so they don't push me over an external API request limit?

I'm building an Amazon affiliate website for a client; it uses the Amazon Product API to fetch data from their catalogue.

Amazon has a limit of 1 request per second.

Google allows configuring Googlebot's crawl rate via Webmaster Tools, so there is no issue there.

I need advice on how to handle other search engines' crawl bots. What would be a good way to avoid, as much as possible, exceeding Amazon's API rate limit due to bot crawling?
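For context, a minimal sketch (not part of the original question) of what enforcing that 1-request-per-second ceiling can look like on the application side, assuming a single web server and a lock file shared between PHP processes; the callable you pass in would wrap the actual Product API call:

<?php
// Minimal throttle sketch, assuming a single server: all PHP processes share
// a lock file that stores the timestamp of the last Amazon API call, and each
// caller sleeps until a full second has passed before issuing its own request.
function throttled_amazon_call(callable $apiCall)
{
    $lockFile = sys_get_temp_dir() . '/amazon_api.lock';
    $fp = fopen($lockFile, 'c+');
    flock($fp, LOCK_EX);                        // serialize concurrent callers

    $last = (float) stream_get_contents($fp);   // timestamp of the previous call
    $wait = 1.0 - (microtime(true) - $last);
    if ($wait > 0) {
        usleep((int) round($wait * 1000000));   // wait out the rest of the second
    }

    $result = $apiCall();                       // the real Product API request

    ftruncate($fp, 0);
    rewind($fp);
    fwrite($fp, (string) microtime(true));      // record when this call happened
    flock($fp, LOCK_UN);
    fclose($fp);

    return $result;
}

Combined with caching of the API responses, something like this keeps a burst of bot traffic from translating directly into API calls.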

PHP

If you want to follow the PHP approach, see my answer on php redirect url with og metatag (open graph).
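For illustration only (this is a hedged sketch, not the code from that linked answer; render_cached_page() and render_live_page() are hypothetical helpers), the idea is to inspect the User-Agent header and serve crawlers a pre-rendered page with the Open Graph meta tags, so their visits never reach the Amazon API:

<?php
// Sketch of user-agent based handling: known crawler signatures get a cached,
// pre-rendered page (with og: meta tags) instead of a live Amazon API lookup.
// render_cached_page() and render_live_page() are hypothetical placeholders.
$botSignatures = ['bot', 'crawl', 'spider', 'slurp', 'facebookexternalhit'];
$userAgent = strtolower($_SERVER['HTTP_USER_AGENT'] ?? '');

$isBot = false;
foreach ($botSignatures as $signature) {
    if (strpos($userAgent, $signature) !== false) {
        $isBot = true;
        break;
    }
}

if ($isBot) {
    render_cached_page($_GET['asin'] ?? '');    // static HTML, no API request
} else {
    render_live_page($_GET['asin'] ?? '');      // normal flow, may hit the API
}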

robots.txt

I would go with the robots.txt file as it's fairly simple and saves time. Generally, all bots respect and abide by the rules in this file. Create a file named robots.txt (type: text/plain) with the following rules:

User-agent: * 
Disallow: /path/to/dir/

The asterisk * is a wildcard matching every user agent.

Disallow: /path/to/dir/

The Disallow rule defines the paths you don't want bots to crawl. You can have separate blocks for different user agents:

User-agent: Googlebot
Disallow: /path1/

User-agent: Facebookhit
Disallow: /path2/

The above allows Googlebot to access /path2/ but not /path1/, and Facebookhit the reverse. You can read more here.
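Applied to this case, a hypothetical robots.txt could let Googlebot (whose crawl rate you already cap in Webmaster Tools) reach the product pages while keeping other crawlers off them; /products/ below is a placeholder for whatever paths trigger Amazon Product API calls:

# Hypothetical example: /products/ stands in for the API-backed paths.
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /products/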
