
How to hide an aggressive crawler?

I'm planning to crawl a specific site. I have 3000 specific pages that I want to crawl once every few months. I've created a crawler, but I don't want to be banned from the site.

Is there a way to reduce the aggressiveness of the crawler, or hide it in some way, so that it isn't "noticed" and doesn't cause issues for the provider/website that I'm crawling?

A delay is possible, but if I set it to a random 10-30 second delay per page then it will take forever.
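The randomized-delay idea can be sketched roughly as follows. This is an illustrative Python sketch, not code from the question; the crawler name, contact address, and URL list are made-up placeholders:

```python
import random
import time
from urllib.request import Request, urlopen

# Hypothetical list of pages to crawl.
URLS = ["https://example.com/page/%d" % i for i in range(1, 3001)]


def polite_fetch(url, min_delay=10, max_delay=30):
    """Fetch one page with an identifying User-Agent, then pause a
    random interval so requests don't arrive at a fixed, machine-like rhythm."""
    req = Request(url, headers={
        # Identifying the bot (with a contact) is generally considered polite.
        "User-Agent": "MyCrawler/1.0 (contact@example.com)",
    })
    with urlopen(req, timeout=30) as resp:
        body = resp.read()
    time.sleep(random.uniform(min_delay, max_delay))
    return body
```

Note that "forever" is bounded: at an average delay of about 20 seconds, 3000 pages take roughly 60000 seconds, i.e. about 17 hours, which may be tolerable for a job that runs only once every few months.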

Any tips or guidelines for making an acceptable crawler?

One more solution is to use a proxy server provider (like this one, for example) and rotate the IP address every X requests. This particular provider has an API to retrieve IPs on the fly. If you're working in PHP, cURL can easily be used for this purpose.
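The answer speaks of PHP and cURL; as a language-agnostic illustration, here is a minimal Python sketch of the rotate-every-X-requests idea. The proxy addresses, pool size, and rotation interval are all invented for the example, and a real provider's API would supply the pool:

```python
from urllib.request import ProxyHandler, build_opener

# Hypothetical proxy pool; in practice the provider's API returns these.
PROXIES = ["203.0.113.1:8080", "203.0.113.2:8080", "203.0.113.3:8080"]
ROTATE_EVERY = 50  # switch to the next proxy every X requests


def proxy_for(request_index):
    """Pick the proxy for the Nth request, rotating through the pool
    every ROTATE_EVERY requests and wrapping around at the end."""
    return PROXIES[(request_index // ROTATE_EVERY) % len(PROXIES)]


def fetch_via(proxy, url):
    """Fetch a URL, routing HTTP(S) traffic through the given proxy."""
    opener = build_opener(ProxyHandler({
        "http": "http://" + proxy,
        "https": "http://" + proxy,
    }))
    return opener.open(url, timeout=30).read()
```

So request 0-49 would go through the first proxy, 50-99 through the second, and so on, cycling back to the first once the pool is exhausted.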

This technique works in most cases, but it requires a bit more planning and tuning, and you will still face some limitations. These may be time limits or caps on the number of requests per period, which amount to almost the same thing. Or you will need more proxy servers to satisfy your time requirements.

Also read the provider's terms of service attentively. This particular provider doesn't allow you to get banned by Google and certain other sites; otherwise your own account will be banned as well.

"Acceptable" is a relative term. Some site owners have enough processing power and bandwidth that they don't consider scanning 3000 pages per hour "aggressive". Other site owners struggle for bandwidth or processing power and can't keep up with 3000 page reads per day.

If you want to read pages and get their current contents, then you must read the pages. There's no shortcut to that.

