
Python mechanize returns HTTP 429 error

I am trying to automate a task in Python with the mechanize module:

  1. Enter the keyword in a web form, submit the form.
  2. Look for a specific element in the response.

This works for a single run. But when I repeat the task for a list of keywords, I get HTTP Error 429 (Too Many Requests).

I tried the following to work around this:

  1. Adding custom headers (which I captured for that very website using a proxy) so that it looks like a legitimate browser request:

     br = mechanize.Browser()
     # Note: assigning br.addheaders repeatedly would overwrite the previous
     # value each time, so all headers go into one list:
     br.addheaders = [
         ('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'),
         ('Connection', 'keep-alive'),
         ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'),
         ('Upgrade-Insecure-Requests', '1'),
         ('Accept-Encoding', 'gzip, deflate, sdch'),
         ('Accept-Language', 'en-US,en;q=0.8'),
     ]
  2. Since the blocked response was coming on every 5th request, I tried sleeping for 20 seconds after every 5 requests.
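
A common refinement of the fixed 20-second sleep is exponential backoff that also honors a server-sent Retry-After header when one is present. A minimal sketch (the `base` and `cap` values are illustrative assumptions, not anything the site documents):

```python
def backoff_delay(attempt, retry_after=None, base=2.0, cap=120.0):
    """Seconds to wait before retry number `attempt` (0-based).

    Prefers a server-sent Retry-After value when present; otherwise
    doubles the delay on every failed attempt, capped at `cap`.
    """
    if retry_after is not None:
        return min(float(retry_after), cap)
    return min(base * (2 ** attempt), cap)

# Hypothetical usage with mechanize (mechanize.HTTPError carries .code and .hdrs):
#
# import time
# for attempt in range(5):
#     try:
#         response = br.open(url)
#         break
#     except mechanize.HTTPError as e:
#         if e.code != 429:
#             raise
#         time.sleep(backoff_delay(attempt, e.hdrs.get('Retry-After')))
```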

Neither of the two methods worked.

You need to limit the rate of your requests to conform to what the server's configuration permits. (Web Scraper: Limit to Requests Per Minute/Hour on Single Domain? may show how to find the permitted rate.)
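
A minimal way to hold a scraper to a fixed budget, say N requests per minute, is to enforce a minimum interval between successive opens. The per-minute figure below is an assumed example, not the site's real limit:

```python
import time

class RatePacer:
    """Blocks so that successive .wait() calls are at least `interval` apart."""

    def __init__(self, per_minute=10):     # assumed budget: 10 requests/minute
        self.interval = 60.0 / per_minute  # -> one request every 6 seconds
        self.last = float('-inf')          # first call never waits

    def wait(self):
        now = time.monotonic()
        remaining = self.last + self.interval - now
        if remaining > 0:
            time.sleep(remaining)
        self.last = time.monotonic()

# pacer = RatePacer(per_minute=10)
# for keyword in keywords:
#     pacer.wait()           # stalls here whenever we are ahead of budget
#     br.open(search_url)    # then do the mechanize work as before
```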

mechanize uses a heavily-patched fork of urllib2 (Lib/site-packages/mechanize/_urllib2.py) for network operations, and its Browser class is a descendant of its _urllib2_fork.OpenerDirector.

So, the simplest way to patch its logic seems to be to add a handler to your Browser object:

  • with default_open and an appropriate handler_order to place it before every other handler (lower means higher priority);
  • that stalls until the request is eligible, e.g. with a token bucket or leaky bucket algorithm, such as the one implemented in Throttling with urllib2. Note that the bucket should probably be per-domain or per-IP;
  • and that finally returns None to push the request on to the following handlers.
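
Putting the three bullets together, such a handler might look like the sketch below. To keep the snippet self-contained it is duck-typed rather than derived from mechanize.BaseHandler (subclassing BaseHandler is the more conventional route), and the rate/capacity numbers are assumptions to be tuned per domain or per IP:

```python
import time

class ThrottleHandler:
    """Token-bucket throttle for a mechanize Browser.

    Duck-typed stand-in for a mechanize.BaseHandler subclass: it provides
    default_open, handler_order, and the add_parent/close hooks that
    OpenerDirector.add_handler expects.
    """
    handler_order = 100              # low value = runs before the real openers

    def __init__(self, rate=0.2, capacity=5.0):
        self.rate = rate             # tokens refilled per second (~12 requests/min)
        self.capacity = capacity     # maximum burst size
        self.tokens = capacity
        self.stamp = time.monotonic()

    def add_parent(self, parent):    # called by OpenerDirector.add_handler
        self.parent = parent

    def close(self):
        self.parent = None

    def default_open(self, request):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.stamp) * self.rate)
        self.stamp = now
        if self.tokens < 1.0:        # bucket empty: stall until a token refills
            time.sleep((1.0 - self.tokens) / self.rate)
            self.tokens = 1.0
        self.tokens -= 1.0
        return None   # None pushes the request on to the following handlers

# br = mechanize.Browser()
# br.add_handler(ThrottleHandler())  # ideally one instance per domain
```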

Since this is a common need, you should probably publish your implementation as an installable package.
