
Python mechanize returns HTTP 429 error

I am trying to automate a task in Python using the mechanize module:

  1. Enter the keyword in a web form, submit the form.
  2. Look for a specific element in the response.

This works once. But when I repeat the task for a list of keywords, I get HTTP Error 429 (Too Many Requests).

I tried the following to work around this:

  1. Adding custom headers (I captured them for that specific website using a proxy) so that it looks like a legitimate browser request.

     import mechanize

     br = mechanize.Browser()
     # Set all headers in one list: assigning br.addheaders repeatedly
     # replaces the list each time, so only the last header would be sent.
     br.addheaders = [
         ('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'),
         ('Connection', 'keep-alive'),
         ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'),
         ('Upgrade-Insecure-Requests', '1'),
         ('Accept-Encoding', 'gzip, deflate, sdch'),
         ('Accept-Language', 'en-US,en;q=0.8'),
     ]

  2. Since the blocked response came on every 5th request, I tried sleeping for 20 seconds after every 5 requests.

Neither of the two methods worked.

You need to limit the rate of your requests to what the server's configuration permits. (The question "Web Scraper: Limit to Requests Per Minute/Hour on Single Domain?" may show how to find the permitted rate.)

mechanize uses a heavily patched version of urllib2 (Lib/site-packages/mechanize/_urllib2.py) for network operations, and its Browser class is a descendant of its _urllib2_fork.OpenerDirector.

So the simplest way to patch its logic seems to be to add a handler to your Browser object:

  • with a default_open method and a handler_order low enough to place it before all other handlers (a lower value means higher priority);
  • that stalls until the request is eligible, e.g. using a token bucket or leaky bucket algorithm such as the one implemented in "Throttling with urllib2". Note that a bucket should probably be per-domain or per-IP;
  • and that finally returns None to pass the request on to the following handlers.
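
The three points above can be sketched roughly as follows. This is a minimal illustration, not a tested drop-in: the names TokenBucket and ThrottleHandler are hypothetical, the limit of 4 requests per 20 seconds is only a guess based on the numbers in the question, and it assumes mechanize exposes BaseHandler and that its Request objects provide get_host(), as in urllib2. The import is guarded only so that the bucket logic can be tried without mechanize installed.

```python
import time

try:
    import mechanize  # third-party; needed only for the Browser wiring
    _BaseHandler = mechanize.BaseHandler
except ImportError:
    mechanize = None
    _BaseHandler = object


class TokenBucket(object):
    """Allow about `capacity` requests per `period` seconds (hypothetical helper)."""

    def __init__(self, capacity, period, clock=time.time, sleep=time.sleep):
        self.capacity = float(capacity)
        self.rate = self.capacity / period  # tokens regained per second
        self.tokens = self.capacity         # start with a full bucket
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def consume(self):
        """Block until one token is available, then take it."""
        now = self.clock()
        # Refill in proportion to the time elapsed, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1.0:
            # Stall for as long as it takes to regain one whole token.
            self.sleep((1.0 - self.tokens) / self.rate)
            self.last = self.clock()
            self.tokens = 1.0
        self.tokens -= 1.0


class ThrottleHandler(_BaseHandler):
    """Stalls each request until its host's bucket has a token."""

    handler_order = 0  # lower runs earlier; urllib2's BaseHandler default is 500

    def __init__(self, capacity, period):
        self.capacity = capacity
        self.period = period
        self.buckets = {}  # one bucket per host

    def default_open(self, req):
        host = req.get_host()
        bucket = self.buckets.get(host)
        if bucket is None:
            bucket = self.buckets[host] = TokenBucket(self.capacity, self.period)
        bucket.consume()
        return None  # hand the request on to the following handlers
```

With mechanize installed, the handler would be attached with something like br = mechanize.Browser() followed by br.add_handler(ThrottleHandler(capacity=4, period=20.0)). The right numbers depend entirely on the server, so start conservatively and adjust.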

Since this is a common need, you should probably publish your implementation as an installable package.
