I am trying to automate a task in Python using the mechanize module:

This works for a single run. But when I repeat the task for a list of keywords, I start getting HTTP Error 429 (Too Many Requests).
I tried the following to work around this:

Adding custom headers (I noted them down for that specific website by capturing its traffic through a proxy) so that the request looks like a legitimate browser request:
```python
br = mechanize.Browser()
# all headers go in one list -- repeated `br.addheaders = [...]`
# assignments would overwrite each other, keeping only the last header
br.addheaders = [
    ('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'),
    ('Connection', 'keep-alive'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'),
    ('Upgrade-Insecure-Requests', '1'),
    ('Accept-Encoding', 'gzip, deflate, sdch'),
    ('Accept-Language', 'en-US,en;q=0.8'),
]
```
Since the blocked response was coming on every 5th request, I tried sleeping for 20 seconds after every 5 requests.
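The batching attempt can be sketched roughly like this (`fetch_keyword` is a hypothetical stand-in for the actual mechanize request; the real code would call `br.open(...)` for each keyword):

```python
import time

def fetch_all(keywords, fetch_keyword, batch=5, pause=20):
    """Process each keyword, sleeping `pause` seconds after every `batch` requests."""
    results = []
    for i, kw in enumerate(keywords, start=1):
        results.append(fetch_keyword(kw))
        if i % batch == 0:
            time.sleep(pause)  # back off before the next batch
    return results
```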
Neither of the two methods worked.
You need to limit the rate of your requests to what the server's configuration permits. (Web Scraper: Limit to Requests Per Minute/Hour on Single Domain? may help you discover the permitted rate.)
mechanize uses a heavily patched version of urllib2 (Lib/site-packages/mechanize/_urllib2.py) for network operations, and its Browser class is a descendant of its _urllib2_fork.OpenerDirector.
So, the simplest way to patch its logic seems to be to add a handler to your Browser object with:

- a default_open method, and
- an appropriate handler_order to place it before all the other handlers (lower means higher priority).

Have default_open return None to push the request on to the following handlers.

Since this is a common need, you should probably publish your implementation as an installable package.
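A minimal sketch of such a throttling handler, under the assumptions above. urllib2-style handlers are duck-typed: the opener only needs `add_parent`/`close` plumbing plus a `*_open` method, so the class below defines those itself and avoids importing mechanize; in real code you would more likely subclass `mechanize.BaseHandler`, which provides the plumbing for free. The class name and `min_interval` parameter are my own invention, not a mechanize API:

```python
import time

class ThrottleHandler:
    """Rate-limits requests: delays so that consecutive requests are at
    least `min_interval` seconds apart, then passes the request on."""

    handler_order = 100  # well below the default 500, so this runs first

    def __init__(self, min_interval=5.0):
        self.min_interval = min_interval
        self._last = 0.0  # timestamp of the previous request

    def default_open(self, request):
        wait = self._last + self.min_interval - time.time()
        if wait > 0:
            time.sleep(wait)
        self._last = time.time()
        return None  # None -> hand the request to the next handler

    # plumbing that mechanize.BaseHandler would normally provide
    def add_parent(self, parent):
        self.parent = parent

    def close(self):
        self.parent = None

# wiring (assumes mechanize is installed):
# br = mechanize.Browser()
# br.add_handler(ThrottleHandler(min_interval=5.0))
```

Because `default_open` always returns None, every request still reaches the real network handlers; the handler's only effect is the enforced pause between them.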