I have ~80,000 URLs and I'd like to get the response status code for each of them as fast as possible. I've tried HEAD and GET requests using the requests Python library, but it's too slow for my goal: by my calculations it would take more than 10 hours. Another approach I've found is tornado. I've tested it on 500 URLs (please take a look at the code). It ran fast, but a huge number of the response codes are 599. That's strange: I opened the URLs that map to 599 in a browser (a simple GET request) and confirmed that they are fine. How can I solve this problem?
    from urlparse import urlparse
    from threading import Thread
    import httplib, sys
    from Queue import Queue
    from tornado import ioloop, httpclient, gen
    import tornado
    from time import sleep

    i = 0
    good = 0

    def handle_request(response):
        global good
        if response.code != 200:
            print response.code, response.reason, response.request.url
        else:
            good += 1
            print 'KKKKKKKKKKK: ', good, '[%s]' % response.request.url
        global i
        i -= 1
        if i == 0 or i < 0:
            ioloop.IOLoop.instance().stop()

    http_client = httpclient.AsyncHTTPClient()
    lis = []
    for url in open('urls'):
        lis.append(url.strip())

    specific_domain = '...'
    for l in lis[:500]:
        i += 1
        method = 'GET' if specific_domain in l else 'HEAD'
        req = tornado.httpclient.HTTPRequest(l, method=method, request_timeout=30.0)
        http_client.fetch(req, handle_request)
    ioloop.IOLoop.instance().start()
599 is the response code Tornado generates for an internal timeout. In this case, most of the requests are probably timing out in the queue while waiting for a slot. You can either increase the timeouts (pass request_timeout when making the request) or manage your own queue to feed requests into AsyncHTTPClient only as fast as they can be handled. The latter is normally recommended for large crawling jobs, as it lets you make your own decisions about prioritization and fairness across different hosts. For an example with a queue, see my answer in "tornado: AsyncHttpClient.fetch from an iterator?"
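As a rough illustration of the "manage your own queue" idea, here is a minimal sketch, not the code from the linked answer. It assumes Python 3 and, for the Tornado part, Tornado 5 or later (where Tornado runs on the asyncio event loop), whereas the code in the question is Python 2. The names check_all, fetch_status and tornado_status are made up for this example. A fixed number of workers pull URLs from a queue, so no request sits waiting for a slot inside AsyncHTTPClient while its timeout runs.

```python
import asyncio


async def check_all(urls, fetch_status, concurrency=50):
    """Call fetch_status(url) for every URL, at most `concurrency` at a time."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = {}

    async def worker():
        # Each worker takes the next URL only after finishing the previous
        # one, so the number of requests in flight never exceeds the limit.
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            try:
                results[url] = await fetch_status(url)
            except Exception as exc:
                # Record unexpected failures instead of stopping the crawl.
                results[url] = repr(exc)

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return results


async def tornado_status(url):
    """Hypothetical fetcher: return the status code for one URL via Tornado."""
    # Imported here so the scheduling logic above needs only the stdlib.
    from tornado.httpclient import AsyncHTTPClient, HTTPError, HTTPRequest

    request = HTTPRequest(url, method="HEAD", request_timeout=30.0)
    try:
        # raise_error=False returns non-2xx responses instead of raising.
        response = await AsyncHTTPClient().fetch(request, raise_error=False)
    except HTTPError as exc:
        # Timeouts and closed connections still surface as HTTPError (599).
        return exc.code
    return response.code
```

Under those assumptions it could be run as asyncio.run(check_all(urls, tornado_status, concurrency=50)). The concurrency value is a guess to tune, and a real crawler would probably also want per-host limits, which is the kind of fairness decision the answer refers to.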