
Get tens of thousands of HTTP response codes

I have ~80,000 URLs and I'd like to get the response status codes for them as fast as possible. I've tried HEAD and GET requests using the requests library for Python, but it's too slow for my goal: by my calculations it would take more than 10 hours. Another approach I've found is to use tornado. I tested it on 500 URLs (please take a look at the code below). It did the work quickly, but a huge number of the response codes are 599. That's strange, because when I checked the URLs that map to 599 in a browser (a simple GET request), they loaded fine. How can I solve this problem?

from tornado import ioloop, httpclient

i = 0     # number of requests still in flight
good = 0  # number of responses with status 200


def handle_request(response):
    # Called by Tornado once per request, on success or failure.
    global i, good
    if response.code != 200:
        print response.code, response.reason, response.request.url
    else:
        good += 1
        print 'KKKKKKKKKKK: ', good, '[%s]' % response.request.url
    i -= 1
    # Stop the loop once every scheduled request has reported back.
    if i <= 0:
        ioloop.IOLoop.instance().stop()


http_client = httpclient.AsyncHTTPClient()

# One URL per line in the file 'urls'.
with open('urls') as f:
    lis = [line.strip() for line in f]

specific_domain = '...'
for l in lis[:500]:
    i += 1
    # Use GET for one particular domain and HEAD for everything else.
    method = 'GET' if specific_domain in l else 'HEAD'
    req = httpclient.HTTPRequest(l, method=method, request_timeout=30.0)
    # All 500 requests are handed to the client at once.
    http_client.fetch(req, handle_request)

ioloop.IOLoop.instance().start()

599 is the response code Tornado generates for an internal timeout. In this case most of the requests are probably timing out in the queue while waiting for a slot. You can either increase the timeouts (pass request_timeout when making the request) or manage your own queue and feed requests into AsyncHTTPClient only as fast as they can be handled. The second option is normally recommended for large crawling jobs, because it lets you make your own decisions about prioritization and fairness across different hosts. For an example with a queue, see my answer to the question "tornado: AsyncHttpClient.fetch from an iterator?"
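The "manage your own queue" idea can be sketched roughly as follows. This is not the code from the linked answer; it is a minimal illustration that assumes Python 3 and a recent Tornado release running on asyncio. The names check_all, tornado_fetch_status and CONCURRENCY are made up for this example, and the value 10 is chosen on the assumption that AsyncHTTPClient keeps its default max_clients of 10, so that no request sits waiting inside the client's own queue.

```python
import asyncio

# Assumed value: matches Tornado's default max_clients, so every request
# handed to AsyncHTTPClient can start immediately instead of queueing.
CONCURRENCY = 10


async def tornado_fetch_status(url):
    """Return the status code for url (hypothetical helper for this sketch)."""
    # Imported lazily so the worker pool below can be tried without Tornado.
    from tornado.httpclient import AsyncHTTPClient, HTTPRequest

    request = HTTPRequest(url, method='HEAD', request_timeout=30.0)
    # raise_error=False returns the response object (including code 599)
    # instead of raising an exception.
    response = await AsyncHTTPClient().fetch(request, raise_error=False)
    return response.code


async def check_all(urls, fetch_status, concurrency=CONCURRENCY):
    """Fetch a status code for every URL with at most `concurrency` in flight."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = {}

    async def worker():
        # Each worker pulls the next URL only after finishing the previous
        # one, so the number of open requests never exceeds the worker count.
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            results[url] = await fetch_status(url)

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return results
```

With the URLs loaded into a list, asyncio.run(check_all(urls, tornado_fetch_status)) would return a dict mapping each URL to its status code. Passing the fetch function in as a parameter is only a convenience for trying the pool with a stand-in; raising the concurrency beyond 10 would presumably also require raising max_clients on the client.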
