
Scrapy TCP connection timed out issue in python

I have an issue in the "start_requests" function of my Scrapy spider in Python. I am sending requests through a proxy (host and port) to scrape data from another site, but I get:

[scrapy.extensions.logstats] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying http://....../> (failed 2 times): TCP connection timed out: 110: Connection timed out.

My code is:

def get_proxy(self):
    # Fetch one random active proxy as a (proxy, port) row; returns None if none is found.
    self.conn = MySQLdb.connect(
        settings['MYSQL_HOST'],
        settings['MYSQL_USER'],
        settings['MYSQL_PASSWD'],
        settings['MYSQL_DBNAME'],
        charset="utf8", use_unicode=True)
    self.cursor = self.conn.cursor()
    try:
        results = self.cursor.execute(
            "SELECT proxy, port FROM geme_proxies "
            "WHERE is_active = '1' AND is_deleted = '0' "
            "ORDER BY RAND() LIMIT 1")
        if results > 0:
            return self.cursor.fetchone()
        return None
    except Exception as e:
        logger.error('Exception Message: ' + str(e))

def start_requests(self):
    # One proxy is picked here and reused for every URL below.
    proxy_data = self.get_proxy()
    urls = [settings['OBERWIL_NEWS_URL']]
    for url in urls:
        request = scrapy.Request(url=url, callback=self.parse)
        request.meta['proxy'] = 'http://' + proxy_data[0] + ':' + proxy_data[1]
        proxy_user_pass = settings['PROXY_USERNAME'] + ':' + settings['PROXY_PASSWORD']
        # b64encode (unlike encodestring) does not append a trailing newline to the header value.
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode('utf-8')).decode('ascii')
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
        yield request

Please help me to solve this issue.

I don't think this is a robust way to use proxies. Free proxies often die or become unresponsive without warning. Because you pick a single random proxy and use it for all of your URLs, any problem with that one proxy leaves you with this error.
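
To confirm that the chosen proxy is what times out, a quick TCP reachability check can help. This is only a minimal sketch using the standard library; the helper name `proxy_is_reachable` and the 5-second default timeout are assumptions for illustration, not part of Scrapy or of the original code.

```python
import socket


def proxy_is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection raises OSError (timeouts included) when the connection fails.
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `proxy_is_reachable(proxy_data[0], proxy_data[1])` on the row returned by `get_proxy` would show whether the proxy accepts connections at all. It only tests the TCP handshake, so a proxy can pass this check and still reject the credentials or be banned by the target site.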

A better approach would be to use "rotating proxies" instead:

pip install scrapy-rotating-proxies

This lets you rotate proxies transparently, without handling the switching logic yourself. You only need to install the package and then keep the proxy list (file: proxylist.txt) up to date.

Activate it by adding these entries to DOWNLOADER_MIDDLEWARES in settings.py:

'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
'rotating_proxies.middlewares.BanDetectionMiddleware': 620

proxylist.txt:

165.22.50.208:8080
139.180.163.43:3128
14.207.137.192:8080
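
Put together, the relevant part of settings.py might look like the sketch below. It assumes the middlewares come from the scrapy-rotating-proxies package (which is what the `rotating_proxies.middlewares` paths suggest) and that proxylist.txt sits in the directory Scrapy is run from; adjust the path to your project.

```python
# settings.py (sketch)

# Register the two middlewares from the answer with Scrapy's downloader.
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

# Assumption: the proxy file is in the working directory; an absolute path also works.
ROTATING_PROXY_LIST_PATH = 'proxylist.txt'
```

With this in place, the spider no longer needs to set `request.meta['proxy']` itself, since the middleware assigns a proxy to each request.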

Rotating-proxies also has options for switching from a file to a database, along with other useful settings for tuning your crawler to the target website.
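
Since the question already keeps its proxies in a MySQL table, one possible bridge is to turn the query rows into the `host:port` strings used in proxylist.txt. The helper below is a hypothetical sketch (the name `format_proxy_rows` and the sample rows are made up for illustration); it does no database access itself and only formats rows shaped like the result of the question's `SELECT proxy, port ...` query.

```python
def format_proxy_rows(rows):
    """Turn (proxy, port) rows into 'host:port' strings, skipping incomplete rows."""
    entries = []
    for host, port in rows:
        if host and port:
            entries.append('{}:{}'.format(str(host).strip(), str(port).strip()))
    return entries


# Sample rows (illustrative only); ports may come back as int or str depending on the column type.
rows = [('165.22.50.208', 8080), ('139.180.163.43', '3128'), ('', 8080)]
proxy_list = format_proxy_rows(rows)
```

The resulting list could be written to proxylist.txt, one entry per line, or, assuming scrapy-rotating-proxies is the package in use, assigned to its `ROTATING_PROXY_LIST` setting instead of a file path.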
