
Unable to use multiple proxies within Scrapy spider

I've written a script in Python using Scrapy to send a request to a webpage through a proxy, without changing anything in settings.py or DOWNLOADER_MIDDLEWARES. It is working great now. However, the only thing I can't manage is creating a list of proxies, so that if one fails another will be used. How can I tweak this portion, os.environ["http_proxy"] = "http://176.58.125.65:80", to work through a list of proxies one by one, since it supports only one? Any help on this will be highly appreciated.

This is what I've tried so far (the working version):

import scrapy, os
from scrapy.crawler import CrawlerProcess

class ProxyCheckerSpider(scrapy.Spider):
    name = 'lagado'
    start_urls = ['http://www.lagado.com/proxy-test']
    os.environ["http_proxy"] = "http://176.58.125.65:80"  # can't modify this portion to get a list of proxies

    def parse(self, response):
        stat = response.css(".main-panel p::text").extract()[1:3]
        yield {"Proxy-Status":stat}

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(ProxyCheckerSpider)
c.start()

I do not want to change anything in settings.py or create any custom middleware for this purpose. I wish to achieve the same thing (externally) as I did above with a single proxy. Thanks.

You can also set the meta key proxy per-request, to a value like http://some_proxy_server:port or http://username:password@some_proxy_server:port.

From the official docs: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware
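For example, here is a minimal sketch that picks a proxy from a pool for each request; the second proxy address and the spider name are placeholders:

import random
import scrapy
from scrapy.crawler import CrawlerProcess

# Hypothetical pool; replace with real proxy addresses.
PROXIES = [
    'http://176.58.125.65:80',
    'http://another.proxy.example:8080',
]

class ProxyMetaSpider(scrapy.Spider):
    name = 'lagado_meta'

    def start_requests(self):
        # The built-in HttpProxyMiddleware (enabled by default) reads the
        # 'proxy' meta key, so no settings.py changes are needed.
        yield scrapy.Request(
            'http://www.lagado.com/proxy-test',
            meta={'proxy': random.choice(PROXIES)},
        )

    def parse(self, response):
        stat = response.css(".main-panel p::text").extract()[1:3]
        yield {"Proxy-Status": stat}

c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
c.crawl(ProxyMetaSpider)
c.start()

This alone does not retry on failure, though; for that you need a middleware.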

So you need to write your own middleware that would do the following (a rough sketch is shown after the list):

  1. Catch failed responses
  2. If a response failed because of the proxy:
    1. replace the request.meta['proxy'] value with a new proxy IP
    2. reschedule the request
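Here is a rough sketch of such a middleware, assuming the status codes and exception types below count as proxy failures (adjust them to your situation):

import random
from twisted.internet.error import ConnectionRefusedError, TimeoutError

# Hypothetical pool; replace with real proxy addresses.
PROXIES = [
    'http://176.58.125.65:80',
    'http://another.proxy.example:8080',
]

class RotateProxyOnFailureMiddleware(object):

    def process_request(self, request, spider):
        # Make sure every outgoing request carries some proxy.
        request.meta.setdefault('proxy', random.choice(PROXIES))

    def process_response(self, request, response, spider):
        # Assumption: these status codes mean a bad or banned proxy.
        if response.status in (403, 407, 503):
            return self._retry(request)
        return response

    def process_exception(self, request, exception, spider):
        # Network-level failures also trigger a proxy swap.
        if isinstance(exception, (ConnectionRefusedError, TimeoutError)):
            return self._retry(request)

    def _retry(self, request):
        # Returning a Request from process_response/process_exception makes
        # Scrapy reschedule it; dont_filter bypasses the duplicate filter.
        retry_req = request.replace(dont_filter=True)
        retry_req.meta['proxy'] = random.choice(PROXIES)
        return retry_req

The middleware still has to be enabled through DOWNLOADER_MIDDLEWARES, but that setting can go in the CrawlerProcess settings dict instead of settings.py (e.g. {'__main__.RotateProxyOnFailureMiddleware': 750} when the class lives in the script itself).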

Alternatively, you can look into Scrapy extension packages that were already made to solve this: https://github.com/TeamHG-Memex/scrapy-rotating-proxies
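With that package the rotation is configured entirely through settings, which can also go in the CrawlerProcess dict. The snippet below follows the project's README (verify the keys against its current docs); the proxy list is a placeholder:

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'ROTATING_PROXY_LIST': [
        '176.58.125.65:80',
        'another.proxy.example:8080',
    ],
    'DOWNLOADER_MIDDLEWARES': {
        'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
        'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
    },
})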
