
Setting up proxy according to url in Scrapy

I have a list of URLs, some of which are .onion sites and the rest clear-net sites. Is there a way to set up Scrapy so that, depending on the URL, it either uses a dedicated clear-net proxy for normal .com and .net sites, or a SOCKS5 proxy for .onion sites?

from random import choice

def random_dedicate_proxy():
    # Placeholder proxy addresses; replace with real host:port values
    dedicated_ips = [proxy1, proxy2, proxy3]
    dedicated_proxies = [{'http': 'http://' + ip, 'https': 'https://' + ip}
                         for ip in dedicated_ips]
    return choice(dedicated_proxies)

def proxy_selector(url):
    TOR_CLIENT = 'socks5h://127.0.0.1:9050'
    if '.onion' in url:
        proxy = {'http': TOR_CLIENT, 'https': TOR_CLIENT}
    else:
        proxy = random_dedicate_proxy()
    return proxy

def get_urls_from_spreadsheet():
    fname = 'list_of_stuff.csv'
    url_df = pd.read_csv(fname, usecols=['URL'], keep_default_na=False)
    URL = url_df.URL.dropna()
    urls = [clean_url(url) for url in URL if url != '']
    return urls

class BasicSpider(scrapy.Spider):

    name = "basic"
    rotate_user_agent = True
    start_urls = get_urls_from_spreadsheet()


    def parse(self, response):
        item = StatusCehckerItem()
        item['url'] = response.url
        item['status_code'] = response.status
        item['time'] = time.time()
        # setting the proxy here has no effect -- the request has already been made
        response.meta['proxy'] = proxy_selector(response.url)
        return item

When using this code I get a DNSLookupError: DNS lookup failed: no results for hostname lookup: mqqrfjmfu2i73bjq.onion/.

Make sure HTTPPROXY_ENABLED is set to True in the spider settings, and then pick the proxy for each URL in your start_requests method, so the proxy is attached before the request is sent.

class BasicSpider(scrapy.Spider):

    custom_settings = {
        'HTTPPROXY_ENABLED': True  # can also be set in the settings.py file
    }
    name = "basic"
    rotate_user_agent = True

    def start_requests(self):
        urls = get_urls_from_spreadsheet()
        for url in urls:
            # Scrapy's HttpProxyMiddleware expects meta['proxy'] to be a
            # single proxy URL string, not a requests-style dict, so pick
            # the entry matching the request scheme.
            proxies = proxy_selector(url)
            scheme = 'https' if url.startswith('https') else 'http'
            yield scrapy.Request(url=url, callback=self.parse,
                                 meta={'proxy': proxies[scheme]})

    def parse(self, response):
        item = StatusCehckerItem()
        item['url'] = response.url
        item['status_code'] = response.status
        item['time'] = time.time()
        return item
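As a simpler variant, the selector can return the proxy as a single URL string directly, which is the format Scrapy's HttpProxyMiddleware expects in meta['proxy']. A minimal sketch, assuming placeholder proxy addresses and a local Tor client on port 9050 (select_proxy_url and DEDICATED_PROXIES are illustrative names, not from the original code):

```python
from random import choice
from urllib.parse import urlparse

# Placeholder dedicated clear-net proxies; replace with real host:port values
DEDICATED_PROXIES = ['http://proxy1:8080', 'http://proxy2:8080']
TOR_CLIENT = 'socks5h://127.0.0.1:9050'  # local Tor SOCKS port

def select_proxy_url(url):
    """Return a single proxy URL string for the given request URL."""
    host = urlparse(url).hostname or ''
    if host.endswith('.onion'):
        return TOR_CLIENT
    return choice(DEDICATED_PROXIES)
```

Then each request can be built as `scrapy.Request(url, meta={'proxy': select_proxy_url(url)})`. Note that stock Scrapy's HTTP downloader does not speak SOCKS, so the socks5h:// entry may still need a workaround such as routing .onion traffic through a local HTTP-to-SOCKS bridge (e.g. Privoxy) instead.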
