
python: scrapy using proxy IP

I want to use a proxy IP for web scraping with Scrapy. In order to use a proxy, I set the environment variable http_proxy as mentioned in the documentation:

$ export http_proxy=http://proxy:port
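A minimal check from the same shell confirms the variable is actually visible to the Python process that will run Scrapy:

import os

# If the export above took effect, this prints http://proxy:port
print os.environ.get('http_proxy')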

To test whether the change of IP worked, I created a new spider named test:

from scrapy.contrib.spiders import CrawlSpider

class TestSpider(CrawlSpider):
    name = "test"
    allowed_domains = ["whatismyip.com"]
    start_urls = ["http://whatismyip.com"]

    def parse(self, response):
        # Dump the page so we can inspect which IP the site saw.
        print response.body
        with open('check_ip.html', 'wb') as f:
            f.write(response.body)
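I run the spider from inside the Scrapy project in the usual way:

$ scrapy crawl test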

However, check_ip.html does not show the IP specified in the environment variable; it still shows my original IP, the same as before the proxy was set.

What is the problem? Is there an alternative way to check whether I am going through the proxy IP? Or is there any other way to use a proxy IP?

Edit settings.py in your current project and make sure HttpProxyMiddleware is enabled:

DOWNLOADER_MIDDLEWARES = {
    # You need this line in order to scrape through a proxy/proxy list.
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
}
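With the middleware active, the http_proxy environment variable should be honoured. As an alternative, HttpProxyMiddleware also reads a proxy key from request.meta, which lets you set the proxy per request and doubles as a verification method: the sketch below fetches http://httpbin.org/ip, which echoes the requesting IP back as JSON. The proxy address here is a placeholder; substitute your own.

from scrapy.http import Request
from scrapy.spider import BaseSpider

class ProxyCheckSpider(BaseSpider):
    name = "proxy_check"

    def start_requests(self):
        # 'proxy' in request.meta is read by HttpProxyMiddleware and
        # overrides the environment variable for this request only.
        # "http://proxy:port" is a placeholder, not a working proxy.
        yield Request("http://httpbin.org/ip",
                      meta={'proxy': "http://proxy:port"},
                      callback=self.parse)

    def parse(self, response):
        # httpbin returns {"origin": "<outgoing IP>"}; if the proxy is
        # in effect, this should differ from your real address.
        print response.body

Running scrapy crawl proxy_check and comparing the printed origin against your real IP tells you whether traffic is actually going through the proxy.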
