簡體   English   中英

如何通過TOR通過Polipo使用Scrapy連接到https站點?

[英]How to connect to https site with Scrapy via Polipo over TOR?

不能完全確定問題出在哪里。

運行Python 2.7.3和Scrapy 0.16.5

我創建了一個非常簡單的Scrapy Spider,以測試與本地Polipo代理的連接,以便我可以通過TOR發送請求。 我的蜘蛛的基本代碼如下:

from scrapy.spider import BaseSpider

class TorSpider(BaseSpider):
    name = "tor"
    allowed_domains = ["check.torproject.org"]
    start_urls = [
        "https://check.torproject.org"
    ]

    def parse(self, response):
        print response.body

對於我的代理中間件,我定義了:

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = settings.get('HTTP_PROXY')

我在設置文件中的HTTP_PROXY = 'http://localhost:8123'定義為HTTP_PROXY = 'http://localhost:8123'

現在,如果我將起始URL更改為http://check.torproject.org ,則一切正常,沒有問題。

如果我嘗試針對https://check.torproject.org運行,每次都會收到400 Bad Request錯誤(我也嘗試過不同的https://站點,並且所有站點都有相同的問題):

2013-07-23 21:36:18+0100 [scrapy] INFO: Scrapy 0.16.5 started (bot: arachnid)
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, RandomUserAgentMiddleware, ProxyMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled item pipelines: 
2013-07-23 21:36:18+0100 [tor] INFO: Spider opened
2013-07-23 21:36:18+0100 [tor] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying <GET https://check.torproject.org> (failed 1 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying <GET https://check.torproject.org> (failed 2 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Gave up retrying <GET https://check.torproject.org> (failed 3 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Crawled (400) <GET https://check.torproject.org> (referer: None)
2013-07-23 21:36:18+0100 [tor] INFO: Closing spider (finished)

只是要仔細檢查我的TOR / Polipo設置是否有問題,我可以在終端中運行以下curl命令,並進行正常連接: curl --proxy localhost:8123 https://check.torproject.org/

關於這里有什么問題的任何建議嗎?

嘗試

rq.meta['proxy'] = 'http://127.0.0.1:8123'

就我而言,這是可行的

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM