How to connect to https site with Scrapy via Polipo over TOR?
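I'm not entirely sure where the problem lies.
I'm running Python 2.7.3 and Scrapy 0.16.5.
I've created a very simple Scrapy spider to test connecting to my local Polipo proxy, so that I can send requests out via TOR. The basic code of my spider is as follows: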
from scrapy.spider import BaseSpider


class TorSpider(BaseSpider):
    name = "tor"
    allowed_domains = ["check.torproject.org"]
    start_urls = [
        "https://check.torproject.org"
    ]

    def parse(self, response):
        # Print the raw response body (Python 2 print statement)
        print response.body
For my proxy middleware, I have defined:
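from scrapy.conf import settings  # assumed import; the original snippet omits it

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Route every request through the proxy configured in the settings file
        request.meta['proxy'] = settings.get('HTTP_PROXY')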
In my settings file, HTTP_PROXY is defined as HTTP_PROXY = 'http://localhost:8123'.
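To make the wiring concrete, here is a minimal sketch of what the relevant part of the settings file might look like. `HTTP_PROXY` and its value come from the question; the module path `arachnid.middlewares.ProxyMiddleware` and the priority `410` are assumptions (the bot name `arachnid` appears in the log, and the log lists ProxyMiddleware just before RetryMiddleware), not details given by the asker.

```python
# settings.py (sketch)

# Proxy address read by ProxyMiddleware via settings.get('HTTP_PROXY').
HTTP_PROXY = 'http://localhost:8123'

# Register the custom middleware with the downloader.
DOWNLOADER_MIDDLEWARES = {
    # Hypothetical module path and priority; adjust both to the real project.
    'arachnid.middlewares.ProxyMiddleware': 410,
}
```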
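Now, if I change the start URL to http://check.torproject.org, everything works fine with no problems.
If I try to run against https://check.torproject.org, I get a 400 Bad Request error every time (I've also tried other https:// sites, and they all have the same problem):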
2013-07-23 21:36:18+0100 [scrapy] INFO: Scrapy 0.16.5 started (bot: arachnid)
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, RandomUserAgentMiddleware, ProxyMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled item pipelines:
2013-07-23 21:36:18+0100 [tor] INFO: Spider opened
2013-07-23 21:36:18+0100 [tor] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying <GET https://check.torproject.org> (failed 1 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying <GET https://check.torproject.org> (failed 2 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Gave up retrying <GET https://check.torproject.org> (failed 3 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Crawled (400) <GET https://check.torproject.org> (referer: None)
2013-07-23 21:36:18+0100 [tor] INFO: Closing spider (finished)
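Just to double-check that nothing is wrong with my TOR / Polipo setup, I can run the following curl command in a terminal and it connects fine: curl --proxy localhost:8123 https://check.torproject.org/
Any suggestions as to what's wrong here?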
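For background on why the two schemes can behave differently behind the same proxy: a client normally sends a plain HTTP request to the proxy with the full URL as its target, whereas for HTTPS it first asks the proxy to open a tunnel with the CONNECT method. The sketch below only builds those first request lines for comparison. The helper name `first_proxy_line` is made up for illustration, and the thread does not establish whether the Scrapy version in use performs the CONNECT step.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2, as used in the question


def first_proxy_line(url):
    """Return the first request line a client would send to an HTTP proxy."""
    parts = urlparse(url)
    if parts.scheme == "https":
        # HTTPS: ask the proxy to open a raw tunnel to host:port.
        port = parts.port or 443
        return "CONNECT %s:%d HTTP/1.1" % (parts.hostname, port)
    # Plain HTTP: send the request with the absolute URL as its target.
    path = parts.path or "/"
    return "GET %s://%s%s HTTP/1.1" % (parts.scheme, parts.netloc, path)
```

Calling it with the two start URLs from the question shows the difference: the https URL yields a CONNECT line, the http URL a GET line with the absolute URL.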
Try
rq.meta['proxy'] = 'http://127.0.0.1:8123'
In my case, this works.
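Applied to the middleware from the question, the suggestion could look like the sketch below. The answer does not say why the literal address helps, so treat this as something to try rather than a confirmed fix. `PROXY_URL` and `FakeRequest` are names introduced here for illustration; `FakeRequest` is only a stand-in for a Scrapy request so the snippet can be checked without Scrapy installed.

```python
class ProxyMiddleware(object):
    """Downloader middleware that sends every request through one proxy."""

    # The answer's suggestion: a literal IPv4 address instead of 'localhost'.
    PROXY_URL = 'http://127.0.0.1:8123'

    def process_request(self, request, spider):
        # Scrapy reads the proxy to use from request.meta['proxy'].
        request.meta['proxy'] = self.PROXY_URL


class FakeRequest(object):
    """Hypothetical stand-in for a Scrapy request; it only carries url and meta."""

    def __init__(self, url):
        self.url = url
        self.meta = {}


# Quick check outside Scrapy: the middleware sets the proxy on the request.
request = FakeRequest('https://check.torproject.org')
ProxyMiddleware().process_request(request, spider=None)
```

After the last line, `request.meta['proxy']` holds the 127.0.0.1 address, which is the same assignment the answer performs on `rq`.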