Crawl non-latin domain with scrapy
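I need to crawl some websites in the ".рф" domain zone with scrapy. The URLs have this structure: "http://сайтдляпримера.рф" (this URL is not real; it is given only as an example). The sites I am trying to crawl are, of course, reachable from a browser. I tried to start the crawl with the start_urls attribute, for example: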
start_urls = ['http://сайтдляпримера.рф']
and also with the start_requests function:
def start_requests(self):
    return [scrapy.Request("http://сайтдляпримера.рф/", callback=self._test)]
Neither of them worked as expected, and I got the following console messages:
2016-01-01 19:02:01 [scrapy] INFO: Spider opened
2016-01-01 19:02:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-01 19:02:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-01 19:02:01 [scrapy] DEBUG: Retrying <GET http://%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84> (failed 1 times): DNS lookup failed: address '%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84' not found: [Errno -2] Name or service not known.
2016-01-01 19:02:01 [scrapy] DEBUG: Retrying <GET http://%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84> (failed 2 times): DNS lookup failed: address '%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84' not found: [Errno -2] Name or service not known.
2016-01-01 19:02:01 [scrapy] DEBUG: Gave up retrying <GET http://%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84> (failed 3 times): DNS lookup failed: address '%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84' not found: [Errno -2] Name or service not known.
2016-01-01 19:02:01 [scrapy] ERROR: Error downloading <GET http://%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84>: DNS lookup failed: address '%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84' not found: [Errno -2] Name or service not known.
2016-01-01 19:02:01 [scrapy] INFO: Closing spider (finished)
*In case it matters, I need to run scrapy on a Linux-based operating system.
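Is there any solution? If possible, I would like to solve this inside the spider file, because I do not have access to the framework's repository (nothing that handles HTTP requests has been modified there).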
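When dealing with internationalized domain names (IDN), you need to encode the non-ASCII part of the URL with the idna codec. The encoder returns bytes, so decode the result back into a unicode string. Also note that the ASCII substring of the URL that makes up the protocol name ('http://') should be prepended separately, so that it does not get mangled by the idna encoding: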
'http://' + u'сайтдляпримера.рф'.encode('idna').decode('utf-8')
For more details, see also this documentation.
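The one-liner above assumes the URL is just a scheme plus a host. A slightly more general sketch, using only the Python 3 standard library, splits the URL first and applies the idna codec to the hostname alone, so that the path, query and port are left untouched. The helper name to_idna_url is hypothetical and not part of scrapy; userinfo (user:password@) is deliberately not handled here.

```python
from urllib.parse import urlsplit, urlunsplit


def to_idna_url(url):
    """Return `url` with only its hostname converted to the ASCII (punycode) form.

    Hypothetical helper for illustration; not a scrapy API.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # The idna codec encodes each dot-separated label and returns ASCII bytes.
    ascii_host = host.encode("idna").decode("ascii")
    # Re-attach an explicit port if the original URL had one.
    netloc = ascii_host if parts.port is None else "{}:{}".format(ascii_host, parts.port)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    # The example domain from the question; the printed host consists of "xn--" labels.
    print(to_idna_url("http://сайтдляпримера.рф/"))
```

The returned string is plain ASCII, so it should be usable in start_urls or as the first argument to scrapy.Request inside start_requests, which keeps the fix inside the spider file.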