
Crawl non-Latin domain with scrapy

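I need to crawl some websites in the ".рф" domain zone with scrapy. The URLs have a structure like "http://сайтдляпримера.рф" (this URL is not real; it is given only as an example). The sites I am trying to crawl are, of course, reachable from a browser. I tried to start the crawl with the start_urls attribute, for example: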

start_urls = ['http://сайтдляпримера.рф']

and also with the start_requests method:

def start_requests(self):
    return [scrapy.Request("http://сайтдляпримера.рф/", callback=self._test)]

Neither of them worked as expected, and I got the following console messages:

2016-01-01 19:02:01 [scrapy] INFO: Spider opened
2016-01-01 19:02:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-01 19:02:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-01 19:02:01 [scrapy] DEBUG: Retrying <GET http://%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84> (failed 1 times): DNS lookup failed: address '%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84' not found: [Errno -2] Name or service not known.
2016-01-01 19:02:01 [scrapy] DEBUG: Retrying <GET http://%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84> (failed 2 times): DNS lookup failed: address '%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84' not found: [Errno -2] Name or service not known.
2016-01-01 19:02:01 [scrapy] DEBUG: Gave up retrying <GET http://%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84> (failed 3 times): DNS lookup failed: address '%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84' not found: [Errno -2] Name or service not known.
2016-01-01 19:02:01 [scrapy] ERROR: Error downloading <GET http://%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84>: DNS lookup failed: address '%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84' not found: [Errno -2] Name or service not known.
2016-01-01 19:02:01 [scrapy] INFO: Closing spider (finished)

*In case it matters, I need to run scrapy on a Linux-based operating system.

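Is there any solution? If possible, I would like to solve this from within the spider file, because I do not have access to the framework's repository (nothing that handles HTTP requests has been modified there).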

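When dealing with internationalized domain names (IDN), you need to encode the non-ASCII host name with the idna codec, and then decode the resulting bytes back into a unicode string. Also note that the ASCII part of the URL that makes up the protocol name ('http://') should be prepended separately, so that it does not get mangled by the idna encoding: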

'http://' + u'сайтдляпримера.рф'.encode('idna').decode('utf-8')

For more details, see also this documentation.
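
The one-liner above only covers a bare host name. For URLs that also carry a path, query, or port, a small helper along the following lines might be handy. This is only a sketch using the Python 3 standard library; the name `to_idna_url` is made up for illustration and is not part of scrapy, and it assumes the URL has no user:password part.

```python
from urllib.parse import urlsplit, urlunsplit


def to_idna_url(url):
    """Return `url` with its host name converted to ASCII (IDNA/punycode).

    Hypothetical helper, not a scrapy API. Assumes the URL carries no
    credentials (user:password@host); those would be dropped.
    """
    parts = urlsplit(url)
    host = parts.hostname  # host without the port, lowercased
    if host is None:
        return url  # nothing to convert (e.g. a relative URL)

    # Encode only the host; scheme, path and query are left untouched.
    ascii_host = host.encode("idna").decode("ascii")

    # Re-attach the port if one was given.
    netloc = ascii_host
    if parts.port is not None:
        netloc = "{}:{}".format(ascii_host, parts.port)

    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(to_idna_url("http://сайтдляпримера.рф/"))
```

The returned string is plain ASCII, so it can be placed in start_urls or passed to scrapy.Request in start_requests in place of the original Cyrillic URL.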

