How to crawl a local HTML file with Scrapy
I tried to crawl a local HTML file stored on my desktop with the code below, but I get an error during the crawl: "No such file or directory: '/robots.txt'".
Scrapy command:
$ scrapy crawl test -o test01.csv
Scrapy spider:
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = []
    start_urls = ['file:///Users/Name/Desktop/test/test.html']
Error:
2018-11-16 01:57:52 [scrapy.core.engine] INFO: Spider opened
2018-11-16 01:57:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-16 01:57:52 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2018-11-16 01:57:52 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET file:///robots.txt> (failed 1 times): [Errno 2] No such file or directory: '/robots.txt'
2018-11-16 01:57:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET file:///robots.txt> (failed 2 times): [Errno 2] No such file or directory: '/robots.txt'
When crawling locally, I never specify allowed_domains. Try removing that line and see if it works. In your error output, Scrapy is testing the "empty" domain you provided.
To fix the "No such file or directory: '/robots.txt'" error, you can go to the settings.py file and comment out this line:
#ROBOTSTXT_OBEY = True
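Equivalently, you can set the option to False instead of commenting it out. A sketch of the relevant line in the project's settings.py:

```python
# settings.py
# Disable robots.txt checking so Scrapy does not try to fetch
# file:///robots.txt before requesting the local file
ROBOTSTXT_OBEY = False
```

Either form works because ROBOTSTXT_OBEY defaults to False; the setting only takes effect when it is explicitly True.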