Scrapy spider not saving html files
I have a Scrapy spider that I've generated. Its purpose is to return network data for graphing the site's link structure, and also to save the HTML file for each page the spider reaches. The spider achieves the first goal but not the second: it produces a CSV file with the crawl information, but I cannot see that it is saving the HTML files.
# -*- coding: utf-8 -*-
from scrapy.selector import HtmlXPathSelector
from scrapy.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.utils.url import urljoin_rfc

from sitegraph.items import SitegraphItem


class CrawlSpider(CrawlSpider):
    name = "example"
    custom_settings = {
        'DEPTH_LIMIT': '1',
    }
    allowed_domains = []
    start_urls = (
        'http://exampleurl.com',
    )

    rules = (
        Rule(LinkExtractor(allow=r'/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        i = SitegraphItem()
        i['url'] = response.url
        # i['http_status'] = response.status
        llinks = []
        for anchor in hxs.select('//a[@href]'):
            href = anchor.select('@href').extract()[0]
            if not href.lower().startswith("javascript"):
                llinks.append(urljoin_rfc(response.url, href))
        i['linkedurls'] = llinks
        return i

    def parse(self, response):
        filename = response.url.split("/")[-1] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
The traceback I receive is as follows:
Traceback (most recent call last):
File "...\Anaconda3\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2019-07-23 14:16:41 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://externalurl.com/> (failed 3 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2019-07-23 14:16:41 [scrapy.core.scraper] ERROR: Error downloading <GET http://externalurl.com/>
Traceback (most recent call last):
File "...\Anaconda3\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2019-07-23 14:16:41 [scrapy.core.engine] INFO: Closing spider (finished)
2019-07-23 14:16:41 [scrapy.extensions.feedexport] INFO: Stored csv feed (153 items) in: exampledomainlevel1.csv
parse method:

According to the Scrapy docs and another Stack Overflow question, it is not recommended to override the parse method, because CrawlSpider uses it to implement its logic. If you need to override the parse method and at the same time keep CrawlSpider.parse working, you need to include its original source code in your fixed parse method:
    def parse(self, response):
        filename = response.url.split("/")[-1] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)
csv feed:

This log line: 2019-07-23 14:16:41 [scrapy.extensions.feedexport] INFO: Stored csv feed (153 items) in: exampledomainlevel1.csv - means that the csv feedexporter is enabled (probably in the settings.py project settings file).
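For reference, a minimal sketch of how such a feed export might be enabled in a Scrapy 1.x settings.py. The output file name is taken from the log line above; whether the asker's project uses these exact settings is an assumption, since their settings file is not shown:

```python
# settings.py -- hypothetical sketch, not the asker's actual file.
# In Scrapy 1.x the built-in feed exporter is driven by these two settings:
FEED_FORMAT = 'csv'                   # export scraped items as CSV
FEED_URI = 'exampledomainlevel1.csv'  # output path seen in the log above
```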
UPDATE

I looked at the CrawlSpider source code again. It appears that the parse method is called only once, at the start of the crawl, so it does not cover all web responses.

If my theory is correct, adding this function to your spider class should save all HTML responses:
    def _response_downloaded(self, response):
        filename = response.url.split("/")[-1] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)
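One side note, separate from the fix above: deriving the filename with response.url.split("/")[-1] yields an empty string for URLs that end in a slash, so all such responses would be written to the same ".html" file. A small, hypothetical helper (plain Python, no Scrapy needed; the name url_to_filename is my own) sketches one way to guard against that:

```python
# Hypothetical helper -- not part of the answer above.
def url_to_filename(url):
    # The last path segment is empty when the URL ends with "/",
    # e.g. "http://exampleurl.com/" -> "", so fall back to "index".
    name = url.split("/")[-1]
    return (name or "index") + ".html"

print(url_to_filename("http://exampleurl.com/page1"))  # page1.html
print(url_to_filename("http://exampleurl.com/"))       # index.html
```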