
Scrapy spider not saving html files

I have a Scrapy spider that I've generated. The purpose of the spider is to return network data for graphing the network, as well as to return the html files for each page the spider reaches. The spider is achieving the first goal but not the second. It results in a csv file with the tracking information, but I cannot see that it is saving the html files.

# -*- coding: utf-8 -*-
from scrapy.selector import HtmlXPathSelector
from scrapy.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.utils.url import urljoin_rfc
from sitegraph.items import SitegraphItem


class CrawlSpider(CrawlSpider):
    name = "example"
    custom_settings = {
    'DEPTH_LIMIT': '1',
    }
    allowed_domains = []
    start_urls = (
        'http://exampleurl.com',
    )

    rules = (
        Rule(LinkExtractor(allow=r'/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        i = SitegraphItem()
        i['url'] = response.url
        # i['http_status'] = response.status
        llinks=[]
        for anchor in hxs.select('//a[@href]'):
            href=anchor.select('@href').extract()[0]
            if not href.lower().startswith("javascript"):
                llinks.append(urljoin_rfc(response.url,href))
        i['linkedurls'] = llinks
        return i

    def parse(self, response):
        filename = response.url.split("/")[-1] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

The traceback I receive is as follows:

Traceback (most recent call last):
  File "...\Anaconda3\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2019-07-23 14:16:41 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://externalurl.com/> (failed 3 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2019-07-23 14:16:41 [scrapy.core.scraper] ERROR: Error downloading <GET http://externalurl.com/>
Traceback (most recent call last):
  File "...\Anaconda3\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2019-07-23 14:16:41 [scrapy.core.engine] INFO: Closing spider (finished)
2019-07-23 14:16:41 [scrapy.extensions.feedexport] INFO: Stored csv feed (153 items) in: exampledomainlevel1.csv

parse method:

According to the Scrapy docs and another Stack Overflow question, it is not recommended to override the parse method, because CrawlSpider uses it to implement its logic.

If you need to override the parse method and at the same time keep the behaviour of CrawlSpider.parse's original source code, you need to include that original source in your fixed parse method:

def parse(self, response):
    filename = response.url.split("/")[-1] + '.html'
    with open(filename, 'wb') as f:
        f.write(response.body)
    return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

csv feed:
This log line: 2019-07-23 14:16:41 [scrapy.extensions.feedexport] INFO: Stored csv feed (153 items) in: exampledomainlevel1.csv - means that the csv feed exporter is enabled (probably in the settings.py project settings file).
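
As a rough sketch only (the settings below are an assumption based on the log line, not taken from the asker's project), a feed export configuration in settings.py that would produce such a csv could look like this:

# settings.py - hypothetical feed export configuration (pre-Scrapy 2.1 style)
# that would produce the "Stored csv feed ... in: exampledomainlevel1.csv" log line
FEED_FORMAT = 'csv'
FEED_URI = 'exampledomainlevel1.csv'

# Roughly equivalent sketch for Scrapy >= 2.1, which replaces the two settings above:
# FEEDS = {
#     'exampledomainlevel1.csv': {'format': 'csv'},
# }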

UPDATE
I observed the CrawlSpider source code again.
It looks like the parse method is called only once, at the beginning, so it does not cover all web responses.
If my theory is correct, adding this function to your spider class should save all html responses:

def _response_downloaded(self, response):
    filename = response.url.split("/")[-1] + '.html'
    with open(filename, 'wb') as f:
        f.write(response.body)    
    rule = self._rules[response.meta['rule']]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)
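
As an alternative sketch (not part of the original answer): since parse_item is already registered as the Rule callback and is called for every crawled page, the file-writing lines could also be placed there, without touching CrawlSpider's private methods. The snippet assumes the question's SitegraphItem and a reasonably recent Scrapy, using response.xpath(...).getall() and response.urljoin() in place of the deprecated HtmlXPathSelector and urljoin_rfc:

def parse_item(self, response):
    # Save the raw HTML of every response this callback receives
    # (filename derived from the URL, as in the question's code)
    filename = response.url.split("/")[-1] + '.html'
    with open(filename, 'wb') as f:
        f.write(response.body)

    i = SitegraphItem()
    i['url'] = response.url
    i['linkedurls'] = [
        response.urljoin(href)
        for href in response.xpath('//a/@href').getall()
        if not href.lower().startswith("javascript")
    ]
    return i

In both variants the html files are written to the directory the spider is run from.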
