
Scrapy spider not saving html files

I have a Scrapy spider that I generated. The purpose of the spider is to return network data for graphing the network, and also to save an html file for every page the spider reaches. The spider is achieving the first goal but not the second: it produces a csv file with the crawl information, but I cannot see that it is saving any html files.

# -*- coding: utf-8 -*-
from scrapy.selector import HtmlXPathSelector
from scrapy.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.utils.url import urljoin_rfc
from sitegraph.items import SitegraphItem


class CrawlSpider(CrawlSpider):
    name = "example"
    custom_settings = {
    'DEPTH_LIMIT': '1',
    }
    allowed_domains = []
    start_urls = (
        'http://exampleurl.com',
    )

    rules = (
        Rule(LinkExtractor(allow=r'/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        i = SitegraphItem()
        i['url'] = response.url
        # i['http_status'] = response.status
        llinks=[]
        for anchor in hxs.select('//a[@href]'):
            href=anchor.select('@href').extract()[0]
            if not href.lower().startswith("javascript"):
                llinks.append(urljoin_rfc(response.url,href))
        i['linkedurls'] = llinks
        return i

    def parse(self, response):
        filename = response.url.split("/")[-1] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

The traceback I receive is as follows:

Traceback (most recent call last):
  File "...\Anaconda3\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2019-07-23 14:16:41 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://externalurl.com/> (failed 3 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2019-07-23 14:16:41 [scrapy.core.scraper] ERROR: Error downloading <GET http://externalurl.com/>
Traceback (most recent call last):
  File "...\Anaconda3\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2019-07-23 14:16:41 [scrapy.core.engine] INFO: Closing spider (finished)
2019-07-23 14:16:41 [scrapy.extensions.feedexport] INFO: Stored csv feed (153 items) in: exampledomainlevel1.csv

The parse method:

According to the scrapy docs and another Stack Overflow question, overriding the parse method is not recommended, because CrawlSpider uses it to implement its core logic.

If you need to override the parse method and still keep CrawlSpider's crawling behaviour, you have to add back the logic from the original CrawlSpider.parse source code to fix your parse method:

def parse(self, response):
    filename = response.url.split("/")[-1] + '.html'
    with open(filename, 'wb') as f:
        f.write(response.body)
    return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)
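
As a side note (this is my own suggestion, not part of the original code): you can avoid touching parse at all and save the file from inside the Rule callback parse_item, since that callback already receives every response matched by the rule. A minimal sketch, assuming the same SitegraphItem fields:

def parse_item(self, response):
    # Sketch only: write the page body here instead of overriding parse()
    filename = response.url.split("/")[-1] + '.html'
    with open(filename, 'wb') as f:
        f.write(response.body)
    i = SitegraphItem()
    i['url'] = response.url
    # collect outgoing links, skipping javascript: pseudo-links
    i['linkedurls'] = [
        response.urljoin(href)
        for href in response.xpath('//a/@href').extract()
        if not href.lower().startswith('javascript')
    ]
    return i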

The csv feed:
This log line: 2019-07-23 14:16:41 [scrapy.extensions.feedexport] INFO: Stored csv feed (153 items) in: exampledomainlevel1.csv - means that the csv feed exporter is enabled (probably in the settings.py project settings file).
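
For context, in Scrapy versions from that period (before 2.1) the csv export is usually enabled either with the -o command-line option (scrapy crawl example -o exampledomainlevel1.csv) or through feed settings. A hedged sketch of what that configuration might look like (the filename here is only an example):

# settings.py -- one possible way the csv feed gets enabled
# (sketch only; your project may configure it differently)
FEED_FORMAT = 'csv'
FEED_URI = 'exampledomainlevel1.csv'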

UPDATE
I looked at the CrawlSpider source code again.
It looks like the parse method is only called once, at the start of the crawl, and does not cover all of the web responses.
If my theory is correct, adding this function to your spider class should save all html responses:

def _response_downloaded(self, response):
    filename = response.url.split("/")[-1] + '.html'
    with open(filename, 'wb') as f:
        f.write(response.body)    
    rule = self._rules[response.meta['rule']]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)
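
One caveat I would add (not part of the original answer): response.url.split("/")[-1] is empty for URLs that end in a slash, so many pages could be written to the same ".html" file. A minimal sketch of a safer naming scheme, keeping the rest of the override unchanged:

import hashlib

def _response_downloaded(self, response):
    # Sketch: hash the URL so every page gets a unique, filesystem-safe filename
    filename = hashlib.md5(response.url.encode('utf-8')).hexdigest() + '.html'
    with open(filename, 'wb') as f:
        f.write(response.body)
    rule = self._rules[response.meta['rule']]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)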
