[英]Scrapy: store broken external links and discard the rest
我希望Scrapy僅存儲斷開的外部鏈接(響應代碼與200、301或302不同),但是我對此一無所知,腳本始終將每個外部鏈接存儲在輸出文件中。 這就是我正在使用的:
@staticmethod
def remote_file_to_array(url):
    """Fetch a remote plain-text file and return its non-empty lines.

    :param url: URL of a text file with one entry per line.
    :return: list of non-empty lines.
    """
    from contextlib import closing  # local import keeps the snippet self-contained
    # closing() releases the HTTP connection even if read() raises;
    # the original leaked the urlopen handle.
    with closing(urllib2.urlopen(url)) as fh:
        return filter(None, fh.read().splitlines())
@staticmethod
def sitemap_to_array(url):
    """Download a sitemap XML document and return the listed page URLs.

    :param url: URL of a sitemap.xml file.
    :return: list of URLs taken from each entry's 'loc' field.
    """
    from contextlib import closing  # local import keeps the snippet self-contained
    # closing() releases the HTTP connection even if read() raises;
    # the original leaked the urlopen handle.
    with closing(urllib2.urlopen(url)) as fh:
        body = fh.read()
    # Sitemap is scrapy.utils.sitemap.Sitemap; each item maps tag -> text.
    return [item['loc'] for item in Sitemap(body)]
def start_requests(self):
target_domain = self.arg_target_domain
print 'Target domain: ', target_domain
self.rules = (
Rule(LinkExtractor(allow_domains=[target_domain], unique=True),
follow=True),
Rule(LinkExtractor(unique=True),
callback='parse_item',
process_links='clean_links',
follow=False),
)
self._compile_rules()
start_urls = []
if self.arg_start_urls.endswith('.xml'):
print 'Sitemap detected!'
start_urls = self.sitemap_to_array(self.arg_start_urls)
elif self.arg_start_urls.endswith('.txt'):
print 'Remote url list detected!'
start_urls = self.remote_file_to_array(self.arg_start_urls)
else:
start_urls = [self.arg_start_urls]
print 'Start url count: ', len(start_urls)
first_url = start_urls[0]
print 'First url: ', first_url
for url in start_urls:
yield scrapy.Request(url, dont_filter=True)
def clean_links(self, links):
    """Normalize extracted links by stripping URL fragments.

    URLs differing only by fragment would otherwise be crawled as distinct
    pages, so the fragment is dropped both from the Link object and from
    the URL string itself.

    :param links: iterable of link objects with .url and .fragment attributes.
    :yield: the same link objects, fragment removed.
    """
    for link in links:
        link.fragment = ''
        # NOTE(review): this line was truncated in the source
        # (`link.url = link.url.split('`); keeping everything before the
        # first '#' is the conventional reconstruction and matches the
        # fragment-clearing intent of the line above.
        link.url = link.url.split('#')[0]
        yield link
def parse_item(self, response):
    """Emit a BrokenLinksItem recording the URL and HTTP status of a response."""
    broken = BrokenLinksItem()
    broken['status'] = response.status
    broken['url'] = response.url
    yield broken
您需要在Request
對象上傳遞errback
參數,該參數的工作方式類似於callback
但不接受響應狀態。
我不確定rules
是否也可以實現,否則,您需要定義自己的行為
最好的選擇是使用Downloader Middleware記錄所需的響應。
from twisted.internet import defer
from twisted.internet.error import (ConnectError, ConnectionDone, ConnectionLost, ConnectionRefusedError,
                                    DNSLookupError, TCPTimedOutError, TimeoutError,)
from twisted.web._newclient import ResponseFailed
class BrokenLinkMiddleware(object):
    """Downloader middleware that logs broken links.

    A link counts as broken when the response status is not one of the
    healthy codes below, or when the download fails with a network-level
    exception.
    """

    # Status codes that indicate a working link; anything else gets logged.
    ignore_http_status_codes = [200, 301, 302]

    # Download failures that signal an unreachable link. Some of these
    # (ConnectError, TimeoutError, TCPTimedOutError) may be transient rather
    # than a permanently broken link, but are still worth recording.
    # NOTE(review): ResponseFailed must be imported (twisted.web._newclient)
    # or this class body raises NameError at definition time.
    exceptions_to_log = (ConnectError, ConnectionDone, ConnectionLost, ConnectionRefusedError, DNSLookupError, IOError,
                         ResponseFailed, TCPTimedOutError, TimeoutError, defer.TimeoutError)

    def process_response(self, request, response, spider):
        """Log responses whose status marks a broken link, then pass them on.

        Scrapy requires process_response to return a Response (or Request);
        the original only returned inside the `if`, implicitly returning
        None for healthy responses, which violates that contract.
        """
        if response.status not in self.ignore_http_status_codes:
            # Do your logging here: response.url has the url,
            # response.status has the status.
            pass
        return response

    def process_exception(self, request, exception, spider):
        """Log download exceptions that indicate a broken link.

        Returning None lets Scrapy's default exception handling proceed.
        """
        if isinstance(exception, self.exceptions_to_log):
            # Do your logging here
            pass
這處理了一些可能不會表示鏈接斷開的異常(例如ConnectError
, TimeoutError
和TCPTimedOutError
),但是您仍然想記錄它們。
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.