
Scrapy: store broken external links and discard the rest

I want Scrapy to store only the external links that are broken (a response code other than 200, 301 or 302), but I'm stuck with this and the script keeps storing every external link in the output file. This is what I'm using:

@staticmethod
def remote_file_to_array(url):

    return filter(None, urllib2.urlopen(url).read().splitlines())

@staticmethod
def sitemap_to_array(url):
    results = []
    body = urllib2.urlopen(url).read()
    sitemap = Sitemap(body)
    for item in sitemap:
        results.append(item['loc'])
    return results


def start_requests(self):


    target_domain = self.arg_target_domain
    print 'Target domain: ', target_domain


    self.rules = (

        Rule(LinkExtractor(allow_domains=[target_domain], unique=True),
             follow=True),

        Rule(LinkExtractor(unique=True),
             callback='parse_item',
             process_links='clean_links',
             follow=False),
    )
    self._compile_rules()


    start_urls = []
    if self.arg_start_urls.endswith('.xml'):
        print 'Sitemap detected!'
        start_urls = self.sitemap_to_array(self.arg_start_urls)
    elif self.arg_start_urls.endswith('.txt'):
        print 'Remote url list detected!'
        start_urls = self.remote_file_to_array(self.arg_start_urls)
    else: 
        start_urls = [self.arg_start_urls]
    print 'Start url count: ', len(start_urls)
    first_url = start_urls[0]
    print 'First url: ', first_url


    for url in start_urls:


        yield scrapy.Request(url, dont_filter=True)


def clean_links(self, links):
    for link in links:

        link.fragment = ''
        link.url = link.url.split('#')[0]
        yield link


def parse_item(self, response):
    item = BrokenLinksItem()
    item['url'] = response.url
    item['status'] = response.status
    yield item

You need to pass the errback argument on the Request object; it works like callback, but is called for non-accepted response statuses.
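A minimal sketch of that approach, assuming the question's BrokenLinksItem; the spider name, method names and module path are illustrative. Each request carries an errback that records the URL and, when available, the HTTP status of the failed response:

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from myproject.items import BrokenLinksItem  # the item from the question; module path is illustrative

class BrokenLinksSpider(scrapy.Spider):
    name = 'broken_links_sketch'  # illustrative name

    def start_requests(self):
        for url in self.start_urls:
            # errback is called for failed requests and non-accepted response statuses
            yield scrapy.Request(url, callback=self.parse_item,
                                 errback=self.handle_error,
                                 dont_filter=True)

    def parse_item(self, response):
        # Accepted responses end up here; nothing to store for them
        pass

    def handle_error(self, failure):
        item = BrokenLinksItem()
        item['url'] = failure.request.url
        if failure.check(HttpError):
            # HttpError keeps the original response, so the status is available
            item['status'] = failure.value.response.status
        else:
            # DNS errors, timeouts, etc. have no HTTP status
            item['status'] = None
        yield item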

I am not sure whether that can also be achieved with rules; if not, you'll need to define your own behaviour.

Your best bet would be to use a Downloader Middleware to log the desired responses.

from twisted.internet import defer
from twisted.internet.error import (ConnectError, ConnectionDone, ConnectionLost, ConnectionRefusedError,
                                    DNSLookupError, TCPTimedOutError, TimeoutError)
from twisted.web.client import ResponseFailed

class BrokenLinkMiddleware(object):

    ignore_http_status_codes = [200, 301, 302]
    exceptions_to_log = (ConnectError, ConnectionDone, ConnectionLost, ConnectionRefusedError, DNSLookupError,
                         IOError, ResponseFailed, TCPTimedOutError, TimeoutError, defer.TimeoutError)

    def process_response(self, request, response, spider):
        if response.status not in self.ignore_http_status_codes:
            # Do your logging here; response.url has the url,
            # response.status has the status
            spider.logger.info('Broken link: %s (%s)', response.url, response.status)
        return response

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.exceptions_to_log):
            # Do your logging here; request.url has the url that failed
            spider.logger.info('Request failed: %s (%s)', request.url, exception)

That handles some exceptions that may not indicate a broken link (like ConnectError, TimeoutError, and TCPTimedOutError), but you may want to log them anyway.
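For the middleware to run at all, it has to be enabled in the project settings. A minimal sketch, assuming the class lives in a module such as myproject.middlewares (the module path and priority value are illustrative):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.BrokenLinkMiddleware': 543,  # illustrative priority
}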
