Scrapy and response status codes: how to check them?

Question

I'm using Scrapy to crawl my sitemap and check for 404, 302, and 200 pages, but I can't seem to get the response code. This is my code so far:

# In very old Scrapy releases this class lived in scrapy.contrib.spiders
from scrapy.spiders import SitemapSpider


class TothegoSitemapHomesSpider(SitemapSpider):
    name = 'tothego_homes_spider'

    ## things we need for tothego ##
    sitemap_urls = []
    ok_log_file = '/opt/Workspace/myapp/crawler/valid_output/ok_homes'
    bad_log_file = '/opt/Workspace/myapp/crawler/bad_homes'
    fourohfour = '/opt/Workspace/myapp/crawler/404/404_homes'

    def __init__(self, **kwargs):
        SitemapSpider.__init__(self)

        if len(kwargs) > 1:
            if 'domain' in kwargs:
                self.sitemap_urls = ['http://url_to_sitemap%s/sitemap.xml' % kwargs['domain']]

            if 'country' in kwargs:
                self.ok_log_file += "_%s.txt" % kwargs['country']
                self.bad_log_file += "_%s.txt" % kwargs['country']
                self.fourohfour += "_%s.txt" % kwargs['country']

        else:
            print("USAGE: scrapy [crawler_name] -a country=[country] -a domain=[domain] \nWith [crawler_name]:\n- tothego_homes_spider\n- tothego_cars_spider\n- tothego_jobs_spider\n")
            exit(1)

    def parse(self, response):
        try:
            if response.status == 404:
                ## 404s are also tracked separately
                self.append(self.bad_log_file, response.url)
                self.append(self.fourohfour, response.url)

            elif response.status == 200:
                ## write to ok_log_file
                self.append(self.ok_log_file, response.url)
            else:
                self.append(self.bad_log_file, response.url)

        except Exception as e:
            self.log('[exception] : %s' % e)

    def append(self, path, string):
        # Append one line to the given log file
        with open(path, 'a') as f:
            f.write(string + "\n")

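According to the Scrapy docs, response.status is an integer corresponding to the response's status code. So far it logs only the URLs with status 200, while the 302s are not written to the output file (although I can see the redirects in crawl.log). What do I need to do to "trap" the 302 responses and save those URLs?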

Answer 1

http://readthedocs.org/docs/scrapy/en/latest/topics/spider-middleware.html#module-scrapy.contrib.spidermiddleware.httperror

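Assuming the default spider middleware is enabled, HttpErrorMiddleware filters out responses whose status code is outside the 200-300 range. You can tell the middleware that you want to handle 404s by setting the handle_httpstatus_list attribute on your spider.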

class TothegoSitemapHomesSpider(SitemapSpider):
    handle_httpstatus_list = [404]
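
To make the filtering rule concrete, here is a minimal pure-Python sketch of the decision described above. It is a simplified model and not Scrapy's actual source: the function name reaches_callback is hypothetical, and real Scrapy also consults settings such as HTTPERROR_ALLOWED_CODES and HTTPERROR_ALLOW_ALL, which this sketch ignores.

```python
def reaches_callback(status, handle_httpstatus_list=()):
    """Simplified model of the HttpErrorMiddleware decision.

    Hypothetical helper for illustration only: returns True when a
    response with this status would be passed on to the spider callback.
    """
    # Successful (2xx) responses always reach the callback.
    if 200 <= status < 300:
        return True
    # Anything else is dropped unless the spider explicitly lists it.
    return status in handle_httpstatus_list


if __name__ == "__main__":
    for status in (200, 302, 404, 500):
        print(status,
              reaches_callback(status),
              reaches_callback(status, handle_httpstatus_list=[404]))
```

With handle_httpstatus_list = [404], the 404 branch in the question's parse method can actually run, because those responses are no longer dropped before the callback.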

Answer 2

Just to have a complete answer here:

Set handle_httpstatus_list = [302] on the spider.
On the request, set dont_redirect to True in meta.

For example: Request(URL, meta={'dont_redirect': True})
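
Putting both steps together, a spider might look like the sketch below. The spider name, the example.com URL, and the yielded dict are placeholders chosen for illustration; only handle_httpstatus_list and the dont_redirect meta key come from the answer. It uses a plain scrapy.Spider rather than SitemapSpider, because a SitemapSpider builds its page requests internally.

```python
import scrapy


class RedirectAwareSpider(scrapy.Spider):
    # Hypothetical name and URL, used only for this sketch.
    name = "redirect_aware_spider"
    start_urls = ["http://example.com/"]

    # Let 302 and 404 responses through to the callback.
    handle_httpstatus_list = [302, 404]

    def start_requests(self):
        # Overriding start_requests is the classic way to customise the
        # initial requests; newer Scrapy releases may prefer another hook.
        for url in self.start_urls:
            # dont_redirect stops the redirect from being followed, so the
            # 302 response itself is delivered to parse().
            yield scrapy.Request(url, meta={"dont_redirect": True},
                                 callback=self.parse)

    def parse(self, response):
        # response.status is the integer HTTP status code.
        yield {"url": response.url, "status": response.status}
```

Assuming the file is saved as status_spider.py (a hypothetical name), it can be run with scrapy runspider status_spider.py -o statuses.json to collect one record per URL.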

2 solutions

Solution 1
25, accepted, 2012-03-14 09:06:10

Solution 2
2, 2017-08-31 12:54:17
