
Raise CloseSpider from Scrapy pipeline

I need to raise CloseSpider from a Scrapy Pipeline. Either that, or return some parameter from the Pipeline back to the Spider so the Spider can do the raise.

For example, if the date already exists, raise CloseSpider:

raise CloseSpider('Already been scraped:' + response.url)

Is there a way to do this?

As per the Scrapy docs, the CloseSpider exception is only meant to be raised from a callback function (by default, the parse function) of a Spider; raising it in a pipeline will crash the spider. To achieve a similar result from a pipeline, you can initiate a shutdown signal, which will close Scrapy gracefully.

# Note: scrapy.project was removed in later Scrapy releases, so this only
# works on old Scrapy versions that still provide the crawler singleton.
from scrapy.project import crawler
crawler._signal_shutdown(9, 0)

Do remember that Scrapy may still process requests that have already been fired, or even scheduled ones, after the shutdown signal has been initiated.
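On newer Scrapy versions, where the scrapy.project module no longer exists, one way to get a similar effect is to send the process a SIGINT from the pipeline, which triggers the same graceful shutdown as pressing Ctrl-C. Below is a minimal sketch, assuming a POSIX system; the pipeline name and the duplicate check on a 'date' field are only illustrative.

import os
import signal

class ShutdownSignalPipeline:
    def process_item(self, item, spider):
        # Illustrative condition: replace with your own duplicate check.
        if item.get('date') in getattr(spider, 'scraped_dates', set()):
            # SIGINT starts the same graceful shutdown as Ctrl-C; as noted
            # above, already scheduled requests may still be processed.
            os.kill(os.getpid(), signal.SIGINT)
        return item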

To do it from the Spider instead, set a flag on the Spider from the Pipeline like this.

def process_item(self, item, spider):
    if some_condition_is_met:  # placeholder condition from the answer
        spider.close_manually = True
    return item

After this, you can raise the CloseSpider exception in the callback function of your spider.

from scrapy.exceptions import CloseSpider

def parse(self, response):
    if self.close_manually:
        raise CloseSpider('Already been scraped.')
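Putting the two snippets together, a minimal self-contained sketch could look like the following; the spider name, start URL, item fields, and the condition in the pipeline are placeholders, not part of the original answer.

import scrapy
from scrapy.exceptions import CloseSpider

class ExampleSpider(scrapy.Spider):
    name = 'example'                      # placeholder name
    start_urls = ['https://example.com']  # placeholder URL
    close_manually = False                # flag flipped by the pipeline

    def parse(self, response):
        if self.close_manually:
            raise CloseSpider('Already been scraped.')
        yield {'url': response.url}

class FlagPipeline:
    def process_item(self, item, spider):
        # Placeholder check; seen_urls would be populated elsewhere.
        if item['url'] in getattr(spider, 'seen_urls', set()):
            spider.close_manually = True
        return item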

I prefer the following solution.

class MongoDBPipeline(object):

    def process_item(self, item, spider):
        # Pass the spider itself (not the pipeline) to close_spider.
        spider.crawler.engine.close_spider(spider, reason='duplicate')
        return item

Source: Force spider to stop in scrapy
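Applied to the duplicate-date scenario from the question, a minimal sketch of this approach could look like this; the 'date' field and the in-memory seen_dates set are illustrative, not part of the original MongoDB pipeline.

class DuplicateDatePipeline:
    def __init__(self):
        self.seen_dates = set()

    def process_item(self, item, spider):
        date = item.get('date')
        if date in self.seen_dates:
            # Ask the engine to stop the crawl; requests already scheduled
            # may still be processed before the spider actually closes.
            spider.crawler.engine.close_spider(spider, reason='duplicate')
        self.seen_dates.add(date)
        return item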
