
How to pause spider in Scrapy

I'm new to Scrapy and I need to pause a spider after receiving an error response (like 407 or 429).
Also, I should do this without using time.sleep(), using middlewares or extensions instead.

Here is my middleware:

from scrapy import signals
from pydispatch import dispatcher

class Handle429:
    def __init__(self):
        dispatcher.connect(self.item_scraped, signal=signals.item_scraped)

    def item_scraped(self, item, spider, response):
        if response.status == 429:
            print("THIS IS 429 RESPONSE")
            #
            # here stop spider for 10 minutes and then continue
            #

I have read about self.crawler.engine.pause(), but how can I implement it in my middleware, and set a custom pause time?
Or is there another way to do this? Thanks.
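
(For reference, one way to wire engine.pause() into a downloader middleware is sketched below. The class name PauseOn429, the 600-second delay, and the use of from_crawler to reach the engine are illustrative assumptions, not from the original post.)

from twisted.internet import reactor

class PauseOn429:
    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        # standard Scrapy hook that hands the middleware the crawler object
        return cls(crawler)

    def process_response(self, request, response, spider):
        if response.status == 429:
            # stop scheduling new requests, but keep the process alive
            self.crawler.engine.pause()
            # resume after 600 seconds without blocking the reactor
            reactor.callLater(600, self.crawler.engine.unpause)
        return response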

I have solved my problem. First of all, a middleware can define the standard hook methods process_response and process_request.

In settings.py (this lets 404 responses through to the spider instead of having HttpErrorMiddleware filter them out):

HTTPERROR_ALLOWED_CODES = [404]
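
The middleware also has to be enabled in the same file. A minimal sketch; the module path myproject.middlewares and the priority 543 are assumptions:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.HandleErrorResponse': 543,
}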

Then I changed my middleware class:

from twisted.internet import reactor
from twisted.internet.defer import Deferred

# replaces class Handle429
class HandleErrorResponse:

    def __init__(self):
        self.time_pause = 1800

    def process_response(self, request, response, spider):
        # called for every response before it is passed on to the spider
        pass

Then I found code that helped me pause the spider without time.sleep():

# in HandleErrorResponse
def process_response(self, request, response, spider):
    print(response.status)
    if response.status == 404:
        # fire d.callback(response) after time_pause seconds
        d = Deferred()
        reactor.callLater(self.time_pause, d.callback, response)
        # return the Deferred so Scrapy waits for it to fire
        # before handing the response to the spider
        return d

    return response

And it works.
reactor.callLater() does not stop Scrapy's event loop; it schedules d.callback(response) to run after self.time_pause seconds. Because process_response returns that Deferred, Scrapy waits until it fires, and only then is the response sent to the spider.
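
The same pattern should extend to the status codes from the question; a sketch, assuming 407 and 429 are also added to HTTPERROR_ALLOWED_CODES so the spider can handle them after the delay:

# inside process_response
if response.status in (404, 407, 429):
    d = Deferred()
    reactor.callLater(self.time_pause, d.callback, response)
    return d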
