How to pause a spider in Scrapy
I'm new to Scrapy and I need to pause a spider after receiving an error response (such as 407 or 429).
I also need to do this without using time.sleep(), using middlewares or extensions instead.
Here is my middleware:
from scrapy import signals
from pydispatch import dispatcher

class Handle429:
    def __init__(self):
        dispatcher.connect(self.item_scraped, signal=signals.item_scraped)

    def item_scraped(self, item, spider, response):
        if response.status == 429:
            print("THIS IS 429 RESPONSE")
            #
            # here stop spider for 10 minutes and then continue
            #
I read about self.crawler.engine.pause(), but how can I implement it in my middleware and set a custom pause time?
Or is there another way to do this? Thanks.
I have solved my problem. First of all, a middleware can define hook methods such as process_response or process_request, which Scrapy calls automatically.
In settings.py:
HTTPERROR_ALLOWED_CODES = [404]
Then I changed my middleware class:
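from twisted.internet import reactor
from twisted.internet.defer import Deferred

# replaces class Handle429
class HandleErrorResponse:
    def __init__(self):
        self.time_pause = 1800  # delay in seconds (30 minutes)

    def process_response(self, request, response, spider):
        # Scrapy calls this hook for every response before it reaches the spider;
        # it has to return a response (or a request), so pass the response through
        return response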
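For the class above to be called at all, it has to be enabled as a downloader middleware. A minimal sketch of that setting, assuming the class lives in a module named myproject.middlewares (a hypothetical path, adjust it to your project) and using 543 purely as an example order value:

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # "myproject.middlewares" is a hypothetical module path; use your project's own.
    # The number only controls ordering relative to other downloader middlewares.
    "myproject.middlewares.HandleErrorResponse": 543,
}
```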
Then I found some code that helps me pause the spider without time.sleep():
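# in HandleErrorResponse
    def process_response(self, request, response, spider):
        print(response.status)
        if response.status == 404:
            d = Deferred()
            # fire the Deferred with the response after self.time_pause seconds
            reactor.callLater(self.time_pause, d.callback, response)
            # return the Deferred, not the response, otherwise nothing is delayed;
            # this assumes Scrapy waits for a Deferred returned from this hook
            return d
        return response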
And it works.
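As a variation on the same idea, the delay can probably also be written as a coroutine. This is only a sketch under assumptions: it relies on a Scrapy version that accepts async def downloader middleware methods (2.0 or later) and on the asyncio reactor being enabled through the TWISTED_REACTOR setting. The class name DelayErrorResponse and its constructor parameters are made up for illustration.

```python
import asyncio


class DelayErrorResponse:
    """Downloader middleware sketch: hold back selected responses without blocking."""

    def __init__(self, time_pause=1800, statuses=(404,)):
        self.time_pause = time_pause    # delay in seconds
        self.statuses = set(statuses)   # status codes that trigger the delay

    async def process_response(self, request, response, spider):
        if response.status in self.statuses:
            # Suspends only this coroutine; the event loop keeps running,
            # so other requests and responses are still processed meanwhile.
            await asyncio.sleep(self.time_pause)
        return response
```

Like the Deferred version, this delays only the affected response rather than the whole crawl.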
I can't fully explain how reactor.callLater() works, but as far as I understand it does not block anything: it schedules a function to run on Twisted's event loop after the given number of seconds, and the response is passed on to the spider once that call fires.
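For the original question about self.crawler.engine.pause() with a custom pause time, a rough sketch could look like the following. It assumes the middleware is created through from_crawler so it can reach crawler.engine, and that pausing the engine is acceptable even though requests already in flight will still finish. The class name PauseOnStatus, the default values and the call_later parameter (there only so the scheduling can be swapped out in tests) are hypothetical.

```python
class PauseOnStatus:
    """Downloader middleware sketch: pause the whole engine on throttling statuses."""

    def __init__(self, crawler, pause_seconds=600, statuses=(407, 429), call_later=None):
        self.crawler = crawler
        self.pause_seconds = pause_seconds
        self.statuses = set(statuses)
        self._call_later = call_later   # hypothetical hook; defaults to reactor.callLater
        self._paused = False

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to build the middleware and hands over the crawler.
        return cls(crawler)

    def _schedule(self, delay, func):
        if self._call_later is None:
            # Imported lazily so the reactor is only touched inside a running crawl.
            from twisted.internet import reactor
            self._call_later = reactor.callLater
        return self._call_later(delay, func)

    def process_response(self, request, response, spider):
        if response.status in self.statuses and not self._paused:
            self._paused = True
            self.crawler.engine.pause()                        # stop scheduling new requests
            self._schedule(self.pause_seconds, self._resume)   # non-blocking timer
        return response

    def _resume(self):
        self._paused = False
        self.crawler.engine.unpause()
```

The error response itself is still passed on to the spider here; re-scheduling the failed request would need extra handling that this sketch leaves out.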