How to requeue messages when the spider errors, with scrapy and RabbitMQ (pika)
I am trying to run an MQ with pika and scrapy, where a consumer invokes the spider. I have a consumer.py and a scrapy spider spider.py.

The spider runs inside the consumer with arguments sent by the producer. I use used_channel.basic_ack(delivery_tag=basic_deliver.delivery_tag) to remove the message.

I expected the message to be removed when the spider finishes its job, and requeued if an error occurs. When the spider runs normally everything looks fine: the message is removed and the job completes. However, if an error happens while the spider is running, the message is still removed even though the job did not finish, so the message is lost.

I looked at the RabbitMQ management UI and saw the message count drop to 0 while the spider was still running (the console had not yet shown the job as finished).

I wonder whether this is because scrapy is asynchronous: while the line run_spider(message=decodebody) is still running, the next line used_channel.basic_ack(delivery_tag=basic_deliver.delivery_tag) does not wait for the spider to finish.

How can I fix this? I want to remove the message only after the spider has properly finished its job.
from scrapy.utils.project import get_project_settings

setup()  # for CrawlerRunner
settings = get_project_settings()

def get_message(used_channel, basic_deliver, properties, body):
    decodebody = bytes.decode(body)
    try:
        run_spider(message=decodebody)
        used_channel.basic_ack(delivery_tag=basic_deliver.delivery_tag)
    except:
        channel.basic_reject(delivery_tag=basic_deliver.delivery_tag)

def run_spider(message):
    crawler = CrawlerRunner(settings)
    crawler.crawl(MySpider, message=message)
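The core of the problem in the callback above is that `CrawlerRunner.crawl()` returns immediately (it hands back a Twisted `Deferred` rather than blocking), so the `try` block never sees a spider failure and the `basic_ack` fires right away. The pattern the callback needs is "ack only after the job has truly completed, requeue on error". Here is a broker-free sketch of that pattern, with a dummy `run_job` standing in for the spider (all names here are illustrative, not pika or Scrapy APIs):

```python
import threading

def run_job(message, done, errors):
    """Dummy stand-in for the spider: signals `done` when it finishes
    and records any exception instead of letting it vanish."""
    try:
        if message == "bad":
            raise ValueError("spider failed")
        # ... real crawling would happen here ...
    except Exception as exc:
        errors.append(exc)
    finally:
        done.set()

def handle(message):
    """Return 'ack' only after the job has actually completed without
    error, otherwise 'requeue' -- the decision the pika callback makes."""
    done = threading.Event()
    errors = []
    worker = threading.Thread(target=run_job, args=(message, done, errors))
    worker.start()
    done.wait()   # block until the job signals completion
    worker.join()
    return "ack" if not errors else "requeue"
```

With this shape, `handle("good")` returns "ack" and `handle("bad")` returns "requeue"; in the real callback those outcomes would map to `basic_ack` and `basic_reject` respectively.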
while True:
    try:
        # blocking connection
        connection = pika.BlockingConnection(pika.ConnectionParameters(host=rabbit_host))
        channel = connection.channel()

        # declare exchange, the setting must be same as producer
        channel.exchange_declare(
            exchange=rabbit_exchange,
            exchange_type='direct',
            durable=True,
            auto_delete=False
        )

        # declare queue, the setting must be same as producer
        channel.queue_declare(
            queue=rabbit_queue,
            durable=True,
            exclusive=False,
            auto_delete=False
        )

        # bind the setting
        channel.queue_bind(
            exchange=rabbit_exchange,
            queue=rabbit_queue,
            routing_key=routing_key
        )

        channel.basic_qos(prefetch_count=1)
        channel.basic_consume(
            queue=rabbit_queue,
            on_message_callback=get_message,
            auto_ack=False
        )

        logger.info(' [*] Waiting for messages. To exit press CTRL+C')
        # start consuming
        channel.start_consuming()

    except pika.exceptions.ConnectionClosed as err:
        print('ConnectionClosed error:', err)
        continue
    # Do not recover on channel errors
    except pika.exceptions.AMQPChannelError as err:
        print("Caught a channel error: {}, stopping...".format(err))
        break
    # Recover on all other connection errors
    except pika.exceptions.AMQPConnectionError as err:
        print("Connection was closed, retrying...", err)
        continue
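One detail worth knowing for the requeue side: pika's `basic_reject` has `requeue=True` by default, so a rejected message goes back on the queue. But a message whose job always fails will then be redelivered forever. A common way out is a bounded retry counter carried in a message header. The sketch below shows only the decision logic, broker-free; the header name `x-retry` and the republish-with-incremented-counter convention are assumptions of this example, not pika built-ins:

```python
MAX_RETRIES = 3  # assumption: how many delivery attempts we tolerate

def decide(headers, job_succeeded):
    """Decide what the consumer callback should do with a delivery.
    `headers` is the message's header table (may be None); the counter
    under 'x-retry' is a convention chosen here, not a pika feature."""
    retries = (headers or {}).get("x-retry", 0)
    if job_succeeded:
        return ("ack", retries)
    if retries + 1 >= MAX_RETRIES:
        # give up: ack (or dead-letter) instead of requeueing forever
        return ("drop", retries + 1)
    # republish with an incremented counter rather than requeue=True,
    # because a plain requeue cannot modify the message's headers
    return ("retry", retries + 1)
```

In the callback, "retry" would mean publishing a copy with the bumped header and acking the original, while "drop" would ack (or route to a dead-letter exchange).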
I found someone who handles multithreading with the pika library and uses .is_alive to check whether a thread has finished, so I followed that idea. Scrapy is multithreaded, so I added return crawler and check crawler._active before removing the message.
def run_spider(news_info):
    # run spider with CrawlerRunner
    crawler = CrawlerRunner(settings)
    # run the spider script
    crawler.crawl(UrlSpider, news_info=news_info)
    return crawler

crawler = run_spider(news_info=decodebody)

# wait until the crawler is done
while len(crawler._active) > 0:
    time.sleep(1)

used_channel.basic_ack(delivery_tag=basic_deliver.delivery_tag)
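Polling `crawler._active` works, but it busy-waits on a private attribute; with Scrapy specifically, the `Deferred` returned by `CrawlerRunner.crawl()` (or by `crawler.join()`) is the documented completion signal. The same "block until the job is truly done, then decide whether to ack" can also be expressed with a future, sketched broker-free below; `fake_crawl` is a hypothetical stand-in for the spider run:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_crawl(message):
    """Stand-in for the spider run; returns a result or raises."""
    if message == "boom":
        raise RuntimeError("crawl failed")
    return f"crawled:{message}"

def consume_one(message):
    """Submit the job, then block until it is finished before deciding
    whether to ack -- no polling of private state, no sleep loop."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fake_crawl, message)
        try:
            result = future.result()  # waits for completion, re-raises errors
            return ("ack", result)
        except Exception:
            return ("requeue", None)
```

A nice side effect of `future.result()` is that exceptions raised inside the job surface at the wait point, so the error path (requeue) is reached naturally instead of being swallowed the way it is when acking right after a fire-and-forget `crawl()` call.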