
How to requeue messages when the spider errors, with Scrapy and RabbitMQ (pika)

I am trying to run an MQ with pika and Scrapy, and have the consumer invoke the spider. I have a consumer.py and a Scrapy spider spider.py.

The spider runs inside the consumer with arguments sent by the producer. I use used_channel.basic_ack(delivery_tag=basic_deliver.delivery_tag) to delete the message.

I expect the message to be deleted when the spider finishes the job, and requeued if there is an error. When the spider runs normally everything looks fine: the message is deleted and the job gets done. However, if an error occurs while the spider is running, the message is still deleted even though the job is not finished, so the message is lost.

I checked the RabbitMQ management UI and saw the message count drop to 0 while the spider was still running (the console had not yet reported that the job was done).

I wonder whether this is because Scrapy is asynchronous: while the line run_spider(message=decodebody) is still running, the next line used_channel.basic_ack(delivery_tag=basic_deliver.delivery_tag) does not wait for the spider to finish. (A minimal check of this is sketched after my code below.)

How can I fix this? I want the message to be deleted only after the spider has finished the job correctly.

import pika
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
# setup, MySpider, logger and the rabbit_* / routing_key settings are
# imported or defined elsewhere in the project

setup() # for CrawlerRunner
settings = get_project_settings()

def get_message(used_channel, basic_deliver, properties, body):
    decodebody = bytes.decode(body)

    try:
        run_spider(message=decodebody)
        # ack (delete) the message once run_spider returns
        used_channel.basic_ack(delivery_tag=basic_deliver.delivery_tag)

    except:
        # reject the message so it is requeued on failure
        used_channel.basic_reject(delivery_tag=basic_deliver.delivery_tag)


def run_spider(message):
    crawler = CrawlerRunner(settings)
    crawler.crawl(MySpider, message=message)


while(True):
    try: 
        # blocking connection
        connection = pika.BlockingConnection(pika.ConnectionParameters(host=rabbit_host))
        channel = connection.channel()
        # declare exchange, the setting must be same as producer
        channel.exchange_declare(
            exchange=rabbit_exchange,
            exchange_type='direct',  
            durable=True,            
            auto_delete=False        
        )
        # declare queue, the setting must be same as producer
        channel.queue_declare(
            queue=rabbit_queue, 
            durable=True, 
            exclusive=False,
            auto_delete=False
        )
        # bind the setting
        channel.queue_bind(
            exchange=rabbit_exchange,
            queue=rabbit_queue,
            routing_key=routing_key
        )

        channel.basic_qos(prefetch_count=1) 
        channel.basic_consume(
            queue=rabbit_queue,
            on_message_callback=get_message,
            auto_ack=False
        )

        logger.info(' [*] Waiting for messages. To exit press CTRL+C')
        # start crawler
        channel.start_consuming()
    
    except pika.exceptions.ConnectionClosed as err:
        print('ConnectionClosed error:', err)
        continue
    # Do not recover on channel errors
    except pika.exceptions.AMQPChannelError as err:
        print("Caught a channel error: {}, stopping...".format(err))
        break
    # Recover on all other connection errors
    except pika.exceptions.AMQPConnectionError as err:    
        print("Connection was closed, retrying...", err)
        continue
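
For what it's worth, this seems to confirm the suspicion: CrawlerRunner.crawl() only schedules the crawl on the Twisted reactor and returns a Deferred immediately; it does not block until the spider is done. Here is a minimal standalone sketch of that behaviour (MySpider stands in for my spider class, and it assumes a configured Scrapy project):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

def check_crawl_is_async(message):
    runner = CrawlerRunner(get_project_settings())
    # crawl() returns a Deferred right away; the next line runs immediately
    d = runner.crawl(MySpider, message=message)
    print("printed before the spider has finished")
    # these callbacks only fire once the crawl is actually done
    d.addBoth(lambda _: print("printed after the spider has finished"))
    d.addBoth(lambda _: reactor.stop())
    reactor.run()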



I found someone handling multithreading with the pika library for MQ: he uses .is_alive to check whether a thread has finished, so I followed that idea. Scrapy is multithreaded, so I added return crawler and check crawler._active before deleting the message.

The source code of scrapy.crawler

import time

def run_spider(news_info):
    # run the spider with CrawlerRunner and hand the crawler back to the caller
    crawler = CrawlerRunner(settings)
    # schedule the spider; this returns immediately
    crawler.crawl(UrlSpider, news_info=news_info)

    return crawler


# inside get_message():
crawler = run_spider(news_info=decodebody)

# wait until the crawler has no active crawls left
while len(crawler._active) > 0:
    time.sleep(1)

used_channel.basic_ack(delivery_tag=basic_deliver.delivery_tag)
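
Note that crawler._active is a private attribute, and since crawl() returns a Twisted Deferred there is another option: block on that Deferred with crochet's wait_for instead of polling. A rough sketch I have not battle-tested (it assumes setup() is crochet.setup() as in the consumer above; the 600-second timeout is just an example value, and whether a given spider error actually errbacks the Deferred depends on how the spider fails):

from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

setup()  # run the Twisted reactor in a background thread
settings = get_project_settings()

# wait_for blocks the calling (pika) thread until the Deferred returned by
# crawl() fires and re-raises a failure, so the try/except in get_message()
# still works and basic_ack only happens after the crawl has finished.
@wait_for(timeout=600)  # example timeout; pick something longer than the slowest crawl
def run_spider(news_info):
    crawler = CrawlerRunner(settings)
    return crawler.crawl(UrlSpider, news_info=news_info)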

