
Scrapy - how to check if spider is running

I have a Scrapy spider which I run every hour using a bash script and crontab.

The running time of the spider is about 50 minutes, but it can be more than an hour.

What I want is to check whether the spider is still running, and start a new crawl only if it is not.

BASH SCRIPT

#!/usr/bin/env bash

source /home/milano/.virtualenvs/keywords_search/bin/activate
cd /home/milano/PycharmProjects/keywords_search/bot

# HERE I WANT TO CHECK, WHETHER THE PREVIOUS CRAWLING ALREADY STOPPED, IF NOT, DO NOTHING

scrapy crawl main_spider

The only thing which comes to my mind is to use telnet.

If it can connect - telnet localhost 6023 - it means that the spider is still running; otherwise I can start the spider.
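A minimal sketch of that idea, assuming the spider's telnet console listens on the default port 6023 and that nc (netcat) is available on the machine:

# skip this run if something is already listening on Scrapy's telnet console port
if nc -z localhost 6023 2>/dev/null; then
    echo "Spider appears to be running, skipping this run"
    exit 0
fi

scrapy crawl main_spider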

You need some sort of locking mechanism.

The best way to achieve an atomic lock from bash is to use mkdir and check the result code to know whether you acquired the lock or not.

Here's a more in-depth explanation: http://wiki.bash-hackers.org/howto/mutex
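A minimal sketch of how your script could use that, assuming /tmp/scrapy_main_spider.lock as the lock directory (the path is only illustrative):

#!/usr/bin/env bash
LOCKDIR=/tmp/scrapy_main_spider.lock

# mkdir either creates the directory and succeeds (lock acquired) or fails, atomically
if ! mkdir "$LOCKDIR" 2>/dev/null; then
    echo "Previous crawl still running, doing nothing"
    exit 0
fi

# release the lock however the script exits (a kill -9 would still leave it behind)
trap 'rmdir "$LOCKDIR"' EXIT

source /home/milano/.virtualenvs/keywords_search/bin/activate
cd /home/milano/PycharmProjects/keywords_search/bot
scrapy crawl main_spider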

Of course you could always go for dirtier methods, like a grep on process names or something like that.
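For example, a rough check against the command line, assuming pgrep is installed:

# pgrep -f matches against the full command line, not just the process name
if pgrep -f "scrapy crawl main_spider" >/dev/null; then
    echo "Spider still running, exiting"
    exit 0
fi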

You could also have a lock in Scrapy itself: add a simple middleware check for a shared resource... Plenty of ways to do it :)
