Scrapy - how to check if spider is running
I have a Scrapy spider which I run every hour using a bash script and crontab.
The spider's run usually takes about 50 minutes but can exceed an hour.
What I want is to check whether the spider is still running, and start a new crawl only if it is not.
BASH SCRIPT
#!/usr/bin/env bash
source /home/milano/.virtualenvs/keywords_search/bin/activate
cd /home/milano/PycharmProjects/keywords_search/bot
# HERE I WANT TO CHECK, WHETHER THE PREVIOUS CRAWLING ALREADY STOPPED, IF NOT, DO NOTHING
scrapy crawl main_spider
The only thing that comes to my mind is to use telnet. If telnet localhost 6023 can connect, it means the spider is still running; otherwise I can start it.
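The telnet-console idea from the question could be sketched in bash without the telnet binary at all, using bash's built-in /dev/tcp redirection to probe the port. This assumes Scrapy's telnet console is enabled and listening on its default port 6023; the messages are illustrative.

```shell
#!/usr/bin/env bash
# Probe Scrapy's telnet console (default: localhost:6023).
# Returns 0 (success) if something accepts a TCP connection there,
# which we take to mean a spider is still running.
spider_running() {
    (echo > /dev/tcp/localhost/6023) 2>/dev/null
}

if spider_running; then
    echo "spider still running - skipping this run"
else
    echo "no spider detected - would start crawl here"
    # scrapy crawl main_spider
fi
```

Note this only works if the telnet console extension is not disabled in the project settings, and it cannot distinguish a Scrapy console from any other process bound to port 6023.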
You need some sort of locking mechanism.
The best way to achieve an atomic lock from bash is to use mkdir and check its exit code to know whether you acquired the lock.
Here's a more in-depth explanation: http://wiki.bash-hackers.org/howto/mutex
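The mkdir approach described above could look like this when spliced into the question's script. mkdir is atomic: it fails if the directory already exists, so only one instance can win the lock. The lock path is illustrative, not something from the answer.

```shell
#!/usr/bin/env bash
# Illustrative lock path; any writable location works.
LOCKDIR=/tmp/main_spider.lock

if mkdir "$LOCKDIR" 2>/dev/null; then
    # We won the lock. Remove it on exit, even if the crawl is
    # interrupted, so a crash doesn't block future runs.
    trap 'rmdir "$LOCKDIR"' EXIT
    echo "lock acquired - starting crawl"
    # scrapy crawl main_spider
else
    echo "previous crawl still running - exiting"
fi
```

One caveat: if the machine reboots or the script is killed with SIGKILL, the trap never fires and the stale lock directory has to be removed by hand (or aged out by checking its mtime).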
Of course you could always go for dirtier methods, like a grep on process names or something similar.
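The "dirtier" process-name check could be sketched with pgrep -f, which matches against the full command line rather than just the executable name. The spider name matches the question's script; whether this is reliable depends on nothing else on the box having a similar command line.

```shell
#!/usr/bin/env bash
# Look for a running "scrapy crawl main_spider" command line.
# pgrep -f exits 0 if at least one matching process exists.
if pgrep -f "scrapy crawl main_spider" > /dev/null; then
    echo "spider already running"
else
    echo "would start crawl here"
    # scrapy crawl main_spider
fi
```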
You could also lock inside Scrapy itself, e.g. add a simple middleware or extension that checks a shared resource... Plenty of ways to do it :)