Scrapy: not able to schedule
I want to run a spider every couple of minutes. I put the following script in my project that I want to call for this purpose.
import schedule

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def job():
    process = CrawlerProcess(get_project_settings())
    process.crawl('amazon_spider')
    process.start()  # error: twisted.internet.error.ReactorNotRestartable
    #process.start(stop_after_crawl=False)  # process gets stuck

schedule.every().minutes.do(job)

while True:
    schedule.run_pending()
With this approach the process gets the following error: twisted.internet.error.ReactorNotRestartable, or it gets stuck if I use process.start(stop_after_crawl=False).
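For context, this error arises because Twisted's reactor can only be started once per process, so a second process.start() in the same interpreter must fail. A common workaround is to run each crawl in a fresh child process, which gets a brand-new reactor every time. A minimal stdlib sketch of the pattern (run_spider is a hypothetical placeholder standing in for the actual Scrapy crawl):

```python
import multiprocessing

def run_spider():
    # In the real project this body would build a CrawlerProcess,
    # call process.crawl('amazon_spider') and process.start().
    # Placeholder here so the pattern is runnable without Scrapy.
    return "crawled"

def crawl_in_subprocess():
    # Each crawl runs in its own process, hence its own reactor,
    # so ReactorNotRestartable never fires in the parent.
    p = multiprocessing.Process(target=run_spider)
    p.start()
    p.join()
    return p.exitcode

if __name__ == "__main__":
    print(crawl_in_subprocess())  # 0 on a clean run
```

The scheduler loop then calls crawl_in_subprocess() instead of process.start(), and the parent process never touches the reactor at all.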
From a previous stackoverflow post I also tried this:
from twisted.internet import reactor
from amazon.spiders.amazon_spider import AmazonSpider
from scrapy.crawler import CrawlerRunner

def run_crawl():
    runner = CrawlerRunner({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    })
    deferred = runner.crawl(AmazonSpider)
    deferred.addCallback(reactor.callLater, 10, run_crawl)
    return deferred

run_crawl()
reactor.run()
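One thing worth noting about the snippet above: Deferred.addCallback(fn, *args) invokes fn(result, *args), so reactor.callLater ends up receiving the crawl result in its delay slot; the usual fix is a wrapper like lambda _: reactor.callLater(10, run_crawl). A stdlib sketch of that argument-forwarding behaviour (MiniDeferred and fake_call_later are hypothetical stand-ins for the Twisted objects, not the real API):

```python
class MiniDeferred:
    """Tiny stand-in for twisted.internet.defer.Deferred, just enough
    to show how addCallback forwards extra arguments."""
    def __init__(self):
        self._callbacks = []

    def addCallback(self, fn, *args):
        self._callbacks.append((fn, args))

    def callback(self, result):
        for fn, args in self._callbacks:
            # The previous result is always prepended to the extra args.
            result = fn(result, *args)
        return result

calls = []

def fake_call_later(delay, fn):
    # Stand-in for reactor.callLater: just record what it was given.
    calls.append((delay, fn))

def reschedule(_result):
    # Correct shape: swallow the crawl result, then schedule the delay.
    fake_call_later(10, reschedule)

d = MiniDeferred()
# Buggy original shape: d.addCallback(fake_call_later, 10, reschedule)
# would invoke fake_call_later(result, 10, reschedule) -- the crawl
# result lands in the delay parameter.
d.addCallback(reschedule)
d.callback(None)
print(calls[0][0])  # 10
```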
The process gets stuck again in the middle of the parse function. I really don't know what to try next. If you have an idea, please let me know. Thank you in advance. (By the way, this is not a duplicate, since the posts on the same subject didn't solve my problem.)
I use apscheduler

pip install apscheduler

then
# -*- coding: utf-8 -*-
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from apscheduler.schedulers.twisted import TwistedScheduler
from Demo.spiders.baidu import YourSpider
process = CrawlerProcess(get_project_settings())
scheduler = TwistedScheduler()
scheduler.add_job(process.crawl, 'interval', args=[YourSpider], seconds=10)
scheduler.start()
process.start(False)
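The 'interval' trigger above essentially re-arms a timer and fires the job once per period, all inside the one reactor that process.start(False) keeps alive. The same re-arming idea can be sketched with only the stdlib sched module (job, run_every, the 0.05 s interval, and the run count of 3 are all illustrative):

```python
import sched
import time

def job(runs):
    # Stand-in for process.crawl(...): just record that we ran.
    runs.append(time.monotonic())

def run_every(scheduler, interval, fn, runs, remaining):
    # Re-arm the timer before running the job, mirroring what an
    # interval trigger does, so slow jobs don't drift the schedule.
    if remaining > 1:
        scheduler.enter(interval, 1, run_every,
                        (scheduler, interval, fn, runs, remaining - 1))
    fn(runs)

runs = []
s = sched.scheduler(time.monotonic, time.sleep)
s.enter(0, 1, run_every, (s, 0.05, job, runs, 3))
s.run()
print(len(runs))  # 3
```

Unlike this sketch, the TwistedScheduler variant shares the reactor with Scrapy, which is why the jobs and the crawls can coexist in one process.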