
Easiest way to run scrapy crawler so it doesn't block the script

The official docs show several ways to run Scrapy crawlers from code:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished

But all of them block the script until crawling is finished. What's the easiest way in Python to run the crawler in a non-blocking, async manner?

I tried every solution I could find, and the only one that worked for me was this. But in order to make it work with scrapy 1.1rc1 I had to tweak it a little bit:

from scrapy.crawler import Crawler
from scrapy import signals
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from billiard import Process

class CrawlerScript(Process):
    def __init__(self, spider):
        Process.__init__(self)
        settings = get_project_settings()
        self.crawler = Crawler(spider.__class__, settings)
        # stop the reactor (and with it this child process) once the spider closes
        self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        self.spider = spider

    def run(self):
        # runs in the child process: schedule the crawl and start the reactor there
        self.crawler.crawl(self.spider)
        reactor.run()

def crawl_async():
    spider = MySpider()
    crawler = CrawlerScript(spider)
    crawler.start()
    crawler.join()

So now when I call crawl_async, it starts crawling and doesn't block my current thread. I'm absolutely new to scrapy, so maybe this isn't a very good solution, but it worked for me.
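
A minimal usage sketch (the wrapper below is my own illustration, not part of the original answer; it assumes MySpider is importable where crawl_async is defined):

# crawl_async is the helper defined in the snippet above
if __name__ == '__main__':
    crawl_async()  # the crawl runs in a separate billiard process with its own reactor
    print('crawl_async has returned; the reactor in this process was never started')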

I used these versions of the libraries:

cffi==1.5.0
Scrapy==1.1rc1
Twisted==15.5.0
billiard==3.3.0.22

Netimen's answer is correct. process.start() calls reactor.run(), which blocks the thread. It's just that I don't think it is necessary to subclass billiard.Process. Although poorly documented, billiard.Process does have a set of APIs to call another function asynchronously without subclassing.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from billiard import Process

crawler = CrawlerProcess(get_project_settings())
# stop_after_crawl=False is forwarded to crawler.start() via kwargs so the
# reactor is not stopped when the crawl finishes
process = Process(target=crawler.start, kwargs={'stop_after_crawl': False})


def crawl(*args, **kwargs):
    crawler.crawl(*args, **kwargs)
    process.start()

Note that without stop_after_crawl=False, you may run into a ReactorNotRestartable exception when you run the crawler more than once.
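
For illustration, a minimal sketch of how the crawl helper above might be used (MySpider and its import path are placeholders, not part of the original answer):

from myproject.spiders import MySpider  # hypothetical import path; adjust to your project

# crawler and crawl() are defined in the snippet above
if __name__ == '__main__':
    crawl(MySpider)  # schedules the spider, then starts the crawl in a child process
    print('crawl() returned; the child process continues crawling in the background')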
