
How to use threading in Scrapy/Twisted, i.e. how to do async calls to blocking code in response callbacks?

I need to run some multi-threaded/multiprocessing work in Scrapy (because I have a library that uses blocking calls), and after it completes, put a Request back into the Scrapy engine.

I need something like this:

def blocking_call(self, html):
    # ....
    # do some work in blocking call
    return Request(url)

def parse(self, response):
    return self.blocking_call(response.body)

How can I do that? I think I should use the Twisted reactor and Deferred objects. But a Scrapy parse callback must return only None, a Request, or a BaseItem object.

Based on the answer from @Jean-Paul Calderone, I did some investigation and testing, and here is what I found out.

Internally, Scrapy uses the Twisted framework to manage synchronous and asynchronous request/response calls.

Scrapy issues requests (crawling) asynchronously, but responses (our custom parse callback functions) are processed synchronously. So if you have a blocking call in a callback, it will block the whole engine.
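As a minimal sketch of the problem (time.sleep stands in for any blocking library call, and the URL is a placeholder):

import time
from scrapy import Spider

class StallingSpider(Spider):
    name = "stalling"
    start_urls = ["http://example.com"]  # placeholder URL

    def parse(self, response):
        # This runs in the reactor thread; while it sleeps, no other
        # requests are sent and no other responses are processed.
        time.sleep(30)
        return None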

Luckily, this can be changed. When processing a response callback's result, Twisted handles the case (see the twisted.internet.defer.Deferred source) where a callback returns another Deferred object. In that case, Twisted chains a new asynchronous call.
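As a minimal, Scrapy-free sketch of that chaining behaviour (deferLater here just simulates an asynchronous step):

from twisted.internet import defer, reactor
from twisted.internet.task import deferLater

def async_step(result):
    # Returning another Deferred from a callback pauses the outer chain
    # until the inner Deferred fires; its result then flows onward.
    return deferLater(reactor, 1.0, lambda: result + " -> async step done")

d = defer.Deferred()
d.addCallback(async_step)
d.addCallback(print)
reactor.callWhenRunning(d.callback, "sync step done")
reactor.callLater(2.0, reactor.stop)
reactor.run()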

Basically, if we return a Deferred object from our response callback, the callback changes from synchronous to asynchronous. For that we can use deferToThread (which internally calls deferToThreadPool(reactor, reactor.getThreadPool(), ...), the function used in @Jean-Paul Calderone's code example).

The working code example is:

from twisted.internet.threads import deferToThread
from scrapy import Request

class SpiderWithBlocking(...):
    ...
    def parse(self, response):
        # Returning the Deferred from deferToThread lets the reactor keep
        # running while blocking_call executes in a thread-pool thread.
        return deferToThread(self.blocking_call, response.body)

    def blocking_call(self, html):
        # ....
        # do some work in blocking call
        return Request(url)

Additionally, only response callbacks can return Deferred objects; start_requests cannot (by Scrapy's logic).
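To make that constraint concrete, here is a minimal sketch (URLs are placeholders and blocking_call stands in for the blocking library work): start_requests must yield plain Request objects, while a response callback such as parse may return a Deferred:

from twisted.internet.threads import deferToThread
from scrapy import Spider, Request

class BlockingCallSpider(Spider):
    name = "blocking_call_spider"

    def start_requests(self):
        # Must yield Request objects directly; a Deferred is not accepted here.
        yield Request("http://example.com", callback=self.parse)

    def parse(self, response):
        # A response callback, by contrast, may hand back a Deferred.
        return deferToThread(self.blocking_call, response.body)

    def blocking_call(self, html):
        # stand-in for the blocking library work
        return Request("http://example.com/next", callback=self.parse)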

If you want to return a Deferred that fires after your blocking operation has finished running in one of the reactor's thread pool threads, use deferToThreadPool:

from twisted.internet.threads import deferToThreadPool
from twisted.internet import reactor

...

    def parse(self, response):
        return deferToThreadPool(
            reactor, reactor.getThreadPool(), self.blocking_call, response.body)

