How to use threading in Scrapy/Twisted, i.e. how to make async calls to blocking code in response callbacks?
I need to run some multi-threaded or multiprocessing work in Scrapy (because I have a library that makes blocking calls) and, once it completes, hand a Request back to the Scrapy engine.
I need something like this:
def blocking_call(self, html):
    # ....
    # do some work in blocking call
    return Request(url)

def parse(self, response):
    return self.blocking_call(response.body)
How can I do that? I think I should use the Twisted reactor and a Deferred object. But a Scrapy parse callback must return only None, a Request, or a BaseItem object.
Based on the answer from @Jean-Paul Calderone, I did some investigation and testing, and here is what I found.
Internally, Scrapy uses the Twisted framework to manage its synchronous and asynchronous request/response calls.
Scrapy issues requests (crawling) asynchronously, but response processing (our custom parse callback functions) is done synchronously. So if you have a blocking call in a callback, it will block the whole engine.
Fortunately, this can be changed. When processing the result of a Deferred's callback, Twisted handles the case (see the twisted.internet.defer.Deferred source) in which that callback returns another Deferred. In that case Twisted pauses the callback chain and resumes it, with the new Deferred's result, once the new Deferred fires.
Basically, if we return a Deferred object from our response callback, the callback call changes from synchronous to asynchronous. For that we can use deferToThread, which internally calls deferToThreadPool(reactor, reactor.getThreadPool(), ...), the function used in @Jean-Paul Calderone's code example.
The working code example is:
from scrapy import Request
from twisted.internet.threads import deferToThread

class SpiderWithBlocking(...):
    ...
    def parse(self, response):
        # deferToThread(f, *args) takes the callable first and uses
        # the reactor's thread pool itself, so no reactor argument.
        return deferToThread(self.blocking_call, response.body)

    def blocking_call(self, html):
        # ....
        # do some work in blocking call
        return Request(url)
Additionally, only callbacks can return Deferred objects; start_requests cannot (this is Scrapy's own logic).
If you want to return a Deferred that fires after your blocking operation has finished running in one of the reactor's thread pool threads, use deferToThreadPool:
from twisted.internet.threads import deferToThreadPool
from twisted.internet import reactor

...

def parse(self, response):
    return deferToThreadPool(
        reactor, reactor.getThreadPool(), self.blocking_call, response.body)