I need to run some multithreaded/multiprocessing work in Scrapy (because I have a library that uses blocking calls) and, after it completes, hand a Request back to the Scrapy engine.
I need something like this:
def blocking_call(self, html):
    # ....
    # do some work in blocking call
    return Request(url)

def parse(self, response):
    return self.blocking_call(response.body)
How can I do that? I think I should use the Twisted reactor and a Deferred object. But a Scrapy parse callback must return only None, a Request, or a BaseItem object.
Based on the answer from @Jean-Paul Calderone, I did some investigation and testing, and here is what I found.
Internally, Scrapy uses the Twisted framework to manage synchronous and asynchronous request/response calls.
Scrapy issues requests (crawling) asynchronously, but it processes responses (our custom parse callback functions) synchronously. So if you have a blocking call in a callback, it will block the whole engine.
Fortunately, this can be changed. When processing the result of a Deferred's callback, Twisted handles the case where the callback returns another Deferred (see the twisted.internet.defer.Deferred source). In that case, the outer callback chain is paused until the returned Deferred fires, and then continues with that Deferred's result.
Basically, if we return a Deferred from our response callback, the callback changes from synchronous to asynchronous. For that we can use deferToThread (which internally calls deferToThreadPool(reactor, reactor.getThreadPool()..., as used in @Jean-Paul Calderone's code example).
The working code example is:
from scrapy import Request
from twisted.internet.threads import deferToThread

class SpiderWithBlocking(...):
    ...

    def parse(self, response):
        # deferToThread(f, *args) takes the function first and uses the
        # reactor's own thread pool, so the reactor is not passed in.
        return deferToThread(self.blocking_call, response.body)

    def blocking_call(self, html):
        # ....
        # do some work in blocking call
        return Request(url)
Additionally, only callbacks can return Deferred objects; start_requests cannot (Scrapy logic).
If you want to return a Deferred that fires after your blocking operation has finished running in one of the reactor's thread pool threads, use deferToThreadPool:
from twisted.internet.threads import deferToThreadPool
from twisted.internet import reactor

...

def parse(self, response):
    return deferToThreadPool(
        reactor, reactor.getThreadPool(), self.blocking_call, response.body)