
How to use threading in Scrapy/Twisted, i.e. how to do async calls to blocking code in response callbacks?

I need to run some multi-threaded/multiprocessing work in Scrapy (because I use a library that makes blocking calls), and after it completes, put a Request back into the Scrapy engine.

I need something like this:

def blocking_call(self, html):
    # ....
    # do some work in blocking call
    return Request(url)

def parse(self, response):
    return self.blocking_call(response.body)

How can I do that? I think I should use the Twisted reactor and a Deferred object, but a Scrapy parse callback may only return None, a Request, or a BaseItem object.

Based on the answer from @Jean-Paul Calderone I did some investigation and testing, and here is what I found out.

Internally, Scrapy uses the Twisted framework to manage synchronous and asynchronous request/response calls.

Scrapy issues requests (crawling) asynchronously, but responses are processed (our custom parse callbacks are run) synchronously. So if you make a blocking call in a callback, it blocks the whole engine.
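To illustrate the problem, here is a hypothetical sketch (the spider name, URL, and sleep duration are made up): any blocking call inside parse, such as time.sleep, stalls the single reactor thread, so no other requests or responses are handled while it runs.

import time
import scrapy

class BlockingSpider(scrapy.Spider):
    name = "blocking_example"              # hypothetical spider name
    start_urls = ["http://example.com"]    # placeholder URL

    def parse(self, response):
        # This blocks the reactor thread: while sleeping, Scrapy can
        # neither send new requests nor process other responses.
        time.sleep(10)
        yield {"url": response.url}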

Fortunately, this can be changed. When processing the result of a response callback, Twisted handles the case (see the twisted.internet.defer.Deferred source) where a callback returns another Deferred object. In that case Twisted chains onto the new Deferred and waits for it asynchronously.
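This chaining behaviour is plain Twisted and independent of Scrapy. A minimal standalone sketch (the function names and the one-second delay are just for illustration):

from twisted.internet import defer, reactor, task

def first_step(_):
    # Returning a Deferred from a callback pauses the chain until it fires.
    return task.deferLater(reactor, 1.0, lambda: "inner result")

def second_step(result):
    # Receives the value the inner Deferred fired with, not the Deferred itself.
    print(result)
    reactor.stop()

d = defer.succeed(None)
d.addCallback(first_step)
d.addCallback(second_step)
reactor.run()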

Basically, if we return a Deferred object from our response callback, this changes the nature of the callback from sync to async. For that we can use deferToThread, which internally calls deferToThreadPool(reactor, reactor.getThreadPool(), ...), the function used in @Jean-Paul Calderone's code example.

The working code example is:

from scrapy import Request
from twisted.internet.threads import deferToThread

class SpiderWithBlocking(...):
    ...
    def parse(self, response):
        # deferToThread(f, *args) runs f in the reactor's thread pool and
        # returns a Deferred that fires with f's result.
        return deferToThread(self.blocking_call, response.body)

    def blocking_call(self, html):
        # ....
        # do some work in blocking call
        return Request(url)

Additionally, only callbacks can return Deferred objects; start_requests cannot (that is Scrapy's logic).
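In other words, start_requests should still yield plain Request objects, and the deferred work goes into the callback. A minimal sketch of that split (the spider name and URLs are placeholders):

from scrapy import Request, Spider
from twisted.internet.threads import deferToThread

class DeferredSpider(Spider):
    name = "deferred_example"    # placeholder name

    def start_requests(self):
        # start_requests must yield plain Requests, not Deferreds.
        yield Request("http://example.com", callback=self.parse)

    def parse(self, response):
        # The callback may return a Deferred; Scrapy/Twisted will wait for it.
        return deferToThread(self.blocking_call, response.body)

    def blocking_call(self, html):
        # Blocking work goes here; returning a Request feeds it back to the engine.
        return Request("http://example.com/next")    # placeholder follow-up URL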

If you want to return a Deferred that fires after your blocking operation has finished running in one of the reactor's thread pool threads, use deferToThreadPool:

from twisted.internet.threads import deferToThreadPool
from twisted.internet import reactor

...

    def parse(self, response):
        return deferToThreadPool(
            reactor, reactor.getThreadPool(), self.blocking_call, response.body)
