As I understand it, Scrapy is single-threaded but asynchronous on the network side. I am working on something that requires an API call to an external resource from within the item pipeline. Is there any way to make the HTTP request without blocking the pipeline and slowing down the crawl?
Thanks
You can do it by scheduling a request directly on the crawler engine via crawler.engine.crawl(request, spider). To do that, however, you first need to expose the crawler in your pipeline:
import scrapy
from scrapy.exceptions import DropItem

class MyPipeline(object):
    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_item(self, item, spider):
        if item.get('some_extra_field'):  # already enriched on a previous pass
            return item
        url = 'some_url'
        req = scrapy.Request(url, self.parse_item, meta={'item': item})
        # Schedule the request on the engine; it is fetched asynchronously
        # alongside the spider's own requests, so the pipeline never blocks.
        self.crawler.engine.crawl(req, spider)
        raise DropItem('Deferred until the extra request completes')  # we will get this item next time

    def parse_item(self, response):
        item = response.meta['item']
        item['some_extra_field'] = '...'
        # Returning the item sends it through the pipelines again; this time
        # 'some_extra_field' is set, so process_item lets it through.
        return item