
Making a Non-Blocking HTTP Request from Scrapy Pipeline

As I understand it, Scrapy is single threaded but async on the network side. I am working on something which requires an API call to an external resource from within the item pipeline. Is there any way to make the HTTP request without blocking the pipeline and slowing down Scrapy's crawling?

Thanks

You can do it by scheduling a request directly on the crawler engine via crawler.engine.crawl(request, spider). To do that, however, you first need to expose the crawler in your pipeline:

import scrapy
from scrapy.exceptions import DropItem


class MyPipeline(object):
    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_item(self, item, spider):
        if item.get('some_extra_field'):  # already enriched on a previous pass
            return item
        url = 'some_url'
        # Schedule a new request on the engine; it is handled asynchronously
        # alongside the spider's own requests, so the pipeline never blocks.
        req = scrapy.Request(url, self.parse_item, meta={'item': item})
        self.crawler.engine.crawl(req, spider)
        raise DropItem('Re-scheduled; we will get this item next time')

    def parse_item(self, response):
        item = response.meta['item']
        item['some_extra_field'] = '...'
        return item
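When the scheduled request comes back, the item returned from parse_item goes through the pipeline again; the some_extra_field check is what stops it from being re-scheduled forever. For any of this to run, the pipeline also has to be enabled in the project settings. A minimal sketch, assuming the class lives in a hypothetical myproject.pipelines module (adjust the path to your project):

# settings.py -- module path is an assumption, use your own
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}

One caveat: crawler.engine.crawl is not a documented stable API, and its signature has changed across Scrapy releases (newer versions drop the spider argument), so check it against the version you are running.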
