
Making a Non-Blocking HTTP Request from Scrapy Pipeline

As I understand it, Scrapy is single-threaded but asynchronous on the network side. I am working on something which requires an API call to an external resource from within the item pipeline. Is there any way to make the HTTP request without blocking the pipeline and slowing down Scrapy's crawling?

Thanks

You can do it by scheduling a request directly to the crawler engine via crawler.engine.crawl(request, spider). To do that, however, you first need to expose the crawler in your pipeline:

import scrapy
from scrapy.exceptions import DropItem


class MyPipeline(object):
    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_item(self, item, spider):
        if item.get('some_extra_field'):  # already enriched on a previous pass
            return item
        url = 'some_url'
        # Schedule an extra request through the engine; it is downloaded
        # asynchronously alongside the spider's own requests.
        req = scrapy.Request(url, self.parse_item, meta={'item': item})
        self.crawler.engine.crawl(req, spider)
        raise DropItem('Re-scheduled; the enriched item will come back through the pipeline')

    def parse_item(self, response):
        item = response.meta['item']
        item['some_extra_field'] = '...'
        # Returning the item sends it through the pipelines again; this time
        # the check in process_item lets it pass.
        return item
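
An alternative that avoids the DropItem/re-crawl round trip is to return a Twisted Deferred from process_item: Scrapy holds back that one item until the Deferred fires, while the reactor keeps crawling in the meantime. Below is a minimal sketch assuming the third-party treq HTTP client; the endpoint URL and the field names (some_field, some_extra_field) are placeholders.

import treq
from twisted.internet import defer


class ApiEnrichPipeline(object):
    @defer.inlineCallbacks
    def process_item(self, item, spider):
        # treq.get returns a Deferred; yielding it frees the reactor
        # to keep servicing the spider's other requests meanwhile.
        response = yield treq.get('https://api.example.com/lookup',
                                  params={'q': item['some_field']})
        data = yield response.json()
        item['some_extra_field'] = data.get('value')
        defer.returnValue(item)

Since process_item is allowed to return a Deferred, Scrapy handles the waiting itself, and the item continues down the remaining pipelines once the API response arrives.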
