Making a Non-Blocking HTTP Request from Scrapy Pipeline
As I understand it, Scrapy is single-threaded but asynchronous on the network side. I am working on something which requires an API call to an external resource from within the item pipeline. Is there any way to make the HTTP request without blocking the pipeline and slowing down Scrapy's crawling?

Thanks
You can do it by scheduling a request directly on the crawler engine, via crawler.engine.crawl(request, spider). To do that, however, you first need to expose the crawler in your pipeline:
import scrapy
from scrapy.exceptions import DropItem

class MyPipeline(object):

    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_item(self, item, spider):
        if item.get('some_extra_field'):  # check if we already made the extra request below
            return item
        url = 'some_url'
        req = scrapy.Request(url, self.parse_item, meta={'item': item})
        self.crawler.engine.crawl(req, spider)
        raise DropItem()  # we will get this item next time, via parse_item

    def parse_item(self, response):
        item = response.meta['item']
        item['some_extra_field'] = '...'
        return item
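For the pipeline above to run at all, it has to be registered in the project settings. A minimal sketch, assuming the pipeline class lives in a hypothetical module path `myproject.pipelines`:

```python
# settings.py -- "myproject" is a hypothetical project name; adjust the
# dotted path to wherever MyPipeline is actually defined.
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,  # lower values run earlier
}
```

The priority number only matters relative to other pipelines; Scrapy passes each item through the enabled pipelines in ascending order of that value.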