
Returning Items in scrapy's start_requests()

I am writing a scrapy spider that takes many URLs as input and classifies them into categories (returned as items). These URLs are fed to the spider via my crawler's start_requests() method.

Some URLs can be classified without downloading them, so I would like to yield an Item for them directly in start_requests(), which is forbidden by scrapy. How can I circumvent this?

I have thought about catching these requests in a custom middleware that would turn them into spurious Response objects, which I could then convert into Item objects in the request callback, but any cleaner solution would be welcome.

You could use a Downloader Middleware to do this job.

In start_requests(), you should always make a request, for example:

def start_requests(self):
    for url in all_urls:
        yield scrapy.Request(url)

However, you should write a downloader middleware:

from scrapy.http import Response

class DirectReturn:
    def process_request(self, request, spider):
        url = request.url
        # direct_return_url_set: the URLs that can be classified without
        # downloading (assumed to be defined or imported elsewhere).
        if url in direct_return_url_set:
            # Mark the request so the spider callback can recognize it,
            # then short-circuit the download with an empty Response.
            request.meta['direct_return_url'] = True
            return Response(url, request=request)
        # Returning None lets Scrapy download the request normally.
        return None
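
For Scrapy to actually use this middleware, it must be enabled in the project settings (the module path below is a placeholder; adjust it and the priority to your project):

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DirectReturn': 543,
}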

Then, in your parse method, just check whether the key direct_return_url is in response.meta. If it is, generate an item, put response.url into it, and yield that item.
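
A minimal sketch of such a parse callback (the dict fields and the classify_url / classify_page helpers are hypothetical placeholders for your own classification logic):

def parse(self, response):
    if response.meta.get('direct_return_url'):
        # The middleware short-circuited the download; classify from the URL alone.
        yield {'url': response.url, 'category': classify_url(response.url)}
    else:
        # Normal case: classify using the downloaded page body.
        yield {'url': response.url, 'category': classify_page(response)}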

I think using a spider middleware and overriding start_requests() would be a good start.

In your middleware, you should loop over all URLs in start_urls, and you could use conditional statements to deal with the different types of URLs.

  • For your special URLs which do not require a request, you can
    • directly call your pipeline's process_item(); do not forget to import your pipeline and create a scrapy.Item from your URL for this
    • as you mentioned, pass the URL as meta in a Request, and have a separate parse function which would only return the URL (see the sketch after this list)
  • For all remaining URLs, you can launch a "normal" Request as you probably have already defined
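
One possible sketch of the second option, written directly in the spider for brevity (can_classify_without_download and classify_url are hypothetical helpers standing in for your own classification logic):

import scrapy

class ClassifySpider(scrapy.Spider):
    name = 'classify'

    def start_requests(self):
        for url in self.all_urls:  # assumed list of input URLs
            if can_classify_without_download(url):  # hypothetical predicate
                # Carry the URL along in meta and route it to a lightweight callback.
                yield scrapy.Request(url, callback=self.parse_url_only,
                                     meta={'original_url': url})
            else:
                # "Normal" request for all remaining URLs.
                yield scrapy.Request(url, callback=self.parse)

    def parse_url_only(self, response):
        # Only the URL is needed here; the downloaded body is ignored.
        url = response.meta['original_url']
        yield {'url': url, 'category': classify_url(url)}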
