
Returning Items in scrapy's start_requests()

I am writing a scrapy spider that takes many URLs as input and classifies them into categories (returned as items). These URLs are fed to the spider via my crawler's start_requests() method.

Some URLs can be classified without downloading them, so I would like to yield an Item for them directly in start_requests(), which is forbidden by scrapy. How can I circumvent this?

I have thought about catching these requests in a custom middleware that would turn them into spurious Response objects, which I could then convert into Item objects in the request callback, but any cleaner solution would be welcome.

You could use a Downloader Middleware to do this job.

In start_requests(), you should always make a request, for example:

def start_requests(self):
    for url in all_urls:
        yield scrapy.Request(url)

However, you should write a downloader middleware:

from scrapy.http import Response

class DirectReturn:
    def process_request(self, request, spider):
        url = request.url
        # direct_return_url_set: the URLs that can be classified without
        # downloading (assumed to be defined or imported elsewhere).
        if url in direct_return_url_set:
            # Mark the request so the spider callback can recognize it,
            # then short-circuit the download with an empty Response.
            request.meta['direct_return_url'] = True
            return Response(url, request=request)
        # Returning None lets Scrapy download the request normally.
        return None
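
For Scrapy to actually use this middleware, it must be enabled in the project settings (the module path below is a placeholder; adjust it and the priority to your project):

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DirectReturn': 543,
}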

Then, in your parse method, just check whether the key direct_return_url is in response.meta. If it is, generate an item, put response.url into it, and yield that item.
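
A minimal sketch of such a parse callback (the dict fields and the classify_url / classify_page helpers are hypothetical placeholders for your own classification logic):

def parse(self, response):
    if response.meta.get('direct_return_url'):
        # The middleware short-circuited the download; classify from the URL alone.
        yield {'url': response.url, 'category': classify_url(response.url)}
    else:
        # Normal case: classify using the downloaded page body.
        yield {'url': response.url, 'category': classify_page(response)}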

I think using a spider middleware and overriding start_requests() would be a good start.

In your middleware, you should loop over all URLs in start_urls, and you could use conditional statements to deal with the different types of URLs.

  • For your special URLs which do not require a request, you can
    • directly call your pipeline's process_item(); do not forget to import your pipeline and create a scrapy.Item from your URL for this
    • as you mentioned, pass the URL as meta in a Request, and have a separate parse function which would only return the URL (see the sketch after this list)
  • For all remaining URLs, you can launch a "normal" Request as you probably have already defined
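
One possible sketch of the second option, written directly in the spider for brevity (can_classify_without_download and classify_url are hypothetical helpers standing in for your own classification logic):

import scrapy

class ClassifySpider(scrapy.Spider):
    name = 'classify'

    def start_requests(self):
        for url in self.all_urls:  # assumed list of input URLs
            if can_classify_without_download(url):  # hypothetical predicate
                # Carry the URL along in meta and route it to a lightweight callback.
                yield scrapy.Request(url, callback=self.parse_url_only,
                                     meta={'original_url': url})
            else:
                # "Normal" request for all remaining URLs.
                yield scrapy.Request(url, callback=self.parse)

    def parse_url_only(self, response):
        # Only the URL is needed here; the downloaded body is ignored.
        url = response.meta['original_url']
        yield {'url': url, 'category': classify_url(url)}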
