Returning Items in scrapy's start_requests()
I am writing a scrapy spider that takes many URLs as input and classifies them into categories (returned as items). These URLs are fed to the spider via my crawler's start_requests() method.
Some URLs can be classified without downloading them, so I would like to yield an Item for them directly in start_requests(), which is forbidden by scrapy. How can I circumvent this?
I have thought about catching these requests in a custom middleware that would turn them into spurious Response objects, which I could then convert into Item objects in the request callback, but any cleaner solution would be welcome.
You could use a downloader middleware to do this job.
In start_requests(), you should always make a request, for example:
    def start_requests(self):
        # Always yield a Request, even for URLs that need no download
        for url in all_urls:
            yield scrapy.Request(url)
Then write a downloader middleware that short-circuits the URLs that need no download:
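    from scrapy.http import Response

    class DirectReturn:
        def process_request(self, request, spider):
            # direct_return_url_set is assumed to be defined elsewhere: the
            # URLs that can be classified without downloading them
            if request.url in direct_return_url_set:
                request.meta['direct_return_url'] = True
                # Returning a Response here makes Scrapy skip the download
                return Response(request.url, request=request)
            # Returning None lets Scrapy download the request as usual
            return None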
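The middleware only runs once it is enabled in the project settings. A minimal sketch, assuming the class lives in a module called myproject.middlewares (a hypothetical path, adjust it to your project) and using 543 as an arbitrary priority:

```python
# settings.py
# "myproject.middlewares" is a hypothetical module path; point it at the
# module where the DirectReturn class is actually defined.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.DirectReturn": 543,
}
```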
Then, in your parse method, just check whether the key direct_return_url is in response.meta. If it is, create an item, put response.url in it, and yield the item.
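The check described above might look roughly like the sketch below. It is written as a plain function so it can be tried outside a crawl; in a spider it would be the body of parse(self, response). The two classify_* helpers and the item fields are hypothetical placeholders, not part of scrapy, and the item is a plain dict.

```python
from types import SimpleNamespace


def classify_from_url(url):
    """Hypothetical classifier for URLs that need no download."""
    return "image" if url.lower().endswith((".png", ".jpg")) else "other"


def classify_from_body(response):
    """Hypothetical classifier for pages that were actually downloaded."""
    return "page"


def parse(response):
    """Callback logic; in a spider this would be parse(self, response)."""
    if response.meta.get("direct_return_url"):
        # The middleware short-circuited this request: use the URL alone.
        yield {"url": response.url, "category": classify_from_url(response.url)}
    else:
        # Normal case: the page was downloaded and can be inspected.
        yield {"url": response.url, "category": classify_from_body(response)}


# Stand-in for a Scrapy response, only to exercise the logic outside a crawl.
fake = SimpleNamespace(url="http://example.com/a.png",
                       meta={"direct_return_url": True})
items = list(parse(fake))
```

Using response.meta.get() rather than a bare key lookup keeps the callback working for responses that never passed through the flagging branch.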
I think using a spider middleware and overriding start_requests() would be a good start.
In your middleware, you should loop over all URLs in start_urls, and you could use conditional statements to deal with the different types of URLs.
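One possible reading of this suggestion is sketched below: a spider middleware that loops over the start requests and flags the ones that need no download. It assumes a Scrapy version that still calls the process_start_requests hook (newer releases replace it with process_start), and the URL set and class name are hypothetical. The sketch only tags requests; something else, such as a downloader middleware, would still have to act on the flag.

```python
from types import SimpleNamespace

# Hypothetical set of URLs that can be classified without a download.
DIRECT_RETURN_URLS = {"http://example.com/known.png"}


class TagDirectReturnMiddleware:
    """Spider middleware sketch: flags start requests that need no download."""

    def process_start_requests(self, start_requests, spider):
        for request in start_requests:
            if request.url in DIRECT_RETURN_URLS:
                # Mark the request so later components can short-circuit it.
                request.meta["direct_return_url"] = True
            yield request


# Stand-ins for scrapy.Request objects, only to exercise the loop.
requests = [
    SimpleNamespace(url="http://example.com/known.png", meta={}),
    SimpleNamespace(url="http://example.com/page.html", meta={}),
]
tagged = list(
    TagDirectReturnMiddleware().process_start_requests(requests, spider=None)
)
```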