scrapy.Request doesn't enter the download middleware, it returns Request instead of response
I'm using scrapy.Spider to scrape, and I want to make another request inside the callback registered in start_requests, but that request doesn't work: it should return a response, yet it only returns a Request.
Following a debug breakpoint, I found that in class Request(object_ref) the request only finishes initialization; it never reaches request = next(slot.start_requests) as expected to actually start requesting, so only a Request object comes back.
Here is my code in brief:
```python
class ProjSpider(scrapy.Spider):
    name = 'Proj'
    allowed_domains = ['mashable.com']

    def start_requests(self):
        # pages
        pages = 10
        for i in range(1, pages):
            url = "https://mashable.com/channeldatafeed/Tech/new/page/" + str(i)
            yield scrapy.Request(url, callback=self.parse_mashable)
```
The Request works fine up to here. What follows is:
```python
    def parse_mashable(self, response):
        item = Item()
        json2parse = response.text
        json_response = json.loads(json2parse)
        d = json_response['dataFeed']  # a list of dicts, each holding the url of a detailed article
        for data in d:
            item_url = data['url']  # the url of the detailed article
            item_response = self.get_response_mashable(item_url)
            # here I want to parse item_response to get the detail
            item['content'] = item_response.xpath("//body").get()
            yield item
```
```python
    def get_response_mashable(self, url):
        response = scrapy.Request(url)
        # using self.parser; I've also defined my own parser and yielded an item,
        # but the problem is it never gets to the callback
        return response  # tried yield as well, but that failed too
```
This is where the Request doesn't work. The url is in the allowed_domains, and it's not a duplicate url. I'm guessing it's because of Scrapy's asynchronous Request mechanism, but how could that affect the request in self.parse_mashable? By then the Request in start_requests has already finished. I managed to do the second request with python Requests-html, but I still couldn't figure out why.
So could anyone point out where I'm going wrong? Thanks in advance!
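The behavior at the breakpoint can be illustrated with a toy model of the engine loop (a simplified sketch, not Scrapy's actual internals): constructing a Request performs no network I/O, so calling scrapy.Request(url) inside a helper just hands back the object that was built. Only requests yielded to the engine get downloaded, and the engine then invokes their callback with the response:

```python
# Toy stand-ins for scrapy.Request / Response (illustrative only).
class Request:
    def __init__(self, url, callback=None):
        self.url = url          # building a Request fetches nothing
        self.callback = callback

class Response:
    def __init__(self, url, text):
        self.url = url
        self.text = text

def toy_engine(start_requests, fetch):
    """Drain the request queue: 'download' each request, run its callback,
    and schedule any Request the callback yields back to the engine."""
    queue = list(start_requests)
    items = []
    while queue:
        req = queue.pop(0)
        resp = Response(req.url, fetch(req.url))
        for result in req.callback(resp):
            if isinstance(result, Request):
                queue.append(result)   # chained request: fetched later
            else:
                items.append(result)   # anything else is a scraped item
    return items

def fetch(url):
    return "body of " + url

def parse(response):
    # yield a second Request instead of trying to fetch inline
    yield Request(response.url + "/detail", callback=parse_detail)

def parse_detail(response):
    yield {"content": response.text}

items = toy_engine([Request("page", callback=parse)], fetch)
print(items)  # [{'content': 'body of page/detail'}]
```

Calling `Request(...)` directly, as `get_response_mashable` does, stops at the first line of `__init__`: the engine never sees the request, so no download and no callback ever happen.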
Scrapy doesn't really expect you to do this the way you're trying to, so it doesn't have a simple way to do it.
What you should do instead is pass the data you've scraped from the original page to the new callback using the request's meta dict.
For details, see Passing additional data to callback functions.