
scrapy.Request doesn't enter the download middleware, it returns Request instead of response

I'm using scrapy.Spider to scrape, and I want to make a request inside the callback of a request yielded from start_requests, but that inner request doesn't work: it should return a response, but it only returns a Request object.

Following a debug breakpoint, I found that in class Request(object_ref) the request only finishes initialization; it never reaches request = next(slot.start_requests) as expected to actually start requesting, so only the Request object is returned.

Here is my code in brief:

import json

import scrapy
from scrapy import Item


class ProjSpider(scrapy.Spider):
    name = 'Proj'
    allowed_domains = ['mashable.com']

    def start_requests(self):
        # pages
        pages = 10
        for i in range(1, pages):
            url = "https://mashable.com/channeldatafeed/Tech/new/page/"+str(i)
            yield scrapy.Request(url, callback=self.parse_mashable)

The requests above work fine. The problem is in what follows:

    def parse_mashable(self, response):
        item = Item()
        json2parse = response.text
        json_response = json.loads(json2parse)
        d = json_response['dataFeed'] # a list containing dicts, in which there is url for detailed article
        for data in d:
            item_url = data['url'] # the url for detailed article
            item_response = self.get_response_mashable(item_url)
            # here I want to parse the item_response to get detail
            item['content'] = item_response.xpath("//body").get()
            yield item

    def get_response_mashable(self, url):
        response = scrapy.Request(url)
        # I've also tried defining my own parser as the callback and
        # yielding an item there, but the callback never ran
        return response  # tried yield as well, but that failed too

This is where the Request doesn't work. The url is in allowed_domains, and it's not a duplicate url. I'm guessing it's caused by Scrapy's asynchronous request mechanism, but how could that affect the request in self.parse_mashable, given that the Request from start_requests has already finished by then? I managed to make the second request with the python requests-html library, but I still couldn't figure out why the Scrapy one fails.

So could anyone help point out where I'm going wrong? Thanks in advance!

Scrapy doesn't really expect you to do this the way you're trying to, so it doesn't have a simple way to do it. Instantiating scrapy.Request only creates a request object; nothing is downloaded until that request is yielded back to the engine, which schedules it and later calls your callback with the response.

What you should be doing instead is passing the data you've scraped from the original page to the new callback using the request's meta dict, and yielding the new request so the engine can schedule it.

For details, check Passing additional data to callback functions in the Scrapy documentation.
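Here is a minimal sketch of that approach, assuming the ProjSpider class and the dataFeed JSON structure from the question; the parse_article callback name is illustrative, and the Item class is assumed to declare a content field (it was abbreviated in the question):

    def parse_mashable(self, response):
        json_response = json.loads(response.text)
        for data in json_response['dataFeed']:
            item = Item()
            # hand the partially built item to the next callback via meta
            yield scrapy.Request(
                data['url'],
                callback=self.parse_article,
                meta={'item': item},
            )

    def parse_article(self, response):
        # recover the item started in parse_mashable and finish filling it
        item = response.meta['item']
        item['content'] = response.xpath("//body").get()
        yield item

The key difference is that the detail request is yielded back to Scrapy's engine, which downloads it asynchronously and invokes parse_article with the resulting response; calling scrapy.Request(url) yourself never triggers a download.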
