Python yield function with a callback argument

Question

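This is my first time asking a question here, so please forgive me if I get something wrong. I'm new to Python (I have been learning for a month) and I'm trying to use Scrapy to learn more about spiders.

Here is the problem: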

    def get_chapterurl(self, response):
       item = DingdianItem()
       item['name'] = str(response.meta['name']).replace('\xa0', '')
       yield item
       yield Request(url=response.url, callback=self.get_chapter, meta={'name':name_id})


    def get_chapter(self, response):
       urls = re.findall(r'<td class="L"><a href="(.*?)">(.*?)</a></td>', response.text)

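As you can see, I yield both an item and a Request, but get_chapter never reaches its first line (I have a breakpoint there). Where did I go wrong? Sorry to bother you. I have googled for a while but found nothing...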

Answer 1

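Your request is being filtered out.

Scrapy has a built-in request filter that prevents you from downloading the same page twice (this is an intended feature).

Let's say you are on http://example.com; this request that you yield: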

yield Request(url=response.url, callback=self.get_chapter, meta={'name':name_id})

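tries to download http://example.com again. If you look at the crawl log, it should say something along the lines of "Filtered duplicate request: <GET http://example.com>".
You can always bypass this feature by setting the dont_filter=True parameter on the Request object, like so: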

yield Request(url=response.url, callback=self.get_chapter, meta={'name':name_id},
              dont_filter=True)
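
To make the filtering behaviour concrete, here is a minimal toy model of a duplicate filter in plain Python. It is only a sketch: the class name ToyDupeFilter and its should_schedule method are made up for illustration and are not Scrapy's actual API, and the real filter compares request fingerprints rather than raw URLs.

```python
class ToyDupeFilter:
    """Toy stand-in for a crawler's duplicate filter (hypothetical, not Scrapy's class)."""

    def __init__(self):
        self.seen = set()  # URLs that have already been scheduled

    def should_schedule(self, url, dont_filter=False):
        # dont_filter=True skips the duplicate check entirely.
        if dont_filter:
            return True
        if url in self.seen:
            return False  # duplicate: dropped, so its callback never runs
        self.seen.add(url)
        return True


dupefilter = ToyDupeFilter()
first = dupefilter.should_schedule("http://example.com")
second = dupefilter.should_schedule("http://example.com")
forced = dupefilter.should_schedule("http://example.com", dont_filter=True)
print(first, second, forced)
```

The second call models what happens to the request in the question: the URL was already fetched once, so the request is dropped and get_chapter is never called.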

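However! I'm having trouble understanding the intent of your code, but it seems that you don't really want to download the same URL twice.
You don't have to schedule a new request either; you can just call the callback with the response you already have: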

response.meta['name'] = name_id  # update meta in place; Response.replace() does not accept a meta argument
# why crawl it again, if we can just call the callback directly!
# for python2
for result in self.get_chapter(response):
    yield result
# or, if you are running python3, use this instead of the loop above:
yield from self.get_chapter(response)
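
As a rough, Scrapy-free illustration of calling the callback directly, the sketch below uses plain generator functions and a dict in place of a real response. The fake_response structure and its "links" key are assumptions made up for this example; only the delegation pattern itself carries over.

```python
def get_chapter(response):
    # Hypothetical second callback: yields one result per chapter link.
    for link in response["links"]:
        yield {"chapter": link, "name": response["meta"]["name"]}


def get_chapterurl(response):
    # First yield the item, as in the question.
    yield {"name": response["meta"]["name"]}
    # Then delegate to the second callback with the response we already have.
    yield from get_chapter(response)


# A dict standing in for a Scrapy response (assumption for illustration only).
fake_response = {"meta": {"name": "book"}, "links": ["/1.html", "/2.html"]}
results = list(get_chapterurl(fake_response))
print(results)
```

Note that get_chapter is a generator function, so calling it without iterating over the result (or delegating with yield from) would not run any of its body.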

Answer 1: accepted, 2017-03-18 10:26:45
