Scrapy spider not following links

Question

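I'm having trouble working out why my helper function fails to follow the new links and output data. The parse function works fine. When it calls back to parse_puppy, nothing happens. When I check the JSON output, I see that all of the puppy content was scraped successfully, but there is nothing from parse_puppy.

On line 28 (the response.follow_all call), if I change the method to follow, I do get results, but it's the same result about a dozen times.

Code: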

import scrapy
from scrapy.cmdline import execute

class Spider(scrapy.Spider):
    name = "puppyDetails"

    def start_requests(self):
        urls = ['https://ws.petango.com/webservices/adoptablesearch/wsAdoptableAnimals.aspx?species=Dog&gender=A&agegroup=UnderYear&location=&site=&onhold=A&orderby=name&colnum=3&css=http://ws.petango.com/WebServices/adoptablesearch/css/styles.css&authkey=io53xfw8b0k2ocet3yb83666507n2168taf513lkxrqe681kf8&recAmount=&detailsInPopup=No&featuredPet=Include&stageID=&wmode=opaque']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # GRAB ALL TOPICAL PUPPY DETAILS
        for animal in response.css("div.list-animal-info-block"):
            yield {
                'puppy_name': animal.css('div.list-animal-name a::text').get(),
                'puppy_id': animal.css('div.list-animal-id::text').get(),
                'puppy_sex': animal.css('div.list-animal-sexSN::text').get(),
                'puppy_breed': animal.css('div.list-animal-breed::text').get(),
                'puppy_age': animal.css('div.list-animal-age::text').get(),
                'puppy_link': animal.css('div.list-animal-name a::attr(href)').get()
            }

            # DIVE INTO DETAILS PAGE
            detail_page = response.css('div.list-animal-name a::attr(href)').get()
            self.logger.info('get puppy details')
            # GO TO THE PUPPY DETAILS
            yield response.follow_all(detail_page, callback=self.parse_puppy)

    def parse_puppy(self, response):
        # GRAB PUPPY DETAILS
        for puppyDetails in response.xpath('//*[@class="detail-table"]//tr'):
            yield {
                'puppy_id': puppyDetails.xpath('//*[@id="lblID"]/text()').extract(),
                'puppy_status': puppyDetails.xpath('//*[@id="lblStage"]/text()').extract(),
                'puppy_intake_date': puppyDetails.xpath('//*[@id="lblIntakeDate"]/text()').extract()
            }

execute(['scrapy','crawl','puppyDetails'])

Error:

ERROR: Spider must return Request, BaseItem, dict or None, got 'generator' in <GET https://ws.petango.com/webservices/adoptablesearch/wsAdoptableAnimals.aspx?species=Dog&gender=A&agegroup=UnderYear&location=&site=&onhold=A&orderby=name&colnum=3&css=http://ws.petango.com/WebServices/adoptablesearch/css/styles.css&authkey=io53xfw8b0k2ocet3yb83666507n2168taf513lkxrqe681kf8&recAmount=&detailsInPopup=No&featuredPet=Include&stageID=&wmode=opaque>

Answer 1

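response.follow_all returns a generator of Request objects, so a plain yield hands Scrapy the generator itself, which is exactly what the error reports (got 'generator'). Delegate to the generator with yield from instead. That line should be: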

yield from response.follow_all(detail_page, callback=self.parse_puppy)
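
To make the difference concrete, here is a minimal sketch in plain Python, with no Scrapy involved. The follow_all function below is a hypothetical stand-in that only mimics the "returns a generator" behaviour of response.follow_all; the URLs and the Request(...) strings are made up for illustration.

```python
import inspect


def follow_all(urls):
    # Hypothetical stand-in for response.follow_all: it returns a
    # generator that produces one "request" per URL.
    return (f"Request({url})" for url in urls)


def parse_with_yield(urls):
    # Mirrors the question's code: yields the generator object itself.
    yield follow_all(urls)


def parse_with_yield_from(urls):
    # Mirrors the fix: delegates, so each request is yielded one by one.
    yield from follow_all(urls)


urls = ["/puppy/1", "/puppy/2"]  # made-up example links

bad = list(parse_with_yield(urls))
good = list(parse_with_yield_from(urls))

print(len(bad), inspect.isgenerator(bad[0]))  # one item, and it is a generator
print(good)  # the individual requests
```

Two further things in the question's code may be worth checking. Inside the loop, detail_page is selected with response.css(...).get(), which returns the first matching link on the whole page on every iteration; animal.css(...) would select the link belonging to the current animal. Also, detail_page is a single string, and as far as I know follow_all iterates over its urls argument, so passing a list (or calling response.follow for a single link) is probably safer. Treat that second point as an assumption to verify against the Scrapy documentation.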
