Scrapy Crawler Doesn't Follow Links
I am having some trouble trying to figure out why my secondary function is failing to follow through to the new link and then output data. The `parse` function works just fine. It's when it calls back to `parse_puppy` that nothing happens. When I check the JSON output I see that everything from `puppy` was successfully scraped, but there's nothing from `parse_puppy`.
On line 28, if I change the method to `follow` I get results, but it's the same result about a dozen times.
Code:
import scrapy
from scrapy.cmdline import execute


class Spider(scrapy.Spider):
    name = "puppyDetails"

    def start_requests(self):
        urls = ['https://ws.petango.com/webservices/adoptablesearch/wsAdoptableAnimals.aspx?species=Dog&gender=A&agegroup=UnderYear&location=&site=&onhold=A&orderby=name&colnum=3&css=http://ws.petango.com/WebServices/adoptablesearch/css/styles.css&authkey=io53xfw8b0k2ocet3yb83666507n2168taf513lkxrqe681kf8&recAmount=&detailsInPopup=No&featuredPet=Include&stageID=&wmode=opaque']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # GRAB ALL TOPICAL PUPPY DETAILS
        for animal in response.css("div.list-animal-info-block"):
            yield {
                'puppy_name': animal.css('div.list-animal-name a::text').get(),
                'puppy_id': animal.css('div.list-animal-id::text').get(),
                'puppy_sex': animal.css('div.list-animal-sexSN::text').get(),
                'puppy_breed': animal.css('div.list-animal-breed::text').get(),
                'puppy_age': animal.css('div.list-animal-age::text').get(),
                'puppy_link': animal.css('div.list-animal-name a::attr(href)').get()
            }

        # DIVE INTO DETAILS PAGE
        detail_page = response.css('div.list-animal-name a::attr(href)').get()
        self.logger.info('get puppy details')
        # GO TO THE PUPPY DETAILS
        yield response.follow_all(detail_page, callback=self.parse_puppy)

    def parse_puppy(self, response):
        # GRAB PUPPY DETAILS
        for puppyDetails in response.xpath('//*[@class="detail-table"]//tr'):
            yield {
                'puppy_id': puppyDetails.xpath('//*[@id="lblID"]/text()').extract(),
                'puppy_status': puppyDetails.xpath('//*[@id="lblStage"]/text()').extract(),
                'puppy_intake_date': puppyDetails.xpath('//*[@id="lblIntakeDate"]/text()').extract()
            }


execute(['scrapy','crawl','puppyDetails'])
Error:
ERROR: Spider must return Request, BaseItem, dict or None, got 'generator' in <GET https://ws.petango.com/webservices/adoptablesearch/wsAdoptableAnimals.aspx?species=Dog&gender=A&agegroup=UnderYear&location=&site=&onhold=A&orderby=name&colnum=3&css=http://ws.petango.com/WebServices/adoptablesearch/css/styles.css&authkey=io53xfw8b0k2ocet3yb83666507n2168taf513lkxrqe681kf8&recAmount=&detailsInPopup=No&featuredPet=Include&stageID=&wmode=opaque>
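The `got 'generator'` part of this message can be reproduced without Scrapy. The sketch below uses a hypothetical `make_requests` helper as a stand-in for any method that returns a generator of requests (as `response.follow_all` does); the helper name and the URL paths are made up for illustration. It shows that a plain `yield` hands the caller the generator object itself, while `yield from` hands over each element inside it.

```python
def make_requests(urls):
    """Hypothetical stand-in for a method that returns a generator of requests."""
    return (f"Request({url})" for url in urls)


def callback_with_yield(urls):
    # Yields ONE object: the generator itself, not the requests inside it.
    yield make_requests(urls)


def callback_with_yield_from(urls):
    # Delegates to the generator, yielding each request individually.
    yield from make_requests(urls)


urls = ["/puppy/1", "/puppy/2"]
bad = list(callback_with_yield(urls))
good = list(callback_with_yield_from(urls))

print(type(bad[0]).__name__)  # the single yielded object is a generator
print(good)                   # each request arrives as its own item
```

A Scrapy callback is consumed the same way, so a generator yielded as a single item is rejected because it is neither a Request, an item, a dict, nor None.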
The line should be
yield from response.follow_all(detail_page, callback=self.parse_puppy)