Why does Scrapy only collect information from the first item on each page?

Question

I have the following spider, but it only collects the first item on each page.

Can someone explain why? I can't find my mistake.

import scrapy

class PerfumesSpider(scrapy.Spider):
    name = 'perfumes'
    allowed_domains = ['www.fragrancenet.com']
    start_urls = ['https://www.fragrancenet.com/fragrances']
    

    def parse(self, response):
        for perfumes in response.xpath("//div[@id='resultSet']"):
            
            #nome = perfumes.xpath(".//span[@class='brand-name']/text()").get(),
            link = perfumes.xpath(".//p[@class='desc']/a/@href").get()
        
            yield response.follow(url=link, callback=self.parse_produto, meta={'link' : link})
 
        next_page = response.xpath("//a[@data-rel='next']/@href").get()
 
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse)

    def parse_produto(self, response):
        link = response.request.meta['link']
        for produto in response.xpath("//div[@class='topZone cf']"):

            yield{
            'link': link,
            'gender': produto.xpath(".//span[@class='genderBar desktop']/span/span/text()").get(),
            'name': produto.xpath(".//span[@class='productTitle']/text()").get(),            
            'year': produto.xpath(".//ul[@class='notes cf']/li[3]/span[2]/text()").get(),            
            'brand': produto.xpath("normalize-space(.//p[@class='uDesigner']/a/text())").get(),
            'size': produto.xpath(".//span[@class='sr-only']/text()").getall(),
            'price': produto.xpath(".//div[@class='pricing']/text()").getall(),
            'discount': produto.xpath(".//div[@class='fnet-offer']/a/span/span/text()").get(), 
            }

I would be very grateful for any help.

Answer 1

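You need to select the divs inside resultSet. The expression //div[@id='resultSet'] presumably matches only the single container element, so the loop body runs once per page and .get() returns just the first link inside it. Try changing this: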

for perfumes in response.xpath("//div[@id='resultSet']"):

to something like this (I'm not sure about the exact XPath, so you will have to double-check it against the page):

for perfumes in response.xpath("//div[@id='resultSet']//div[@class='resultItem']"):
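
To see why the choice of loop target matters, here is a minimal sketch that runs without Scrapy. It uses the standard library's xml.etree.ElementTree on a made-up page; the markup, the resultItem class name, and the /p/1 style links are assumptions for illustration only, not the real site's structure. Here find() plays the role that .get() plays on a Scrapy selector: it returns only the first match.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified listing markup; the real page may differ.
PAGE = """
<html><body>
  <div id="resultSet">
    <div class="resultItem"><p class="desc"><a href="/p/1">One</a></p></div>
    <div class="resultItem"><p class="desc"><a href="/p/2">Two</a></p></div>
    <div class="resultItem"><p class="desc"><a href="/p/3">Three</a></p></div>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)


def first_href(node):
    """Return the href of the first product link under node, or None."""
    link = node.find(".//p[@class='desc']/a")
    return link.get("href") if link is not None else None


# Looping over the container: a single match, so a single iteration
# and only the first link is ever seen.
container_links = [
    first_href(div) for div in root.findall(".//div[@id='resultSet']")
]

# Looping over the items inside the container: one iteration per product.
item_links = [
    first_href(div)
    for div in root.findall(".//div[@id='resultSet']//div[@class='resultItem']")
]

print(container_links)
print(item_links)
```

The same reasoning should carry over to response.xpath(...) in the spider: iterate over whatever element wraps one product, then use relative paths (starting with .//) inside the loop.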
