Scrapy returns only the first product for each page

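I'm trying to learn Scrapy, but I've been stuck on a problem for three days; maybe some of you can help with a solution or advice. I'm trying to extract all the products in one category of a site. For each product I need only three fields: two from the category listing page, where all the products are shown, and one from the product details page (the product code). To get the product code, I follow each product's link. All extracted products go into an ItemLoader. In the items file, I used the MapCompose and TakeFirst processors for all fields.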

The problem is that my code extracts only the first product from each page.

Here is the code:

import os
import scrapy
from scrapy.loader import ItemLoader
from ..items import CutotulItem


class CutotulSpiderLoader(scrapy.Spider):
    name = 'cutotul_spider_loader'
    start_urls = ['https://cutotul.ro/39-karcher-aspiratoare-profesionale']

    def __init__(self):
        self.model = ""

    def start_requests(self):
        yield scrapy.Request('https://cutotul.ro/39-karcher-aspiratoare-profesionale', callback=self.parse)

    def parse(self, response):
        products = response.css("div.columns-container")
        for product in products:
            # get details link
            details_link = product.xpath("//a[@class='lnk_view btn btn-default']/@href").get()

            # get details
            yield response.follow(url=details_link, callback=self.parse_details)
            product_name_xpath = "//span[@class='grid-name']/text()"
            product_price_xpath = "//span[@class='price product-price']/text()"
            product_model_xpath = "".join(self.model)

            # loader
            loader = ItemLoader(item=CutotulItem(), selector=product, response=response)
            loader.add_xpath("product_name", product_name_xpath)
            loader.add_xpath("product_price", product_price_xpath)
            loader.add_value("product_model", product_model_xpath)
            yield loader.load_item()

        # nav to next page
        # Get the next response for x items from the next page - persist until no more #

        next_page = response.xpath("//li[@class='pagination_next']//@href").get()
        if next_page:
            yield response.follow(url=next_page, callback=self.parse)

    def parse_details(self, response):
        # set variable to response for model
        self.model = response.css("span[itemprop='sku']").css("::text").get()

How can I solve this?

Thank you very much!

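  1. You need to use a relative XPath for each product (e.g. .//a instead of //a). An XPath that starts with // searches the whole document even when it is called on a product selector, so every iteration returns the first match on the page.
  2. You are not iterating over the items: products (div.columns-container) is only the container that holds them. Iterate over div.product-container instead.
  3. parse_details doesn't work: Scrapy is asynchronous, so parse does not wait for self.model to be updated and you get an empty string. I fixed it as an example, but you can fix it however you like.
  4. What remains is to check that pagination works (I didn't check it) and to fix the price formatting if needed.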

The corrected spider:

import scrapy
from scrapy.loader import ItemLoader
# from ..items import CutotulItem


# Stand-in for the item defined in the project's items.py
class CutotulItem(scrapy.Item):
    product_name = scrapy.Field()
    product_price = scrapy.Field()
    product_model = scrapy.Field()


class CutotulSpiderLoader(scrapy.Spider):
    name = 'cutotul_spider_loader'
    start_urls = ['https://cutotul.ro/39-karcher-aspiratoare-profesionale']

    def start_requests(self):
        yield scrapy.Request('https://cutotul.ro/39-karcher-aspiratoare-profesionale', callback=self.parse)

    async def parse(self, response):
        # iterate over the individual products, not their container
        # products = response.css("div.columns-container")
        products = response.css('div.product-container')
        for product in products:
            # get details link (relative XPath, scoped to this product)
            details_link = product.xpath(".//a[@class='lnk_view btn btn-default']/@href").get()

            # get details: download the details page here and wait for the response
            # yield response.follow(url=details_link, callback=self.parse_details)
            # note: engine.download() is not a stable public API; newer Scrapy
            # releases deprecate its spider argument, so adjust for your version
            req = response.follow(url=details_link)
            resp = await self.crawler.engine.download(req, self)
            # a local variable avoids sharing state between products;
            # default="" keeps the value a string when no SKU is found
            product_model = resp.css("span[itemprop='sku']::text").get(default="")

            product_name_xpath = ".//span[@class='grid-name']/text()"
            product_price_xpath = ".//span[@class='price product-price']/text()"

            # loader
            loader = ItemLoader(item=CutotulItem(), selector=product, response=response)
            loader.add_xpath("product_name", product_name_xpath)
            loader.add_xpath("product_price", product_price_xpath)
            loader.add_value("product_model", product_model)
            yield loader.load_item()

        # nav to next page: follow the "next" link until there are no more pages
        next_page = response.xpath("//li[@class='pagination_next']//@href").get()
        if next_page:
            yield response.follow(url=next_page, callback=self.parse)

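Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you repost this content, please credit this site's URL or the address of the original post. For any questions, contact yoyou2525@163.com.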