Scrapy returns only the first product for each page
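I'm trying to learn Scrapy, but I've been stuck on a problem for 3 days; maybe some of you can help me with a solution or some advice. I'm trying to extract all products from one category of a site. For each product I need only 3 fields: 2 from the listing page where all the products are shown, and 1 from the product details page (the product code); to get that one, I visit each product's link. All extracted products go to an ItemLoader. In the items file I used the MapCompose and TakeFirst processors for all fields.
The problem is that my code extracts only the first product from each page.
Here is the code: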
import os
import scrapy
from scrapy.loader import ItemLoader
from ..items import CutotulItem


class CutotulSpiderLoader(scrapy.Spider):
    name = 'cutotul_spider_loader'
    start_urls = ['https://cutotul.ro/39-karcher-aspiratoare-profesionale']

    def __init__(self):
        self.model = ""

    def start_requests(self):
        yield scrapy.Request('https://cutotul.ro/39-karcher-aspiratoare-profesionale', callback=self.parse)

    def parse(self, response):
        products = response.css("div.columns-container")
        for product in products:
            # get details link
            details_link = product.xpath("//a[@class='lnk_view btn btn-default']/@href").get()
            # get details
            yield response.follow(url=details_link, callback=self.parse_details)
            product_name_xpath = "//span[@class='grid-name']/text()"
            product_price_xpath = "//span[@class='price product-price']/text()"
            product_model_xpath = "".join(self.model)
            # loader
            loader = ItemLoader(item=CutotulItem(), selector=product, response=response)
            loader.add_xpath("product_name", product_name_xpath)
            loader.add_xpath("product_price", product_price_xpath)
            loader.add_value("product_model", product_model_xpath)
            yield loader.load_item()
        # nav to next page
        # Get the next response for x items from the next page - persist until no more #
        next_page = response.xpath("//li[@class='pagination_next']//@href").get()
        if next_page:
            yield response.follow(url=next_page, callback=self.parse)

    def parse_details(self, response):
        # set variable to response for model
        self.model = response.css("span[itemprop='sku']").css("::text").get()
How can I solve this?
Thanks a lot!
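Your XPath should be relative (.//a rather than //a).
You also picked the wrong selector for products: div.columns-container is just the container that holds those items.
The way you use parse_details doesn't work - Scrapy is asynchronous, so it does not wait for self.model to be updated, and you get an empty string. I fixed it as an example, but you can fix it any way you like.
import os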
import scrapy
from scrapy.loader import ItemLoader
# from ..items import CutotulItem


class CutotulItem(scrapy.Item):
    product_name = scrapy.Field()
    product_price = scrapy.Field()
    product_model = scrapy.Field()


class CutotulSpiderLoader(scrapy.Spider):
    name = 'cutotul_spider_loader'
    start_urls = ['https://cutotul.ro/39-karcher-aspiratoare-profesionale']

    def __init__(self):
        self.model = ""

    def start_requests(self):
        yield scrapy.Request('https://cutotul.ro/39-karcher-aspiratoare-profesionale', callback=self.parse)

    async def parse(self, response):
        # products = response.css("div.columns-container")
        products = response.css('div.product-container')
        for product in products:
            # get details link
            details_link = product.xpath(".//a[@class='lnk_view btn btn-default']/@href").get()
            # get details
            # yield response.follow(url=details_link, callback=self.parse_details)
            req = response.follow(url=details_link)
            resp = await self.crawler.engine.download(req, self)
            self.model = resp.css("span[itemprop='sku']").css("::text").get()
            product_name_xpath = ".//span[@class='grid-name']/text()"
            product_price_xpath = ".//span[@class='price product-price']/text()"
            product_model_xpath = "".join(self.model)
            # loader
            loader = ItemLoader(item=CutotulItem(), selector=product, response=response)
            loader.add_xpath("product_name", product_name_xpath)
            loader.add_xpath("product_price", product_price_xpath)
            loader.add_value("product_model", product_model_xpath)
            yield loader.load_item()
        # nav to next page
        # Get the next response for x items from the next page - persist until no more #
        next_page = response.xpath("//li[@class='pagination_next']//@href").get()
        if next_page:
            yield response.follow(url=next_page, callback=self.parse)

    def parse_details(self, response):
        # set variable to response for model
        self.model = response.css("span[itemprop='sku']").css("::text").get()