Scrapy gets only the first 24 items of a page

I tried many ways to scrape an IKEA category page and found that the last page actually shows all the items. But when I scrape the last page of IKEA's product listing, it returns only the first 24 items (which correspond to the items displayed on the first page). This is the URL of the page: https://www.ikea.com/fr/fr/cat/lits-bm003/?page=12

And this is the spider:

import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'Ikea'
    pages = 9
    start_urls = ['https://www.ikea.com/fr/fr/cat/canapes-fu003/?page=12']

    def parse(self, response):
        # The product list container holds one compact card per product.
        products = response.css('div.plp-product-list')
        for product in products:
            for p in product.css('div.range-revamp-product-compact'):
                # .get() returns the first match, or None when nothing matches,
                # so a card with a missing field does not raise an IndexError.
                yield {
                    'Title': p.css('div.range-revamp-header-section__title--small::text').get(),
                    'Price': p.css('span.range-revamp-price__integer::text').get(),
                    'Desc': p.css('span.range-revamp-header-section__description-text::text').get(),
                    'Img': p.css('img.range-revamp-aspect-ratio-image__image::attr(src)').get(),
                }

A Scrapy spider doesn't run JavaScript (that's the job of a browser). It only receives the same response content that cURL would, so any items the page adds with JavaScript after loading are not in the HTML you parse.

To do exactly what you describe, you need a browser-based solution such as Selenium (Python) or Cypress (JavaScript), ideally running as a headless browser. The alternative is to request each page separately.
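As a rough sketch of the "each page separately" option, the URLs for every page can be generated up front and handed to the spider. The helper name `page_urls` is made up for this example, and the last page number 12 is taken from the URL in the question. Whether the server really returns a different batch of items for each `page` value is an assumption that the question does not confirm, so check one or two pages with cURL first.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(category_url, last_page):
    """Return one URL per results page, from page 1 to last_page inclusive.

    Hypothetical helper: it only rewrites the `page` query parameter and
    makes no request itself.
    """
    parts = urlsplit(category_url)
    # Keep any other query parameters, drop an existing `page` value.
    other_params = [(k, v) for k, v in parse_qsl(parts.query) if k != "page"]
    urls = []
    for page in range(1, last_page + 1):
        query = urlencode(other_params + [("page", str(page))])
        urls.append(
            urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))
        )
    return urls


# Assumption: 12 is the last page, as in the URL from the question.
start_urls = page_urls("https://www.ikea.com/fr/fr/cat/canapes-fu003/?page=12", 12)
```

Assigning this list to the spider's `start_urls` makes Scrapy call `parse` once per page, so the existing selectors can stay as they are.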

There are probably better ways of doing this, but this addresses your exact question.
