(Python) Scrapy - How to scrape a JS dropdown list?

I want to scrape the JavaScript-driven list in the 'size' section of this page:

http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119

I want to get the sizes that are in stock, returned as a list. How can I do that?

Here's my full code:

# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.http import Request

class ShoesSpider(Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com"]
    start_urls = ['http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119']

    def parse(self, response):       
        shoes = response.xpath('//*[@class="grid-item-image-wrapper sprite-sheet sprite-index-0"]/a/@href').extract()
        for shoe in shoes:
            yield Request(shoe, callback=self.parse_shoes) 

    def parse_shoes(self, response):
        name = response.xpath('//*[@itemprop="name"]/text()').extract_first()
        price = response.xpath('//*[@itemprop="price"]/text()').extract_first()
        #sizes = ??

        yield {
            'name' : name,
            'price' : price,
            'sizes' : sizes
        }

Thanks

Here is the code to extract the sizes that are in stock.

import scrapy


class ShoesSpider(scrapy.Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com"]
    start_urls = ['http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119']

    def parse(self, response):
        sizes = response.xpath('//*[@class="nsg-form--drop-down exp-pdp-size-dropdown exp-pdp-dropdown two-column-dropdown"]/option')


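        for s in sizes:
            # The predicate drops text nodes whose parent <option> is marked out of stock.
            size = s.xpath('text()[not(parent::option/@class="exp-pdp-size-not-in-stock selectBox-disabled")]').extract_first('').strip()
            if size:  # out-of-stock options come back as '', so skip them
                yield {'Size': size}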


Here is the result:

M 4 / W 5.5
M 4.5 / W 6
M 6.5 / W 8
M 7 / W 8.5
M 7.5 / W 9
M 8 / W 9.5
M 8.5 / W 10
M 9 / W 10.5

In the for loop, writing the line like this extracts all the sizes, whether or not they are in stock:

size = s.xpath('text()').extract_first('').strip()


To get only the sizes that are in stock, you have to exclude the out-of-stock options, which are marked with the class "exp-pdp-size-not-in-stock selectBox-disabled", by adding this predicate:

[not(parent::option/@class="exp-pdp-size-not-in-stock selectBox-disabled")]
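The same filtering idea can be tried offline, without Scrapy. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree instead of Scrapy selectors, the markup is made up (the real page may differ), and `in_stock_sizes` is a hypothetical helper name. It checks for the out-of-stock class as a single token rather than comparing the whole class string, so it still works if the classes appear in a different order.

```python
import xml.etree.ElementTree as ET

# Made-up markup for illustration; the real page's attributes may differ.
SAMPLE = """
<select name="skuAndSize">
  <option class="exp-pdp-size-not-in-stock selectBox-disabled">M 3.5 / W 5</option>
  <option>M 4 / W 5.5</option>
  <option class="selectBox-disabled exp-pdp-size-not-in-stock">M 5 / W 6.5</option>
  <option>  M 6.5 / W 8 </option>
</select>
"""

OUT_OF_STOCK = "exp-pdp-size-not-in-stock"


def in_stock_sizes(markup):
    """Return the text of every <option> that lacks the out-of-stock class."""
    root = ET.fromstring(markup)
    sizes = []
    for option in root.iter("option"):
        # Split the class attribute into tokens so that order does not matter.
        classes = (option.get("class") or "").split()
        if OUT_OF_STOCK in classes:
            continue
        text = (option.text or "").strip()
        if text:
            sizes.append(text)
    return sizes


print(in_stock_sizes(SAMPLE))
```

In a spider, a comparable token-style check could be written in XPath as `//option[not(contains(@class, "exp-pdp-size-not-in-stock"))]/text()`, which should be less sensitive to changes in the other class names than an exact match on the full attribute.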



I have tested it on other shoe pages, and it works there as well.

Sizes are loaded by an AJAX call.

So you will have to make another request to that AJAX URL in order to scrape the sizes.

Here is the full code. (I have not run it on my side, but I am confident it works.)

# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.http import Request
import json

class ShoesSpider(Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com"]
    start_urls = ['http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119']

    def parse(self, response):       
        shoes = response.xpath('//*[@class="grid-item-image-wrapper sprite-sheet sprite-index-0"]/a/@href').extract()
        for shoe in shoes:
            yield Request(shoe, callback=self.parse_shoes) 

    def parse_shoes(self, response):
        data = {}
        data['name'] = response.xpath('//*[@itemprop="name"]/text()').extract_first()
        data['price'] = response.xpath('//*[@itemprop="price"]/text()').extract_first()
        #sizes = ??


        sizes_url = "http://store.nike.com/html-services/templateData/pdpData?action=getPage&path=%2Fus%2Fen_us%2Fpd%2Fmagista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat%2Fpid-11229710%2Fpgid-11918119&productId=11229710&productGroupId=11918119&catalogId=100701&cache=true&country=US&lang_locale=en_US"
        yield Request(url = sizes_url, callback=self.parse_sizes, meta={'data':data}) 


    def parse_sizes(self, response):
        # Callback for sizes_url: the AJAX endpoint returns JSON.
        resp = json.loads(response.body)

        # Name and price scraped from the product page travel in meta.
        data = response.meta['data']

        sizes = resp['response']['pdpData']['skuContainer']['productSkus']

        sizesArray = []

        for a in sizes:
            sizesArray.append(a["displaySize"])

        yield {
            'name' : data['name'],
            'price' : data['price'],
            'sizes' : sizesArray}
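The JSON handling in the sizes callback can be exercised on its own before wiring it into the spider. This is a minimal sketch, assuming the response has the nesting used above (response, pdpData, skuContainer, productSkus, displaySize); the sample payload is invented and `extract_display_sizes` is a hypothetical helper, not part of Scrapy. It returns an empty list instead of raising when a key is missing, which may be preferable if the endpoint's shape changes.

```python
import json

# Hypothetical payload: only the keys used in the answer are mirrored here.
SAMPLE_BODY = json.dumps({
    "response": {"pdpData": {"skuContainer": {"productSkus": [
        {"displaySize": "4"},
        {"displaySize": "4.5"},
        {"sku": "no-size-field"},
    ]}}}
})


def extract_display_sizes(body):
    """Return the displaySize of each SKU, or [] if the expected keys are absent."""
    node = json.loads(body)
    # Walk down the nesting one key at a time, bailing out on any surprise.
    for key in ("response", "pdpData", "skuContainer"):
        node = node.get(key) if isinstance(node, dict) else None
        if node is None:
            return []
    skus = node.get("productSkus") if isinstance(node, dict) else None
    if not isinstance(skus, list):
        return []
    return [sku["displaySize"] for sku in skus
            if isinstance(sku, dict) and "displaySize" in sku]


print(extract_display_sizes(SAMPLE_BODY))
```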

NOTE:

The sizes_url will be different for each product, so you will have to spend some time working out which parameters it takes.
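As a starting point, the URL in the code above appears to be derived from the product page URL: the path is percent-encoded, and productId and productGroupId match the pid- and pgid- segments. The sketch below rebuilds it on that assumption. `build_sizes_url` is a hypothetical helper, and the catalogId, country and lang_locale values are simply copied from the example; whether they hold for other products is not known and should be checked in the browser's network tab.

```python
import re
from urllib.parse import urlencode, urlsplit

# Endpoint taken from the sizes_url in the answer above.
SERVICE = "http://store.nike.com/html-services/templateData/pdpData"


def build_sizes_url(product_url, catalog_id="100701"):
    """Build the AJAX URL for a product page (catalog_id default is an assumption)."""
    path = urlsplit(product_url).path
    match = re.search(r"/pid-(\d+)/pgid-(\d+)", path)
    if match is None:
        raise ValueError("no pid/pgid segments in %r" % product_url)
    product_id, group_id = match.groups()
    # urlencode percent-encodes the slashes in the path for us.
    query = urlencode([
        ("action", "getPage"),
        ("path", path),
        ("productId", product_id),
        ("productGroupId", group_id),
        ("catalogId", catalog_id),
        ("cache", "true"),
        ("country", "US"),
        ("lang_locale", "en_US"),
    ])
    return SERVICE + "?" + query


print(build_sizes_url(
    "http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-"
    "firm-ground-soccer-cleat/pid-11229710/pgid-11918119"
))
```

Inside parse_shoes, this could be called with response.url in place of the hard-coded string.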
