(Python) Scrapy - How to scrape a JS dropdown list?
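I want to scrape the JavaScript list in the "Size" section of this address:
What I want to do is get the sizes that are in stock, returned as a list. How can I do that?
Here is my full code: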
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.http import Request


class ShoesSpider(Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com"]
    start_urls = ['http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119']

    def parse(self, response):
        shoes = response.xpath('//*[@class="grid-item-image-wrapper sprite-sheet sprite-index-0"]/a/@href').extract()
        for shoe in shoes:
            yield Request(shoe, callback=self.parse_shoes)

    def parse_shoes(self, response):
        name = response.xpath('//*[@itemprop="name"]/text()').extract_first()
        price = response.xpath('//*[@itemprop="price"]/text()').extract_first()
        # sizes = ??
        yield {
            'name': name,
            'price': price,
            'sizes': sizes
        }
Thanks.
Here is code that extracts the in-stock sizes.
import scrapy


class ShoesSpider(scrapy.Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com"]
    start_urls = ['http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119']

    def parse(self, response):
        # Every <option> of the size dropdown, in stock or not.
        sizes = response.xpath('//*[@class="nsg-form--drop-down exp-pdp-size-dropdown exp-pdp-dropdown two-column-dropdown"]/option')
        for s in sizes:
            # The predicate drops the text of options marked as out of stock.
            size = s.xpath('text()[not(parent::option/@class="exp-pdp-size-not-in-stock selectBox-disabled")]').extract_first('').strip()
            if size:  # skip options whose text was filtered out
                yield {'Size': size}
The result looks like this:
M 4 / W 5.5
M 4.5 / W 6
M 6.5 / W 8
M 7 / W 8.5
M 7.5 / W 9
M 8 / W 9.5
M 8.5 / W 10
M 9 / W 10.5
In the for loop, if we write it like this, it extracts all sizes, whether or not they are in stock.
size = s.xpath('text()').extract_first('').strip()
However, if you only want the sizes that are in stock: the out-of-stock ones are marked with the class "exp-pdp-size-not-in-stock selectBox-disabled", so you have to exclude them by adding:
[not(parent::option/@class="exp-pdp-size-not-in-stock selectBox-disabled")]
I have tested it on other shoe pages and it works there as well.
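One caveat with the predicate above: it compares the whole `class` attribute as a single string, so it would stop matching if the classes appeared in a different order or another class were added. The sketch below illustrates a token-based check instead. It uses only the standard library so it can be run without Scrapy, and the sample markup, the `InStockSizes` class and the `in_stock_sizes` helper are hypothetical, modelled on the class names quoted above. Inside a spider, a comparable idea would be an XPath predicate such as `not(contains(@class, "exp-pdp-size-not-in-stock"))`.

```python
from html.parser import HTMLParser

# Hypothetical markup modelled on the class names quoted above; the real page may differ.
SAMPLE_HTML = """
<select class="nsg-form--drop-down exp-pdp-size-dropdown">
  <option class="exp-pdp-size-not-in-stock selectBox-disabled">M 3.5 / W 5</option>
  <option>M 4 / W 5.5</option>
  <option class="selectBox-disabled exp-pdp-size-not-in-stock">M 5 / W 6.5</option>
  <option>M 6.5 / W 8</option>
</select>
"""

OUT_OF_STOCK = "exp-pdp-size-not-in-stock"


class InStockSizes(HTMLParser):
    """Collect the text of <option> tags that lack the out-of-stock class."""

    def __init__(self):
        super().__init__()
        self.sizes = []
        self._keep = False  # True while inside an in-stock <option>

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            # Split into class tokens so order and extra classes do not matter.
            classes = (dict(attrs).get("class") or "").split()
            self._keep = OUT_OF_STOCK not in classes

    def handle_endtag(self, tag):
        if tag == "option":
            self._keep = False

    def handle_data(self, data):
        text = data.strip()
        if self._keep and text:
            self.sizes.append(text)


def in_stock_sizes(html):
    parser = InStockSizes()
    parser.feed(html)
    return parser.sizes


print(in_stock_sizes(SAMPLE_HTML))
```

Note that the third sample option lists its classes in reverse order and is still excluded, which an exact string comparison on `@class` would miss.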
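The sizes are loaded via an AJAX call.
So you will have to make another request to that AJAX URL to scrape the sizes.
Here is the full working code. (I have not run the code, but I am sure it works.)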
# -*- coding: utf-8 -*-
import json

from scrapy import Spider
from scrapy.http import Request


class ShoesSpider(Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com"]
    start_urls = ['http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119']

    def parse(self, response):
        shoes = response.xpath('//*[@class="grid-item-image-wrapper sprite-sheet sprite-index-0"]/a/@href').extract()
        for shoe in shoes:
            yield Request(shoe, callback=self.parse_shoes)

    def parse_shoes(self, response):
        data = {}
        data['name'] = response.xpath('//*[@itemprop="name"]/text()').extract_first()
        data['price'] = response.xpath('//*[@itemprop="price"]/text()').extract_first()
        # The sizes come from a separate AJAX endpoint; this URL is hard-coded for one product.
        sizes_url = "http://store.nike.com/html-services/templateData/pdpData?action=getPage&path=%2Fus%2Fen_us%2Fpd%2Fmagista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat%2Fpid-11229710%2Fpgid-11918119&productId=11229710&productGroupId=11918119&catalogId=100701&cache=true&country=US&lang_locale=en_US"
        # Pass the fields scraped so far to the next callback through meta.
        yield Request(url=sizes_url, callback=self.parse_sizes, meta={'data': data})

    def parse_sizes(self, response):
        # Callback for the AJAX response, which is JSON.
        resp = json.loads(response.body)
        data = response.meta['data']
        sizes = resp['response']['pdpData']['skuContainer']['productSkus']
        sizesArray = []
        for a in sizes:
            sizesArray.append(a["displaySize"])
        yield {
            'name': data['name'],
            'price': data['price'],
            'sizes': sizesArray}
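Since the code above was not run against the live endpoint, it may help to check the JSON handling on its own first. The following is a minimal sketch that walks the same key path (`response`, `pdpData`, `skuContainer`, `productSkus`, `displaySize`) but tolerates missing keys instead of raising `KeyError`. The payload is a hypothetical, heavily trimmed stand-in, and `extract_display_sizes` is an illustrative helper, not part of Scrapy.

```python
import json

# Hypothetical, heavily trimmed payload: only the key path used above is reproduced.
SAMPLE_BODY = b"""
{"response": {"pdpData": {"skuContainer": {"productSkus": [
    {"displaySize": "4"},
    {"displaySize": "4.5"},
    {"sku": "no-size-field"}
]}}}}
"""


def extract_display_sizes(body):
    """Return every displaySize found under response.pdpData.skuContainer.productSkus."""
    payload = json.loads(body)
    node = payload
    # Walk down the nested dictionaries, falling back to {} when a key is absent.
    for key in ("response", "pdpData", "skuContainer"):
        node = node.get(key) or {}
    skus = node.get("productSkus") or []
    return [sku["displaySize"] for sku in skus if "displaySize" in sku]


print(extract_display_sizes(SAMPLE_BODY))
print(extract_display_sizes(b'{"response": {}}'))
```

In the spider, the same helper could be called with `response.body` inside the sizes callback.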
Note:
The sizes_url will be different for each product, so you will have to spend some time working out which parameters it needs.
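As a starting point for that investigation, here is a hedged sketch that rebuilds the example sizes_url from the product page URL. It assumes that `productId` and `productGroupId` are the numbers after `pid-` and `pgid-` in the path, that the country and locale are the first two path segments, and that `catalogId` stays `100701` for every product. None of this is confirmed beyond the single URL shown above, and `build_sizes_url` is a hypothetical helper.

```python
import re
from urllib.parse import urlencode, urlsplit

# Endpoint and parameter names are copied from the sizes_url above.
PDP_DATA_ENDPOINT = "http://store.nike.com/html-services/templateData/pdpData"


def build_sizes_url(product_url, catalog_id="100701"):
    """Derive the AJAX URL from a product page URL (hypothetical helper)."""
    path = urlsplit(product_url).path
    match = re.search(r"/pid-(\d+)/pgid-(\d+)", path)
    if match is None:
        raise ValueError("no pid/pgid in URL: %s" % product_url)
    product_id, group_id = match.groups()
    # Assumption: the path starts with /<country>/<lang_locale>/, e.g. /us/en_us/.
    country, locale = path.strip("/").split("/")[:2]
    lang, _, region = locale.partition("_")
    params = {
        "action": "getPage",
        "path": path,
        "productId": product_id,
        "productGroupId": group_id,
        "catalogId": catalog_id,  # assumption: the same for every product
        "cache": "true",
        "country": country.upper(),
        "lang_locale": "%s_%s" % (lang, region.upper()),
    }
    # urlencode percent-encodes the slashes in the path as %2F.
    return PDP_DATA_ENDPOINT + "?" + urlencode(params)


print(build_sizes_url(
    "http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119"
))
```

In the spider, `build_sizes_url(response.url)` could replace the hard-coded string, provided the assumptions hold for the other products.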