
Scrapy XPath selector doesn't select all the HTML tags

I am trying to scrape all the product names on https://www.walmart.com/search/?query=ps3&cat_id=0 using the Scrapy Python library.

This is my parse function:

    def parseWalmart(self, response):
        print("INSIDE PARSE WALMART")
        for product in response.xpath('//div[@id="searchProductResult"]/div[@class="search-result-listview-items"]//div[starts-with(@data-tl-id,"ProductTileListView-")]'):
            print(product)
            product_name = product.xpath('.//div[contains(@class,"search-result-product-title listview")]//a//span//text()').extract()
            product_page = product.xpath('.//div[contains(@class,"search-result-product-title listview")]//a/@href').extract()
            product_name = " ".join(product_name)
            print(product_name)
            print("-------------------------------------")

And this is my Scrapy request:

    yield scrapy.Request(url=i, callback=self.parseWalmart, headers = {"User-Agent":"Mozilla/5.0"})

However, I am only able to scrape 4 products, when there are actually a dozen of them on the page. I don't understand why. These are the 4 products that I scraped:

    <Selector xpath='//div[@id="searchProductResult"]/div[@class="search-result-listview-items"]//div[starts-with(@data-tl-id,"ProductTileListView-")]' data='<div data-tl-id="ProductTileListView-0">'>
    ABLEGRID Wireless Bluetooth Game Controller for Sony PS3 Black
    -------------------------------------
    <Selector xpath='//div[@id="searchProductResult"]/div[@class="search-result-listview-items"]//div[starts-with(@data-tl-id,"ProductTileListView-")]' data='<div data-tl-id="ProductTileListView-1">'>
    Arsenal Gaming PS3 Wired Controller, Black
    -------------------------------------
    <Selector xpath='//div[@id="searchProductResult"]/div[@class="search-result-listview-items"]//div[starts-with(@data-tl-id,"ProductTileListView-")]' data='<div data-tl-id="ProductTileListView-2">'>
    Refurbished Sony PlayStation 3 Slim 320 GB Charcoal Black Console
    -------------------------------------
    <Selector xpath='//div[@id="searchProductResult"]/div[@class="search-result-listview-items"]//div[starts-with(@data-tl-id,"ProductTileListView-")]' data='<div data-tl-id="ProductTileListView-3">'>
    Sonic's Ultimate Genesis Collection ( PS3 )
    -------------------------------------


That happens because the HTML that Scrapy downloads originally contains only 4 divs whose data-tl-id starts with 'ProductTileListView-'. However, you can find the information for all products in a script embedded in the page.
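One way to check this is to count the matching tiles in the raw HTML that Scrapy receives, before any JavaScript has run. The sketch below uses only the standard library; TileCounter, count_tiles and sample_html are hypothetical names introduced for illustration, and sample_html is a made-up stand-in for response.text rather than real Walmart markup.

```python
from html.parser import HTMLParser


class TileCounter(HTMLParser):
    """Counts <div> tags whose data-tl-id attribute starts with a prefix."""

    def __init__(self, prefix="ProductTileListView-"):
        super().__init__()
        self.prefix = prefix
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        for name, value in attrs:
            if name == "data-tl-id" and value and value.startswith(self.prefix):
                self.count += 1


def count_tiles(html):
    """Return how many product tiles are present in the given HTML string."""
    counter = TileCounter()
    counter.feed(html)
    return counter.count


# Hypothetical stand-in for response.text: only two tiles in the raw HTML.
sample_html = (
    '<div id="searchProductResult">'
    '<div data-tl-id="ProductTileListView-0"></div>'
    '<div data-tl-id="ProductTileListView-1"></div>'
    '</div>'
)
print(count_tiles(sample_html))
```

Inside the spider, calling count_tiles(response.text) should report the same number of tiles that the XPath loop iterates over, which would confirm that the missing products are simply not in the downloaded markup.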

Here is how I get the information for all products:

    import re
    import json

    # Capture the JSON array that sits between "items": and ,"secondaryItems"
    data = re.findall('"items":(.+?),"secondaryItems"', response.body.decode("utf-8"), re.S)
    products_json = json.loads(data[0])
    len(products_json)  # returns 20

Notice that the products array starts with "items": and ends with ,"secondaryItems".

Structure of one product:

    {
        "productId": "2H53I08Z1K78",
        "usItemId": "23422902",
        "productType": "REGULAR",
        "title": "Watch Dogs (<mark>PS3</mark>)",
        ....
        "imageUrl": "https://i5.walmartimages.com/asr/70aecbb1-5dbf-4a64-a86d-134a8fc7edee_2.59805d79db07665c20cc4e4fadc35743.jpeg?odnHeight=180&odnWidth=180&odnBg=ffffff",
        "productPageUrl": "/ip/Watch-Dogs-PS3/23422902",
        "upc": "0000888834804",
    }
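Putting the pieces together, a minimal sketch of turning the embedded array into product names and page URLs might look like this. It assumes the page still embeds the array between "items": and ,"secondaryItems" and that each item carries the "title" and "productPageUrl" keys shown above; extract_products and sample_body are hypothetical names, and sample_body is a shortened, made-up stand-in for the real response body.

```python
import json
import re

# Lazily capture everything between "items": and ,"secondaryItems".
ITEMS_PATTERN = re.compile(r'"items":(.+?),"secondaryItems"', re.S)
# Titles may contain <mark> highlight tags around the search term.
MARK_TAG = re.compile(r"</?mark>")


def extract_products(body):
    """Return (title, productPageUrl) pairs from the embedded items array."""
    match = ITEMS_PATTERN.search(body)
    if match is None:
        return []  # the page layout changed or the script is missing
    items = json.loads(match.group(1))
    return [
        (MARK_TAG.sub("", item.get("title", "")), item.get("productPageUrl"))
        for item in items
    ]


# Hypothetical, shortened stand-in for response.body.decode("utf-8").
sample_body = (
    '<script>{"items":[{"title":"Watch Dogs (<mark>PS3</mark>)",'
    '"productPageUrl":"/ip/Watch-Dogs-PS3/23422902"}],'
    '"secondaryItems":[]}</script>'
)
for title, url in extract_products(sample_body):
    print(title, url)
```

In a spider callback, the relative productPageUrl can be made absolute with response.urljoin(url) before yielding it. The lazy regex is fragile if the site changes the surrounding keys, so returning an empty list when nothing matches is safer than indexing data[0] directly.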
