
Scrapy xpath selector doesn't select all the HTML tags

I am trying to scrape all the product names on https://www.walmart.com/search/?query=ps3&cat_id=0 using the Scrapy python library.

This is my parse function

    def parseWalmart(self, response):
        print("INSIDE PARSE WALMART")
        for product in response.xpath('//div[@id="searchProductResult"]/div[@class="search-result-listview-items"]//div[starts-with(@data-tl-id,"ProductTileListView-")]'):
            print(product)
            product_name = product.xpath('.//div[contains(@class,"search-result-product-title listview")]//a//span//text()').extract()
            product_page = product.xpath('.//div[contains(@class,"search-result-product-title listview")]//a/@href').extract()
            product_name = " ".join(product_name)
            print(product_name)
            print("-------------------------------------")

and this is my scrapy request

    yield scrapy.Request(url=i, callback=self.parseWalmart, headers = {"User-Agent":"Mozilla/5.0"})

However, I am only able to scrape 4 products, when there are actually a dozen of them. I don't understand why. These are the 4 products that I scraped:

    <Selector xpath='//div[@id="searchProductResult"]/div[@class="search-result-listview-items"]//div[starts-with(@data-tl-id,"ProductTileListView-")]' data='<div data-tl-id="ProductTileListView-0">'>
    ABLEGRID Wireless Bluetooth Game Controller for Sony PS3 Black
    -------------------------------------
    <Selector xpath='//div[@id="searchProductResult"]/div[@class="search-result-listview-items"]//div[starts-with(@data-tl-id,"ProductTileListView-")]' data='<div data-tl-id="ProductTileListView-1">'>
    Arsenal Gaming PS3 Wired Controller, Black
    -------------------------------------
    <Selector xpath='//div[@id="searchProductResult"]/div[@class="search-result-listview-items"]//div[starts-with(@data-tl-id,"ProductTileListView-")]' data='<div data-tl-id="ProductTileListView-2">'>
    Refurbished Sony PlayStation 3 Slim 320 GB Charcoal Black Console
    -------------------------------------
    <Selector xpath='//div[@id="searchProductResult"]/div[@class="search-result-listview-items"]//div[starts-with(@data-tl-id,"ProductTileListView-")]' data='<div data-tl-id="ProductTileListView-3">'>
    Sonic's Ultimate Genesis Collection ( PS3 )
    -------------------------------------


Because there are only 4 divs starting with 'ProductTileListView-' in the DOM originally. However, you can find all the product information in the script of the page.
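If you want to confirm that the data really sits in a script tag, a quick check could look like this (a sketch only; the contains() test is an assumption about how the JSON is embedded in the markup):

    # Look for an inline <script> whose text contains the "items" JSON key.
    # The contains() predicate is an assumption, not a verified selector for this page.
    script_text = response.xpath("//script[contains(., '\"items\":')]/text()").extract_first()
    print(script_text is not None)  # True if the embedded JSON blob was found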

Here is how I get all the product information:

    import re
    import json

    data = re.findall("\"items\":(.+?),\"secondaryItems\"", response.body.decode("utf-8"), re.S)
    products_json = json.loads(data[0])
    len(products_json)  # returns 20

Notice that the products array starts with "items": and ends with ,"secondaryItems".
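As a quick sanity check (a sketch assuming products_json is the list parsed above; the title field is the one shown in the product structure below):

    # Print the number of parsed products and a few of their titles.
    print(len(products_json))        # 20 on this results page
    for p in products_json[:3]:
        print(p.get("title"))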

Structure of one product:

    {
        "productId": "2H53I08Z1K78",
        "usItemId": "23422902",
        "productType": "REGULAR",
        "title": "Watch Dogs (<mark>PS3</mark>)",
        ....
        "imageUrl": "https://i5.walmartimages.com/asr/70aecbb1-5dbf-4a64-a86d-134a8fc7edee_2.59805d79db07665c20cc4e4fadc35743.jpeg?odnHeight=180&odnWidth=180&odnBg=ffffff",
        "productPageUrl": "/ip/Watch-Dogs-PS3/23422902",
        "upc": "0000888834804",
    }
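Putting this back into the spider, a minimal rewrite of the parse callback along these lines could yield the product names and page links directly from the embedded JSON (a sketch, not verified against the live page; the title and productPageUrl keys are the ones shown above, and the <mark> stripping just removes the highlight tags visible in that snippet):

    import json
    import re

    def parseWalmart(self, response):
        # Extract the embedded "items" array from the raw page source
        # instead of relying on the partially rendered product tiles.
        data = re.findall("\"items\":(.+?),\"secondaryItems\"", response.body.decode("utf-8"), re.S)
        if not data:
            return
        for product in json.loads(data[0]):
            # Titles in the embedded JSON can carry <mark> highlight tags (see the snippet above).
            name = re.sub(r"</?mark>", "", product.get("title", ""))
            yield {
                "name": name,
                # productPageUrl is site-relative, e.g. /ip/Watch-Dogs-PS3/23422902
                "url": response.urljoin(product.get("productPageUrl", "")),
            }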
