
Scrapy crawl category links till product page

Hello, I am trying to scrape an ecommerce website that loads data with scrolling and a "load more" button. I followed How to scrape website with infinite scrolling? but when I tried the code, the spider closed without any products; maybe the page structure has changed. I would like some help getting started, as I am quite new to web scraping.

Edit: I am scraping http://www.jabong.com/women/ which has subcategories, and I am trying to scrape the products of all the subcategories. I tried the code above but it didn't work for me, so after doing some research I wrote a version that works but doesn't satisfy my goal. So far I have tried this:

import scrapy
from scrapy.http import Request
from koovs.items import product

class propubSpider(scrapy.Spider):
    name = 'koovs'
    allowed_domains = ['jabong.com']
    max_pages = 40

    def start_requests(self):
        # page numbers on this site start at 1, so skip page 0
        for i in range(1, self.max_pages + 1):
            yield scrapy.Request(
                'http://www.jabong.com/women/clothing/tops-tees-shirts/tops/?page=%d' % i,
                callback=self.parse)

    def parse(self, response):
        for sel in response.xpath('//*[@id="catalog-product"]/section[2]'):
            item = product()
            item['price'] = sel.xpath('//*[@class="price"]/span[2]/text()').extract()
            item['image'] = sel.xpath('//*[@class="primary-image thumb loaded"]/img').extract()
            item['title'] = sel.xpath('//*[@data-original-href]/@href').extract()
            yield item  # without this yield no items ever reach the pipeline

The above code works for one category if I specify the number of pages, but the website has a lot of products per category and I don't know how many pages they span. So I decided to use a crawl spider to go through all the categories and product pages and fetch the data, but I am very new to Scrapy; any help would be highly appreciated.

The first thing you need to understand is that the DOM structure of a website often changes, so a scraper written in the past may or may not work for you now.

So the best approach when scraping a website is to look for a hidden API or a hidden URL that can only be seen when you analyze the site's network traffic. This not only gives you a more reliable way to scrape, it also saves bandwidth, which matters a lot in broad crawling, since most of the time you don't need to download the whole page.

Let's take the website you are crawling as an example, to make this clearer. When you visit the page you can see a button that says "Show More Products". Open your browser's developer tools and select the network analyzer. When you click the button, you will see the browser send a GET request to this link. Check the response and you will see the list of all the products on the first page. Now analyze that link: it has a parameter page=1. Change it to page=2 and you will see the list of all the products on the second page.
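You can probe that paginated URL without Scrapy at all. Below is a minimal sketch: the helper names are made up, and `fetch_page` is a hypothetical stand-in for the real HTTP-and-parse step; the empty-page stop condition mirrors the idea of closing the spider when a page returns no products.

```python
from urllib.parse import urlencode

BASE = "http://www.jabong.com/women/clothing/tops-tees-shirts/tops/"

def page_url(page, limit=52):
    """Build the paginated catalog URL observed in the network analyzer."""
    params = {"page": page, "limit": limit,
              "sortField": "popularity", "sortBy": "desc"}
    return BASE + "?" + urlencode(params)

def collect_products(fetch_page):
    """Walk page=1, 2, ... until a page comes back empty.

    fetch_page is any callable that downloads page_url(n) and returns
    the list of products found on that page.
    """
    products, page = [], 1
    while True:
        batch = fetch_page(page)
        if not batch:  # an empty page marks the end of the category
            break
        products.extend(batch)
        page += 1
    return products
```

For example, `page_url(2)` produces a URL containing `page=2`, so you never need to know the total page count in advance.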

Now go ahead and write the spider. It will look something like this:

import scrapy
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from jabong.items import product

class aqaqspider(scrapy.Spider):  # BaseSpider is deprecated; use scrapy.Spider
    name = "jabong"
    allowed_domains = ["jabong.com"]
    start_urls = [
        "http://www.jabong.com/women/clothing/tops-tees-shirts/tops/?page=1&limit=52&sortField=popularity&sortBy=desc",
    ]
    page = 1

    def parse(self, response):
        products = response.xpath('//*[@id="catalog-product"]/section[2]/div')
        if not products:
            # an empty page means we have walked past the last page
            raise CloseSpider("No more products!")

        for p in products:
            item = product()
            item['price'] = p.xpath('a/div/div[2]/span[@class="standard-price"]/text()').extract()
            item['title'] = p.xpath('a/div/div[1]/text()').extract()
            if item['title']:
                yield item

        # request the next page; dont_filter=True bypasses the duplicate
        # filter so pagination requests are never dropped
        self.page += 1
        yield Request(
            url="http://www.jabong.com/women/clothing/tops-tees-shirts/tops/?page=%d&limit=52&sortField=popularity&sortBy=desc" % self.page,
            callback=self.parse,
            dont_filter=True)
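The spider above covers a single category; to reach every subcategory you first need the category links themselves. A stdlib-only sketch of that first step is below. The markup details are assumptions: which anchors are really subcategory links has to be read from the live page, and the `/women/` filter is just an illustrative guess. In Scrapy itself, a CrawlSpider with a LinkExtractor rule plays the same role.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class CategoryLinkParser(HTMLParser):
    """Collect absolute href values of all anchor tags on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # resolve relative links against the page URL
            self.links.append(urljoin(self.base_url, href))

def category_links(html, base_url):
    parser = CategoryLinkParser(base_url)
    parser.feed(html)
    # keep only links inside the section we care about (assumed pattern)
    return [l for l in parser.links if "/women/" in l]
```

Each link this returns can then be paginated exactly the way the spider above paginates its one category.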

NB: This example is for educational purposes only. Please refer to the website's Terms and Conditions, Privacy Policy, and robots.txt before crawling, scraping, or storing any data from the site.
