Scrapy crawling categories and pages

I'm new to Scrapy and to Python, and I'm having a hard time understanding the flow. I can't figure out where to place the "crawl to next page" logic: should it go in the method that yields the callback to parse_data, or inside parse_data itself?

The intended logic: for each category, scrape all pages in that category.

Option 1:

import scrapy

class Amazon01Spider(scrapy.Spider):
    name = 'amazon0.1'
    allowed_domains = ['amazon.com']
    start_urls = ['https://amazon.com/Books/s?ie=UTF8&page=1&rh=n%3A283155&srs=9187220011']


    def parse(self, response):
        # Collect the category links from the left-hand navigation and queue
        # one request per category.
        cats = response.xpath('//*[@id="leftNavContainer"]//*[@class="a-unordered-list a-nostyle a-vertical s-ref-indent-two"]//li//@href').extract()
        for cat in cats:
            yield scrapy.Request(response.urljoin(cat), callback=self.parse_data)


    def parse_data(self, response):
        # Scrape the items on the current category page.
        items = response.xpath('//*[@class="a-fixed-left-grid-col a-col-right"]')
        for item in items:
            name = item.xpath('.//*[@class="a-row a-spacing-small"]/div/a/h2/text()').extract_first()
            yield {'Name': name}

        # Pagination happens here, inside the per-category callback.
        next_page_url = response.xpath('//*[@class="pagnLink"]/a/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse_data)

Option 2:

import scrapy

class Amazon01Spider(scrapy.Spider):
    name = 'amazon0.1'
    allowed_domains = ['amazon.com']
    start_urls = ['https://amazon.com/Books/s?ie=UTF8&page=1&rh=n%3A283155&srs=9187220011']


    def parse(self, response):
        # Collect the category links and queue one request per category.
        cats = response.xpath('//*[@id="leftNavContainer"]//*[@class="a-unordered-list a-nostyle a-vertical s-ref-indent-two"]//li//@href').extract()
        for cat in cats:
            yield scrapy.Request(response.urljoin(cat), callback=self.parse_data)

        # Pagination happens here, in the top-level callback.
        next_page_url = response.xpath('//*[@class="pagnLink"]/a/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url))


    def parse_data(self, response):
        # Scrape the items on the current category page.
        items = response.xpath('//*[@class="a-fixed-left-grid-col a-col-right"]')
        for item in items:
            name = item.xpath('.//*[@class="a-row a-spacing-small"]/div/a/h2/text()').extract_first()
            yield {'Name': name}

In your specific example I would choose option 1, as it exactly follows your intended script logic. In general, when there is more than one way to achieve the goal, I prefer to follow a kind of top-down principle: start with the main page, extract the data from that page, yield a request to the next page (and any other top-level pages) if there is one, and only at the end yield the requests to the lower-level pages. There are a couple of reasons for this. First, it is less error-prone: pagination takes place in the upper-level method, so if anything goes wrong (parsing errors, etc.) in the lower-level methods, your pagination still happens; if pagination lived in a lower-level method, such an error could stop it entirely. Second, this way you can avoid needless filtering of duplicate requests.
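To make that ordering concrete, here is a minimal sketch of a spider built on the top-down principle. The spider name, URLs and selectors below are hypothetical placeholders (they are not taken from your pages); only the order of the yields in parse is the point.

import scrapy

class TopDownSpider(scrapy.Spider):
    # All names, URLs and selectors in this sketch are placeholders for
    # illustration; they are not taken from the question.
    name = 'top_down_example'
    start_urls = ['https://example.com/listing?page=1']

    def parse(self, response):
        # 1) Extract the data that lives on the listing page itself.
        for row in response.xpath('//div[@class="listing-row"]'):
            yield {'Name': row.xpath('.//h2/text()').extract_first()}

        # 2) Queue the next listing page early, so pagination does not depend
        #    on the lower-level callbacks parsing successfully.
        next_page = response.xpath('//a[@class="next-page"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

        # 3) Only then queue requests to the lower-level detail pages.
        for href in response.xpath('//div[@class="listing-row"]/a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)

    def parse_detail(self, response):
        # Lower-level page: extract the per-item details.
        yield {'Description': response.xpath('//div[@id="description"]/text()').extract_first()}

Note the guard on the next-page link: without it, the spider would try to build a URL from None once it reaches the last page.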
