I'm new to Scrapy and to Python, and I'm having a hard time understanding the flow. I can't figure out where to place the "crawl to next page" logic: should it come after the callback to parse_data, or inside parse_data itself?
The script logic: for each category in categories, scrape all pages in that category.
Option 1:
import scrapy

class Amazon01Spider(scrapy.Spider):
    name = 'amazon0.1'
    allowed_domains = ['amazon.com']
    start_urls = ['https://amazon.com/Books/s?ie=UTF8&page=1&rh=n%3A283155&srs=9187220011']

    def parse(self, response):
        # Collect the category links from the left-hand navigation.
        cats = response.xpath('//*[@id="leftNavContainer"]//*[@class="a-unordered-list a-nostyle a-vertical s-ref-indent-two"]//li//@href').extract()
        for cat in cats:
            yield scrapy.Request(response.urljoin(cat), callback=self.parse_data)

    def parse_data(self, response):
        items = response.xpath('//*[@class="a-fixed-left-grid-col a-col-right"]')
        for item in items:
            name = item.xpath('.//*[@class="a-row a-spacing-small"]/div/a/h2/text()').extract_first()
            yield {'Name': name}
        # Pagination lives inside the item callback; guard against the last
        # page, where extract_first() returns None.
        next_page_url = response.xpath('//*[@class="pagnLink"]/a/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse_data)
Option 2:
import scrapy

class Amazon01Spider(scrapy.Spider):
    name = 'amazon0.1'
    allowed_domains = ['amazon.com']
    start_urls = ['https://amazon.com/Books/s?ie=UTF8&page=1&rh=n%3A283155&srs=9187220011']

    def parse(self, response):
        cats = response.xpath('//*[@id="leftNavContainer"]//*[@class="a-unordered-list a-nostyle a-vertical s-ref-indent-two"]//li//@href').extract()
        for cat in cats:
            yield scrapy.Request(response.urljoin(cat), callback=self.parse_data)
        # Pagination lives in the top-level callback instead.
        next_page_url = response.xpath('//*[@class="pagnLink"]/a/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url))

    def parse_data(self, response):
        items = response.xpath('//*[@class="a-fixed-left-grid-col a-col-right"]')
        for item in items:
            name = item.xpath('.//*[@class="a-row a-spacing-small"]/div/a/h2/text()').extract_first()
            yield {'Name': name}
In your specific example I would choose option 1, as it exactly follows your intended script logic. In general, when there are several ways to achieve the goal, I prefer to follow a kind of top-down principle: start with the main page, extract data from that page, yield a request to the next page (or other top-level pages) if possible, and only at the end yield requests to lower-level pages. There are a couple of reasons to do so. First, it is more robust: since pagination takes place in the upper-level method, any errors (parsing etc.) in the lower-level methods cannot prevent your pagination from happening. Second, this way you may avoid needless duplicate-request filtering.
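The ordering argument can be sketched without Scrapy at all. The minimal simulation below uses hypothetical in-memory page data (the `PAGES` dict, `parse`, and `parse_data` names are illustrative, not real Scrapy API): pagination happens in the top-level function, so an extraction error on one page stays contained and every listing page is still reached.

```python
# Hypothetical "site": each listing page has items plus an optional next page.
PAGES = {
    "/cat?page=1": {"items": ["Book A", None], "next": "/cat?page=2"},
    "/cat?page=2": {"items": ["Book B"], "next": None},
}

def parse(url, results):
    """Top-level callback: handle pagination first, then hand off items."""
    page = PAGES[url]
    if page["next"]:                  # pagination happens here, so an error
        parse(page["next"], results)  # in parse_data cannot break the crawl
    parse_data(page, results)

def parse_data(page, results):
    """Lower-level callback: extract items; failures stay on this page."""
    for item in page["items"]:
        try:
            results.append({"Name": item.title()})  # fails when item is None
        except AttributeError:
            pass  # one bad item is skipped; the crawl continues

results = []
parse("/cat?page=1", results)
print(results)  # both pages contributed items despite the bad one
```

The broken `None` item on page 1 would have ended a naive loop that paginated only after successful extraction; here page 2 was already scheduled before any item was touched.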