
Web-Scraping: moving to next pages using Scrapy for getting all data

I need to scrape all the reviews for a product on Amazon:

https://www.amazon.com/Cascade-ActionPacs-Dishwasher-Detergent-Packaging/dp/B01NGTV4J5/ref=pd_rhf_cr_s_trq_bnd_0_6/130-6831149-4603948?_encoding=UTF8&pd_rd_i=B01NGTV4J5&pd_rd_r=b6f87690-19d7-4dba-85c0-b8f54076705a&pd_rd_w=AgonG&pd_rd_wg=GG9yY&pf_rd_p=4e0a494a-50c5-45f5-846a-abfb3d21ab34&pf_rd_r=QAD0984X543RFMNNPNF2&psc=1&refRID=QAD0984X543RFMNNPNF2

I am using Scrapy to do this. However, the following code does not scrape all the reviews, as they are split across multiple pages. A human would click on "all reviews", then click "next page" repeatedly. I am wondering how I could do this using Scrapy or a different Python tool. There are 5893 reviews for this product and I cannot collect them manually.

Currently my code is the following:

import scrapy
from scrapy.crawler import CrawlerProcess

class My_Spider(scrapy.Spider):
    name = 'spid'
    start_urls = ['https://www.amazon.com/Cascade-ActionPacs-Dishwasher-Detergent-Packaging/dp/B01NGTV4J5/ref=pd_rhf_cr_s_trq_bnd_0_6/130-6831149-4603948?_encoding=UTF8&pd_rd_i=B01NGTV4J5&pd_rd_r=b6f87690-19d7-4dba-85c0-b8f54076705a&pd_rd_w=AgonG&pd_rd_wg=GG9yY&pf_rd_p=4e0a494a-50c5-45f5-846a-abfb3d21ab34&pf_rd_r=QAD0984X543RFMNNPNF2&psc=1&refRID=QAD0984X543RFMNNPNF2']

    def parse(self, response):
        for row in response.css('div.review'):
            item = {}

            item['author'] = row.css('span.a-profile-name::text').extract_first()

            rating = row.css('i.review-rating > span::text').extract_first().strip().split(' ')[0]
            item['rating'] = int(float(rating.strip().replace(',', '.')))

            item['title'] = row.css('span.review-title > span::text').extract_first()
            yield item

And to execute the crawler:

process = CrawlerProcess({
})

process.crawl(My_Spider)
process.start() 
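As a side note, the rating conversion inside `parse()` is easier to verify if it is factored into a small standalone function. The sketch below mirrors the spider's logic exactly; the sample rating strings are assumptions about what Amazon renders (e.g. "4.0 out of 5 stars"), not scraped output:

```python
def parse_rating(text):
    """Convert a rating label like '4.0 out of 5 stars' to an int.

    Mirrors the spider's logic: take the first whitespace-separated
    token, normalize a decimal comma (some locales render '4,0'),
    and truncate the float to an integer.
    """
    first_token = text.strip().split(' ')[0]
    return int(float(first_token.replace(',', '.')))
```

Keeping this logic separate also makes it easier to handle the case where `extract_first()` returns `None` before calling `.strip()`.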

Can you tell me if it is possible to move to the next pages and scrape all the reviews? The reviews seem to be stored on a dedicated page.

With the URL https://www.amazon.com/Cascade-ActionPacs-Dishwasher-Detergent-Packaging/product-reviews/B01NGTV4J5/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=<PUT PAGE NUMBER HERE> you could do something like this:

import scrapy
from scrapy.crawler import CrawlerProcess

class My_Spider(scrapy.Spider):
    name = 'spid'
    start_urls = ['https://www.amazon.com/Cascade-ActionPacs-Dishwasher-Detergent-Packaging/product-reviews/B01NGTV4J5/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=1']

    def parse(self, response):
        for row in response.css('div.review'):
            item = {}
            item['author'] = row.css('span.a-profile-name::text').extract_first()
            rating = row.css('i.review-rating > span::text').extract_first().strip().split(' ')[0]
            item['rating'] = int(float(rating.strip().replace(',', '.')))
            item['title'] = row.css('span.review-title > span::text').extract_first()
            yield item
        # Follow the "Next page" link; it is absent on the last page,
        # so the crawl stops naturally when get() returns None.
        next_page = response.css('ul.a-pagination > li.a-last > a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
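Since the review URL follows a predictable `pageNumber` pattern, an alternative to chasing the "next" link is to generate all page URLs up front. This is a sketch under one assumption: the caller supplies the last page number (derivable from the review count, since Amazon typically shows 10 reviews per page, so 5893 reviews would span about 590 pages):

```python
# Template taken from the review URL above; only pageNumber varies.
BASE = ('https://www.amazon.com/Cascade-ActionPacs-Dishwasher-Detergent-Packaging/'
        'product-reviews/B01NGTV4J5/ref=cm_cr_arp_d_paging_btm_next_2'
        '?ie=UTF8&reviewerType=all_reviews&pageNumber={page}')

def review_page_urls(last_page):
    """Build the list of review-page URLs from page 1 to last_page."""
    return [BASE.format(page=n) for n in range(1, last_page + 1)]
```

You could then pass this list as `start_urls` instead of following pagination links, though the link-following version is more robust if Amazon changes how many pages exist.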
