[英]Web-Scraping: moving to next pages using Scrapy for getting all data
我需要从亚马逊上的产品中删除所有评论:
我正在使用 Scrapy 来执行此操作。 然而,下面的代码似乎并没有抓取所有的评论,因为它们被分成了 n 个不同的页面。 人们应该首先点击所有评论,然后点击下一页。 我想知道如何使用 scrapy 或 python 中的其他工具来执行此操作。该产品有 5893 条评论,我无法手动获取此信息。
目前我的代码如下:
import scrapy
from scrapy.crawler import CrawlerProcess
class My_Spider(scrapy.Spider):
name = 'spid'
start_urls = ['https://www.amazon.com/Cascade-ActionPacs-Dishwasher-Detergent-Packaging/dp/B01NGTV4J5/ref=pd_rhf_cr_s_trq_bnd_0_6/130-6831149-4603948?_encoding=UTF8&pd_rd_i=B01NGTV4J5&pd_rd_r=b6f87690-19d7-4dba-85c0-b8f54076705a&pd_rd_w=AgonG&pd_rd_wg=GG9yY&pf_rd_p=4e0a494a-50c5-45f5-846a-abfb3d21ab34&pf_rd_r=QAD0984X543RFMNNPNF2&psc=1&refRID=QAD0984X543RFMNNPNF2']
def parse(self, response):
for row in response.css('div.review'):
item = {}
item['author'] = row.css('span.a-profile-name::text').extract_first()
rating = row.css('i.review-rating > span::text').extract_first().strip().split(' ')[0]
item['rating'] = int(float(rating.strip().replace(',', '.')))
item['title'] = row.css('span.review-title > span::text').extract_first()
yield item
并执行爬虫:
process = CrawlerProcess({
})
process.crawl(My_Spider)
process.start()
你能告诉我是否可以转到下一页并删除所有评论吗? 这应该是存储评论的页面。
使用 url https://www.amazon.com/Cascade-ActionPacs-Dishwasher-Detergent-Packaging/product-reviews/B01NGTV4J5/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=<PUT PAGE NUMBER HERE>
这个:
import scrapy
from scrapy.crawler import CrawlerProcess
class My_Spider(scrapy.Spider):
name = 'spid'
start_urls = ['https://www.amazon.com/Cascade-ActionPacs-Dishwasher-Detergent-Packaging/product-reviews/B01NGTV4J5/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=1']
def parse(self, response)
for row in response.css('div.review'):
item = {}
item['author'] = row.css('span.a-profile-name::text').extract_first()
rating = row.css('i.review-rating > span::text').extract_first().strip().split(' ')[0]
item['rating'] = int(float(rating.strip().replace(',', '.')))
item['title'] = row.css('span.review-title > span::text').extract_first()
yield item
next_page = response.css('ul.a-pagination > li.a-last > a::attr(href)').get()
yield scrapy.Request(url=next_page))
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.