BeautifulSoup/Selenium: how to scrape a website until the Next Page button is disabled?

So I have a list of URLs (called "data") that contains URLs like https://www.amazon.com/Airpods-Fashion-Protective-Accessories-Silicone/product-reviews/B08YD8JLNQ/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews

and

https://www.amazon.com/Keychain-R-fun-Protective-Accessories-Visible-Sky/product-reviews/B082W7DL1R/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews

Some URLs do not have the "Next Page" icon and some do. So far my code is something like this:

from bs4 import BeautifulSoup
import requests
import csv
import os
import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException


# read the Url column as a Series of links
data = pd.read_csv(r'path to csv file', sep=',', usecols=['Url'], squeeze=True)
rows = []

for url in data:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    # find_all() does not understand CSS selectors; use select() for
    # descendant selectors like these
    names = soup.select('div.celwidget div.aok-relative span.a-profile-name')
    rating = soup.select('div.celwidget div.aok-relative span.a-icon-alt')
    title = soup.select('div.celwidget div.aok-relative a.a-text-bold span')
    content = soup.select('div.celwidget div.aok-relative span.review-text-content span')

I want to scrape the names, ratings, etc. from the reviews until the last page, where the Next Page button is disabled. I'm not quite sure what to do from here. I looked around, and many questions related to this were using .click() on Next Page, which I don't think is the answer I need/want.

The next page URL is stored in a list item with the class name a-last. So you could create a while loop that breaks once soup.find('li', class_='a-last') no longer yields a link (i.e. once the last page has been reached):

from selenium import webdriver
from bs4 import BeautifulSoup
import time

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

url='https://www.amazon.com/Keychain-R-fun-Protective-Accessories-Visible-Sky/product-reviews/B082W7DL1R/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews' #or https://www.amazon.com/s?k=maison+kitsune+airpod+pro+case
wd = webdriver.Chrome('chromedriver', options=options)

while True:
  wd.get(url)
  soup = BeautifulSoup(wd.page_source, "html.parser")
  #store data here

  try:
    # grab the "Next page" href inside li.a-last; on the last page the
    # element is either missing or disabled (no <a> tag), so this raises
    url = 'https://www.amazon.com/' + soup.find('li', class_='a-last').find('a', href=True)['href']
    time.sleep(2) #prevent ban
  except (AttributeError, TypeError):
    break
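
To tie the two parts together, here is a minimal sketch of the full loop, assuming the CSS selectors from the question still match Amazon's review markup. The CSV path, the Url column name, and the output file reviews.csv are placeholders taken from or added to the question:

from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import time

options = webdriver.ChromeOptions()
options.add_argument('--headless')

# selecting the column with ['Url'] replaces the deprecated squeeze=True
data = pd.read_csv(r'path to csv file', sep=',', usecols=['Url'])['Url']
wd = webdriver.Chrome('chromedriver', options=options)
rows = []

for url in data:
    while True:
        wd.get(url)
        soup = BeautifulSoup(wd.page_source, 'html.parser')

        # collect the four fields on the current page
        names = soup.select('div.celwidget div.aok-relative span.a-profile-name')
        ratings = soup.select('div.celwidget div.aok-relative span.a-icon-alt')
        titles = soup.select('div.celwidget div.aok-relative a.a-text-bold span')
        contents = soup.select('div.celwidget div.aok-relative span.review-text-content span')
        for n, r, t, c in zip(names, ratings, titles, contents):
            rows.append({'name': n.get_text(strip=True),
                         'rating': r.get_text(strip=True),
                         'title': t.get_text(strip=True),
                         'content': c.get_text(strip=True)})

        try:
            # follow the Next Page link; raises once it is gone or disabled
            url = 'https://www.amazon.com/' + soup.find('li', class_='a-last').find('a', href=True)['href']
            time.sleep(2)  # small delay between pages
        except (AttributeError, TypeError):
            break  # last page reached for this product

wd.quit()
pd.DataFrame(rows).to_csv('reviews.csv', index=False)

Note that zip() pairs the four lists positionally, so a review that is missing one of the fields would shift the alignment; for more robust extraction you could iterate over the review containers (div.celwidget) and pull the fields out of each one individually.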
