Beautifulsoup/Selenium how to scrape website until next page is disabled?

So I have a list of urls (called "data") which contains urls like https://www.amazon.com/Airpods-Fashion-Protective-Accessories-Silicone/product-reviews/B08YD8JLNQ/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews

https://www.amazon.com/Keychain-R-fun-Protective-Accessories-Visible-Sky/product-reviews/B082W7DL1R/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews

Some of the urls don't have a "Next page" icon and some do. My code so far looks like this:

from bs4 import BeautifulSoup
import requests
import csv
import os
import pandas as pd
from selenium import webdriver
from selenium.common import exceptions
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import StaleElementReferenceException


# read the url column from the csv; the squeeze= keyword was removed from
# read_csv in newer pandas, so squeeze the single-column frame into a Series
data = pd.read_csv(r'path to csv file', sep=',', usecols=['Url']).squeeze('columns')
rows = []

for url in data:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    # find_all() expects tag names/attributes (e.g. class_="a-profile-name"),
    # not CSS selectors; use select() to query with CSS selectors instead
    names = soup.select('div.celwidget div.aok-relative span.a-profile-name')
    rating = soup.select('div.celwidget div.aok-relative span.a-icon-alt')
    title = soup.select('div.celwidget div.aok-relative a.a-text-bold span')
    content = soup.select('div.celwidget div.aok-relative span.review-text-content span')

I want to scrape the names, ratings and so on from the reviews until the last page, where the Next Page button will be disabled. I'm not really sure what to do from here. I've looked around, and many of the related questions use .click() on the Next Page button, which I don't think is the answer I need/want.
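For reference, the .click()-based pattern those other questions use looks roughly like this (a sketch only, assuming Selenium 4's By API and a wd driver set up as in the answer below; it is not the approach taken here):

from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

while True:
    # ... scrape the current page from wd.page_source here ...
    try:
        # the "a-last" list item wraps the Next Page link; once the last page
        # is reached the anchor disappears and find_element() raises
        wd.find_element(By.CSS_SELECTOR, 'li.a-last a').click()
    except NoSuchElementException:
        break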

The next page url is stored in a list item with the class name a-last. So you can create a while loop that breaks once soup.find('li', class_='a-last') no longer returns anything (i.e. once the last page has been reached):

from selenium import webdriver
from bs4 import BeautifulSoup
import time

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

url='https://www.amazon.com/Keychain-R-fun-Protective-Accessories-Visible-Sky/product-reviews/B082W7DL1R/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews' #or https://www.amazon.com/s?k=maison+kitsune+airpod+pro+case
wd = webdriver.Chrome(options=options) # Selenium 4.10+ no longer accepts the driver path as a positional argument

while True:
  wd.get(url)
  soup = BeautifulSoup(wd.page_source, "html.parser")
  #store data here

  try:
    # the "a-last" list item holds the relative link to the next page; on the
    # last page there is no link, so the attribute lookup raises and we stop
    url = 'https://www.amazon.com/' + soup.find('li', class_='a-last').find('a', href=True)['href']
    time.sleep(2) #prevent ban
  except AttributeError:
    break
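Putting the two snippets together, the question's field extraction slots in at the #store data here comment. A minimal sketch (the CSS selectors are the ones from the question and may need adjusting to Amazon's current markup; the row collection and CSV output are assumptions added for illustration, not part of the original answer):

import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
wd = webdriver.Chrome(options=options)

url = 'https://www.amazon.com/Keychain-R-fun-Protective-Accessories-Visible-Sky/product-reviews/B082W7DL1R/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews'
rows = []

while True:
  wd.get(url)
  soup = BeautifulSoup(wd.page_source, 'html.parser')

  # each review sits in its own div.celwidget card, so the per-field lists
  # line up index-by-index and can be zipped into rows
  names = soup.select('div.celwidget div.aok-relative span.a-profile-name')
  ratings = soup.select('div.celwidget div.aok-relative span.a-icon-alt')
  titles = soup.select('div.celwidget div.aok-relative a.a-text-bold span')
  for name, rating, title in zip(names, ratings, titles):
    rows.append({'name': name.get_text(strip=True),
                 'rating': rating.get_text(strip=True),
                 'title': title.get_text(strip=True)})

  try:
    url = 'https://www.amazon.com/' + soup.find('li', class_='a-last').find('a', href=True)['href']
    time.sleep(2)
  except AttributeError:
    break

pd.DataFrame(rows).to_csv('reviews.csv', index=False)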

