
Beautifulsoup/Selenium how to scrape website until next page is disabled?

So I have a list of urls (called "data") that contains urls like https://www.amazon.com/Airpods-Fashion-Protective-Accessories-Silicone/product-reviews/B08YD8JLNQ/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews

and

https://www.amazon.com/Keychain-R-fun-Protective-Accessories-Visible-Sky/product-reviews/B082W7DL1R/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews

Some urls do not have the "Next Page" icon and some do. So far my code is something like this:

from bs4 import BeautifulSoup
import requests
import csv
import os
import pandas as pd
from selenium import webdriver
from selenium.common import exceptions
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import StaleElementReferenceException


data = pd.read_csv(r'path to csv file', sep=',', usecols=['Url'], squeeze=True)
rows = []

for url in data:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    #names = soup.find_all('span', class="a-profile-name")
    # div.celwidget div.aok-relative span.a-profile-name
    #names = soup.find_all('div.celwidget div.aok-relative span', class= "a-profile-name")
    # find_all() does not accept CSS selector strings; use select() for those
    names = soup.select('div.celwidget div.aok-relative span.a-profile-name')
    rating = soup.select('div.celwidget div.aok-relative span.a-icon-alt')
    title = soup.select('div.celwidget div.aok-relative a.a-text-bold span')
    content = soup.select('div.celwidget div.aok-relative span.review-text-content span')

I want to scrape the names, ratings, etc. from the reviews until the last page, where the Next Page button would be disabled. I'm not quite sure what to do from here; I looked around, and many questions related to this used .click() on Next Page, which I don't think is the answer I need/want.

The next page url is stored in a list item with class name a-last. So you could create a while loop that breaks once soup.find('li', class_='a-last') no longer returns anything usable (i.e. once the last page has been reached):

from selenium import webdriver
from bs4 import BeautifulSoup
import time

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

url='https://www.amazon.com/Keychain-R-fun-Protective-Accessories-Visible-Sky/product-reviews/B082W7DL1R/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews' #or https://www.amazon.com/s?k=maison+kitsune+airpod+pro+case
wd = webdriver.Chrome('chromedriver',options=options)

while True:
  wd.get(url)
  soup = BeautifulSoup(wd.page_source, "html.parser")
  #store data here

  try:
    # <li class="a-last"> holds the "next page" link; on the last page the item
    # is disabled and has no <a> tag, so the lookup raises and the loop ends
    url = 'https://www.amazon.com/' + soup.find('li', class_='a-last').find('a', href=True)['href']
    time.sleep(2) #prevent ban
  except (AttributeError, TypeError):
    break
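
To combine this with the loop over the csv from the question, the #store data here placeholder can be filled in with the question's selectors. Below is a minimal sketch, not a tested crawler: it assumes data is the pandas Series of urls read from the csv above, reuses the question's CSS selectors as-is, and Amazon's review markup (including the a-last pagination class) may of course change.

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

data = pd.read_csv(r'path to csv file', sep=',', usecols=['Url'], squeeze=True)
wd = webdriver.Chrome('chromedriver', options=options)
rows = []

for url in data:
    while True:
        wd.get(url)
        soup = BeautifulSoup(wd.page_source, 'html.parser')

        # selectors copied from the question; adjust if the markup differs
        names = [s.get_text(strip=True) for s in soup.select('div.celwidget div.aok-relative span.a-profile-name')]
        ratings = [s.get_text(strip=True) for s in soup.select('div.celwidget div.aok-relative span.a-icon-alt')]
        titles = [s.get_text(strip=True) for s in soup.select('div.celwidget div.aok-relative a.a-text-bold span')]
        rows.extend(zip(names, ratings, titles))  # assumes one of each per review

        try:
            url = 'https://www.amazon.com/' + soup.find('li', class_='a-last').find('a', href=True)['href']
            time.sleep(2)  # prevent ban
        except (AttributeError, TypeError):
            break  # no usable "next page" link: last page reached

wd.quit()
df_out = pd.DataFrame(rows, columns=['name', 'rating', 'title'])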
