Beautiful Soup Python findAll returns an empty list

Question

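I am trying to scrape an Amazon Alexa skill: https://www.amazon.com/PayPal/dp/B075764QCX/ref=sr_1_1?dchild=1&keywords=paypal&qid=1604026451&s=digital-skills&sr=1-1

For now, I only want to get the name of the skill (PayPal), but for some reason it returns an empty list. I have looked at the page with Inspect Element and I know it should give me the name, so I am not sure what is going wrong. My code is below: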

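from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# skill_url and request_headers are defined elsewhere in the asker's script
request = Request(skill_url, headers=request_headers)
response = urlopen(request)
response = response.read()
html = response.decode()
soup = BeautifulSoup(html, 'html.parser')

name = soup.find_all("h1", {"class" : "a2s-title-content"})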

Answer 1

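The page content is loaded with JavaScript, so you cannot scrape it with BeautifulSoup alone. You have to use another module, such as selenium, to simulate JavaScript execution.

Here is an example: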

from bs4 import BeautifulSoup as soup
from selenium import webdriver

url='YOUR URL'

# Open the page in a real browser so that its JavaScript runs
driver = webdriver.Firefox()
driver.get(url)

# Take the rendered HTML, then close the browser
page = driver.page_source
driver.quit()
page_soup = soup(page,'html.parser')

containers = page_soup.find_all("h1", {"class" : "a2s-title-content"})
print(containers)
print(len(containers))

You can also use chrome-driver or edge-driver instead of the Firefox driver.
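Whichever way the HTML is fetched, it can help to look at what was actually received when find_all comes back empty. The following is only a minimal sketch: the helper names extract_skill_title and page_title are made up for illustration, and the two HTML strings are hypothetical samples, not real Amazon responses.

```python
from bs4 import BeautifulSoup


def extract_skill_title(html):
    """Return the text of the skill title heading, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1", class_="a2s-title-content")
    return heading.get_text(strip=True) if heading else None


def page_title(html):
    """Return the <title> text, which often reveals an error or captcha page."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None


# Hypothetical sample documents (assumptions for illustration only).
skill_page = (
    "<html><head><title>Skill</title></head><body>"
    '<h1 class="a2s-title-content"> PayPal </h1></body></html>'
)
blocked_page = (
    "<html><head><title>Robot Check</title></head>"
    "<body><form></form></body></html>"
)

for html in (skill_page, blocked_page):
    title = extract_skill_title(html)
    if title is None:
        # The heading is missing, so report which page was received instead.
        print("heading not found; page title is:", page_title(html))
    else:
        print("skill title:", title)
```

If the heading is missing and the page title looks like a challenge or error page, the problem is probably the response itself and not the selector.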

Answer 2

Try setting the User-Agent and Accept-Language HTTP headers to prevent the server from sending you a captcha page:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0',
    'Accept-Language': 'en-US,en;q=0.5'
}

url = 'https://www.amazon.com/PayPal/dp/B075764QCX/ref=sr_1_1?dchild=1&keywords=paypal&qid=1604026451&s=digital-skills&sr=1-1'

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'lxml')
name = soup.find("h1", {"class" : "a2s-title-content"})
print(name.get_text(strip=True))

Prints:

PayPal
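The same two headers should also work with the urllib-based code from the question, where request_headers was not shown. This is a rough sketch and not the asker's actual setup: build_request and BROWSER_HEADERS are hypothetical names, and the example.com URL is a placeholder. It only builds the request and does not send it.

```python
from urllib.request import Request

# Header values copied from the answer above.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Build (but do not send) a urllib Request carrying browser-like headers."""
    return Request(url, headers=dict(headers))


# Placeholder URL (assumption); substitute the skill page URL.
request = build_request("https://example.com/skill")
print(request.full_url)
print(request.header_items())
```

The resulting object can be passed to urlopen(request) and the response decoded and parsed as in the question. Note that urllib stores header names capitalized, so they read back as "User-agent" and "Accept-language".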

Answer 1
1 2020-10-30 04:45:30

Answer 2
0 Accepted 2020-10-30 09:14:48
