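Beautiful Soup Python findAll returning empty list
I am trying to scrape an Amazon Alexa skill: https://www.amazon.com/PayPal/dp/B075764QCX/ref=sr_1_1?dchild=1&keywords=paypal&qid=1604026451&s=digital-skills&sr=1-1
Right now I am only trying to get the name of the skill (PayPal), but for some reason it returns an empty list. I looked at the page with Inspect Element and I know it should give me the name, so I am not sure what is going wrong. My code is below: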
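from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# skill_url and request_headers are defined elsewhere in the script
request = Request(skill_url, headers=request_headers)
response = urlopen(request)
response = response.read()
html = response.decode()
soup = BeautifulSoup(html, 'html.parser')
name = soup.find_all("h1", {"class" : "a2s-title-content"})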
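The page content is loaded with JavaScript, so you can't scrape it with BeautifulSoup alone. You have to use another module (such as selenium) to simulate the JavaScript execution.
Here is an example: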
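from bs4 import BeautifulSoup as soup
from selenium import webdriver

url = 'YOUR URL'
driver = webdriver.Firefox()  # opens a Firefox browser session
driver.get(url)  # the browser loads the page and runs its JavaScript
page = driver.page_source  # HTML as it stands after the page has loaded
driver.quit()  # close the browser once the HTML has been captured
page_soup = soup(page, 'html.parser')
containers = page_soup.find_all("h1", {"class" : "a2s-title-content"})
print(containers)
print(len(containers))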
You can also use chrome-driver or edge-driver. See here.
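Before switching to a browser, it can help to check what the server actually returned. The following is a minimal sketch, not a definitive test: `diagnose_html` is a hypothetical helper, and the `block_markers` strings are assumptions about what a bot-check page might contain rather than a documented Amazon format. It only does substring checks on the raw HTML.

```python
def diagnose_html(html, class_name="a2s-title-content",
                  block_markers=("captcha", "robot check")):
    """Roughly classify a raw HTML response before parsing it.

    block_markers are assumed substrings of a bot-check page; adjust
    them to whatever the server actually sends back to you.
    """
    if class_name in html:
        return "present"   # the element is already in the static HTML
    lowered = html.lower()
    if any(marker in lowered for marker in block_markers):
        return "blocked"   # looks like a captcha or bot-check page
    return "missing"       # may be added later by JavaScript


sample = '<h1 class="a2s-title-content">PayPal</h1>'
print(diagnose_html(sample))
```

If this reports "present" for the response you downloaded, the selector should match and the problem lies elsewhere; "blocked" suggests the request itself was rejected; "missing" is the case where rendering the page in a browser is worth trying.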
Try setting the User-Agent and Accept-Language HTTP headers to prevent the server from sending you a captcha page:
import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0',
'Accept-Language': 'en-US,en;q=0.5'
}
url = 'https://www.amazon.com/PayPal/dp/B075764QCX/ref=sr_1_1?dchild=1&keywords=paypal&qid=1604026451&s=digital-skills&sr=1-1'
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'lxml')
name = soup.find("h1", {"class" : "a2s-title-content"})
print(name.get_text(strip=True))
Prints:
PayPal
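The same two headers can also be attached to the `urllib` request used in the question. This is only a sketch under assumptions: `build_request` is a hypothetical helper name, the header values are copied from the answer above, and whether they are enough to avoid a captcha page is not guaranteed. No network call is made here; the request is only constructed.

```python
from urllib.request import Request


def build_request(url):
    """Build a urllib Request carrying browser-like headers."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0',
        'Accept-Language': 'en-US,en;q=0.5',
    }
    return Request(url, headers=headers)


url = 'https://www.amazon.com/PayPal/dp/B075764QCX/ref=sr_1_1?dchild=1&keywords=paypal&qid=1604026451&s=digital-skills&sr=1-1'
req = build_request(url)
# Request stores header names capitalized as "User-agent", "Accept-language"
print(req.get_header('User-agent'))
```

Passing `req` to `urlopen` and decoding the result as in the question would then send these headers with the request.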