Scraping a website in Python with the requests-html library: BeautifulSoup does not select all the elements

Question

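I'm trying to scrape https://edition.cnn.com/world with Python using the code snippet below. The problem is that when I parse the content with BeautifulSoup, I don't get all the data I want. I get 20 or so elements, but many more items should be selected.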

from requests_html import HTMLSession
from bs4 import BeautifulSoup as bs

url = "https://edition.cnn.com/world"
s = HTMLSession()
response = s.get(url)
response.html.render(wait=20)
soup = bs(response.content, 'html.parser')
results = soup.select('div.cd__wrapper')
print(len(results))  # returns 20 or so

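Basically I should use selenium, but since this is not the only site I need to scrape, that could become cumbersome. Apparently the site uses some JavaScript while loading, and that is what causes the problem. I'd like to know what the tweak is here, or whether this can be done without being forced to use selenium.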

Answer 1

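This is because whatever library or module you are using to pull the HTML tags may not be getting all of them. Unfortunately, I can't tell which without running your code.

1.) The tags are in an array, so you have to enumerate them

or

2.) There is a problem with the HTMLSession whose output is passed to beautifulsoup

Try using from urllib.request import urlopen as uReq

An example of how to use it: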

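from urllib.request import urlopen as uReq

xClient = uReq(YOUR_URL)   # YOUR_URL is a placeholder for the page address
raw_html = xClient.read()  # raw bytes of the response body
xClient.close()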

Make sure to close the connection after use.
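On option 2, one detail in the question's snippet may be worth checking: `render()` updates `response.html`, while `response.content` still holds the bytes from the initial download, so BeautifulSoup may be parsing the page as it looked before any JavaScript ran. A minimal sketch, assuming requests-html and bs4 are installed; the helper names `fetch_rendered_html` and `count_matches` are made up for illustration, and the `sleep` and `scrolldown` values are guesses to tune:

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, sleep=2, scrolldown=3):
    """Return the page markup after requests-html has executed its JavaScript."""
    # Imported here so count_matches can be used without requests-html installed.
    from requests_html import HTMLSession

    session = HTMLSession()
    try:
        response = session.get(url)
        # render() runs the page in headless Chromium and replaces response.html
        # with the result; response.content is left as originally downloaded.
        response.html.render(sleep=sleep, scrolldown=scrolldown)
        return response.html.html
    finally:
        session.close()


def count_matches(html, selector):
    """Count the elements in an HTML string that match a CSS selector."""
    return len(BeautifulSoup(html, "html.parser").select(selector))


# Usage (needs network access and downloads Chromium on first run):
# html = fetch_rendered_html("https://edition.cnn.com/world")
# print(count_matches(html, "div.cd__wrapper"))
```

Whether this recovers every card is not guaranteed: content that only appears after further scrolling or user interaction may still be missing.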

Answer 2

I'm afraid that finding a new tweak for every new page, instead of just using selenium to get the HTML, would become quite annoying.

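In principle, you could issue the separate requests to the respective content managers yourself and so avoid selenium, but you would have to make the same kind of adjustment for every other page, which takes time and is fundamentally unstable.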
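To make that trade-off concrete, here is a rough sketch of what calling a backing request directly could look like. Everything site-specific in it is an assumption: `FEED_URL` is a hypothetical address (the real one would have to be found in the browser's network tab), and the `{"cards": [{"headline": ...}]}` payload shape is invented for illustration.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint, not a real CNN URL.
FEED_URL = "https://example.com/api/world-feed.json"


def fetch_json(url, timeout=10.0):
    """Download a JSON document that the page would normally load via JavaScript."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_headlines(payload):
    """Collect headline strings from the assumed {"cards": [{"headline": ...}]} shape."""
    return [card["headline"] for card in payload.get("cards", []) if "headline" in card]


# Usage (performs a network request, so it is left commented out):
# print(extract_headlines(fetch_json(FEED_URL)))
```

Both the URL and the payload parsing would need rewriting for each site, which is the per-page maintenance cost described above.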

Just in case, you can still process the HTML with BeautifulSoup:

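from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Raw string so the backslashes in the Windows path are not treated as escapes
service = Service(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
driver = webdriver.Chrome(service=service)
driver.get('https://edition.cnn.com/world')

# page_source holds the DOM after the browser has run the page's JavaScript
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()  # close the browser once the HTML has been captured
print(len(soup.select('.cd__wrapper')))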

Output --> 116
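
Since the question mentions more than one site, the same idea can be wrapped in a reusable helper that waits for the target selector before reading `page_source`. This is only a sketch, assuming Selenium 4.6 or later (which can locate a driver itself, so no `Service` path is passed) and a recent local Chrome that accepts `--headless=new`; `html_when_present` and `select_all` are hypothetical names and the 15 second timeout is arbitrary:

```python
from bs4 import BeautifulSoup


def html_when_present(url, css_selector, timeout=15):
    """Load url in headless Chrome and return the DOM once css_selector matches."""
    # Imported here so select_all can be used without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # no visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until at least one matching element exists, else TimeoutException.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if the wait times out


def select_all(html, css_selector):
    """Return every element in an HTML string that matches a CSS selector."""
    return BeautifulSoup(html, "html.parser").select(css_selector)


# Usage (opens a browser and needs network access):
# html = html_when_present("https://edition.cnn.com/world", ".cd__wrapper")
# print(len(select_all(html, ".cd__wrapper")))
```

The wait only guarantees that the first matching element is present, so pages that keep appending items may still need scrolling or a longer pause before the count settles.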

2 solutions

Solution 1
0 2022-01-02 10:02:41

Solution 2
0 2022-01-02 10:53:49
