Unable to select a specific HTML element with Beautiful Soup

Question

I am trying to find a tbody element nested inside the element with the id all_totals (I have checked, and it is definitely there).

import requests
from bs4 import BeautifulSoup, Comment

url = 'https://www.basketball-reference.com/players/a/abdelal01.html'
data = requests.get(url)
html = BeautifulSoup(data.text, 'html.parser')

print(html.select('#all_totals tbody'))

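However, this Beautiful Soup code only returns an empty list. I think the problem may be that the element I want sits inside a giant HTML comment. I added some code to try to strip the comments from the parsed HTML: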

for comment in html.findAll(text=lambda text: isinstance(text, Comment)):
    comment.extract()
print(html.select('#all_totals')[0].prettify())

This does get rid of the comments. However, most (but not all) of the HTML nested inside the "all_totals" id disappears once I do this.

What am I doing wrong, and how do I correctly select the HTML I want?

Answer 1

You can use Selenium to find the tbody directly, since the table is loaded by JavaScript.

Try this:

from bs4 import BeautifulSoup, Comment
from selenium import webdriver

url = 'https://www.basketball-reference.com/players/a/abdelal01.html'
driver = webdriver.Firefox()
driver.get(url)
# page_source is the browser's view of the page, after scripts have had a chance to run
html = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

print(html.find('div', {'id': 'all_totals'}).find('tbody').prettify())

# Any remaining comments can be dropped here; the table is already a real element
for comment in html.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()
print(html.find('div', {'id': 'all_totals'}).prettify())

Answer 2

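You don't want to use extract, because you would be removing the very comment that contains the HTML you are interested in. See the following example, which instead reads the content out of the comment: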

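from io import StringIO

import pandas as pd

# `html` and `Comment` come from the code in the question
for comment in html.find_all(string=lambda text: isinstance(text, Comment)):
    if 'id="totals"' in comment:
        # Parse the table markup stored inside the comment
        table = pd.read_html(StringIO(comment))[0]
        print(table)
        break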
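
If you want the tbody itself rather than a DataFrame, the same idea works with Beautiful Soup alone: take the comment text and parse it a second time. The following is only a minimal sketch of that approach. SAMPLE_HTML is a made-up stand-in for the downloaded page (placeholder values, not real statistics), and find_commented_tbody is a hypothetical helper name, not part of bs4.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the downloaded page: the table sits inside an HTML comment.
SAMPLE_HTML = """
<div id="all_totals">
  <div class="section_heading">Totals</div>
  <!--
  <table id="totals">
    <thead><tr><th>Season</th><th>G</th></tr></thead>
    <tbody><tr><td>season-placeholder</td><td>0</td></tr></tbody>
  </table>
  -->
</div>
"""


def find_commented_tbody(page_html, wrapper_id="all_totals"):
    """Return the first <tbody> found in a comment under #wrapper_id, or None."""
    soup = BeautifulSoup(page_html, "html.parser")
    wrapper = soup.find(id=wrapper_id)
    if wrapper is None:
        return None
    # Comments are string nodes, so filter the wrapper's strings by type.
    for comment in wrapper.find_all(string=lambda s: isinstance(s, Comment)):
        # Parse the comment body as a small HTML document of its own.
        inner = BeautifulSoup(comment, "html.parser")
        tbody = inner.select_one("tbody")
        if tbody is not None:
            return tbody
    return None


tbody = find_commented_tbody(SAMPLE_HTML)
print(tbody.prettify())
```

With the real page you would pass data.text from the question's code in place of SAMPLE_HTML, assuming the site still ships the table inside a comment.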
