Beautiful Soup does not parse the full HTML of a website

Question

This is part of the code I am working on to scrape data from a website.

import requests
import bs4

page = 'https://www.pro-football-reference.com/boxscores/200409090nwe.htm'
sub_data = requests.get(page).text
sub_soup = bs4.BeautifulSoup(sub_data, "html.parser")

for toss in sub_soup.findAll('table', {'class':'suppress_all sortable stats_table now_sortable'}):
    print(toss)

In case that line was wrong, I also tried a more general search to locate the data I am looking for:

for toss in sub_soup.findAll('td', {'class':'center'}):
    print(toss)

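I am trying to extract one line of text from the "Game Info" table: who won the coin toss (the "Won Toss" row). In this case the answer should be "Patriots". For some reason, the entire section of HTML for the Game Info table is missing from sub_soup. I also tried a different parser, html5lib. The section is referenced in sub_soup (you can see it by inspecting the lines on the site), but not as HTML elements: the actual HTML tags seen on the website are missing. Can anyone help?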

Answer 1

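I love working with sports data, and I have come across this before on the Sports Reference sites. The tables are rendered after the page loads, so in most cases you would need something like Selenium to let the page render before pulling the HTML source. That is not necessary here, though, because most of the tables are inside HTML comments in the initial response. You can use BeautifulSoup to pull out the comments and then search them for <table> tags.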
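To make the comment trick concrete, here is a minimal sketch that runs without a network request. The sample_html string and its id values are made up for illustration and are not copied from the real page; only the BeautifulSoup calls are the same as in the approach described above.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the page source: the table sits inside an HTML comment.
sample_html = """
<div id="all_game_info">
  <!--
  <table id="game_info">
    <tr><th>Won Toss</th><td>Patriots</td></tr>
    <tr><th>Roof</th><td>outdoors</td></tr>
  </table>
  -->
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# A normal search finds nothing, because the table markup is only
# text inside a Comment node at this point.
table_before = soup.find("table")

rows = {}
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if "<table" not in comment:
        continue
    # Parse the comment body as HTML in its own right.
    inner = BeautifulSoup(comment, "html.parser")
    for tr in inner.find_all("tr"):
        header, cell = tr.find("th"), tr.find("td")
        if header and cell:
            rows[header.get_text(strip=True)] = cell.get_text(strip=True)

print(table_before)      # None
print(rows["Won Toss"])  # Patriots
```

Parsing each comment a second time is what turns the commented-out markup into real tags that find and find_all can see.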

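I also prefer to use pandas whenever I see or need to pull <table> tags. pandas.read_html relies on an HTML parser under the hood (lxml, or BeautifulSoup with html5lib) and does most of the work for you. All you need to do is manipulate the resulting table if necessary.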

This creates a list of tables; just pull out the one you want, which is at index position 1:

Code:

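from io import StringIO

import requests
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd


url = 'https://www.pro-football-reference.com/boxscores/200409090nwe.htm'
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')
# Collect every HTML comment node in the initial response
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

tables = []
for each in comments:
    if 'table' in each:
        try:
            # Wrap in StringIO: newer pandas versions deprecate passing literal HTML
            tables.append(pd.read_html(StringIO(each))[0])
        except ValueError:
            # read_html raises ValueError when the comment contains no parsable table
            continue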

Output:

print(tables[1])
            0                                                  1
0   Game Info                                          Game Info
1    Won Toss                                           Patriots
2        Roof                                           outdoors
3     Surface                                              grass
4     Weather  73 degrees, relative humidity 99%, wind 19 mph...
5  Vegas Line                          New England Patriots -3.0
6  Over/Under                                        44.5 (over)
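
To get just the coin toss winner out of that table, one option is to look the row up by its label. The sketch below is only an illustration: game_info is a hand-copied subset of the output above standing in for tables[1], and lookup is a hypothetical helper, not part of pandas. It assumes the columns are labeled 0 and 1, as in the printed output.

```python
import pandas as pd

# Hand-copied subset of the table printed above, standing in for tables[1]
# so the sketch runs without a network request.
game_info = pd.DataFrame(
    [
        ["Game Info", "Game Info"],
        ["Won Toss", "Patriots"],
        ["Roof", "outdoors"],
        ["Surface", "grass"],
    ]
)


def lookup(table, label):
    """Return the value in column 1 for the first row whose column 0 equals label."""
    matches = table.loc[table[0] == label, 1]
    if matches.empty:
        raise KeyError(label)
    return matches.iloc[0]


won_toss = lookup(game_info, "Won Toss")
print(won_toss)  # Patriots
```

Looking the row up by label rather than by position keeps the code working if the site adds or reorders rows in the Game Info table.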
