Extracting a data table with Python BeautifulSoup from a specific page that has multiple tables with the same name

Question

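I am very new to Python and BeautifulSoup. I wrote the code below to request the page https://www.baseball-reference.com/leagues/MLB-standings.shtml, with the goal of scraping the table at the bottom called "MLB Detailed Standings" and exporting it to a CSV file. My code successfully creates a CSV file, but it pulls the wrong table and is missing the first column, which contains the team names. It pulls the "East Division" table at the top of the page (minus the first column) instead of my target, the full "MLB Detailed Standings" table at the bottom.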

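Is there a simple way to pull the MLB Detailed Standings table at the bottom? When I inspect the page, the ID of the specific table I am trying to extract is "expanded_standings_overall". Do I need to reference this in my code? Any other guidance on reworking the code to pull the correct table would be greatly appreciated. Again, I am very new and doing my best to learn.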

import requests
import csv
import datetime
from bs4 import BeautifulSoup

# static urls
season = datetime.datetime.now().year
URL = "https://www.baseball-reference.com/leagues/MLB-standings.shtml".format(season=season)

# request the data
batting_html = requests.get(URL).text

def parse_array_from_fangraphs_html(input_html, out_file_name):
    """
    Take a HTML stats page from fangraphs and parse it out to a CSV file.
    """
    # parse input
    soup = BeautifulSoup(input_html, "lxml")
    table = soup.find("table", class_=["sortable,", "stats_table", "now_sortable"])

    # get headers
    headers_html = table.find("thead").find_all("th")
    headers = []
    for header in headers_html:
        headers.append(header.text)
    print(headers)

    # get rows
    rows = []
    rows_html = table.find_all("tr")
    for row in rows_html:
        row_data = []
        for cell in row.find_all("td"):
            row_data.append(cell.text)
        rows.append(row_data)

    # write to CSV file
    with open(out_file_name, "w") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(headers)
        writer.writerows(rows)

parse_array_from_fangraphs_html(batting_html, 'BBRefTest.csv')

Answer 1

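First, yes, it is better to reference the ID: you would expect the developers to have made this ID unique to this table, whereas the classes are just style descriptors.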
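To illustrate the difference between the two lookups, here is a minimal sketch. SAMPLE_HTML is a hypothetical stand-in for the real page (the id "standings_E" is an assumption, not taken from the site), and the built-in "html.parser" is used instead of "lxml" only so the snippet runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a page with two tables sharing the same classes.
SAMPLE_HTML = """
<table class="sortable stats_table" id="standings_E">
  <tr><td>East</td></tr>
</table>
<table class="sortable stats_table" id="expanded_standings_overall">
  <tr><td>Overall</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A class lookup returns the first matching table in document order.
by_class = soup.find("table", class_="stats_table")

# An id lookup targets the one table that carries that id.
by_id = soup.find("table", id="expanded_standings_overall")

print(by_class["id"])
print(by_id.td.get_text())
```

With several tables sharing the same classes, find returns whichever comes first, which is consistent with the top-of-page table ending up in the CSV.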

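Now, the problem runs deeper. A quick look at the page source shows that the HTML defining the table is actually commented out, with the comment opening a few tags above the table. I suspect a script "enables" this markup on the client side (in your browser). requests.get only pulls the HTML without running any JavaScript, so it never sees the table as live markup (you can inspect the contents of batting_html to verify this).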
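One way to run that check is sketched below. SAMPLE_HTML is a hypothetical stand-in for batting_html, shaped the way the answer describes the page (a wrapper div whose table markup sits inside an HTML comment); on the real page you would pass batting_html instead.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the raw HTML that requests.get(URL).text returns.
SAMPLE_HTML = """
<div id="all_expanded_standings_overall">
<!--
<table id="expanded_standings_overall">
  <tr><td>Overall</td></tr>
</table>
-->
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Commented-out markup is parsed as a Comment node, not as tags,
# so a normal tag lookup finds nothing.
table = soup.find("table", id="expanded_standings_overall")

# Collect the comment nodes whose text mentions the table id.
hits = [
    c for c in soup.find_all(string=lambda s: isinstance(s, Comment))
    if "expanded_standings_overall" in c
]

print(table is None)
print(len(hits))
```

If the tag lookup returns None while a comment mentions the id, the table is present in the download but hidden inside a comment.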

A very quick and dirty fix is to grab the commented-out code and re-parse it with BeautifulSoup:

from bs4 import Comment
...

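# parse input
soup = BeautifulSoup(input_html, "lxml")
# the wrapper div holds the table markup inside an HTML comment
dynamic_content = soup.find("div", id="all_expanded_standings_overall")
# grab the first comment node inside the wrapper
comments = dynamic_content.find(string=lambda text: isinstance(text, Comment))
# re-parse the comment text so the table becomes searchable
table = BeautifulSoup(comments, "lxml")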

# get headers
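Putting the pieces together, a self-contained sketch of the comment-based extraction might look like the following. The helper name extract_commented_table, the sample HTML, and its team data are all hypothetical. The sketch also assumes that the first cell of each body row may be a th rather than a td, which would explain a missing first column when only td cells are collected; that is an assumption about the markup, not something confirmed here.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample shaped like the commented-out table described above.
SAMPLE_HTML = """
<div id="all_expanded_standings_overall">
<!--
<table id="expanded_standings_overall">
  <thead><tr><th>Rk</th><th>Tm</th><th>W</th></tr></thead>
  <tbody>
    <tr><th>1</th><td>Team A</td><td>10</td></tr>
    <tr><th>2</th><td>Team B</td><td>8</td></tr>
  </tbody>
</table>
-->
</div>
"""


def extract_commented_table(input_html, wrapper_id, table_id):
    """Return (headers, rows) for a table hidden inside an HTML comment."""
    soup = BeautifulSoup(input_html, "html.parser")
    wrapper = soup.find("div", id=wrapper_id)
    if wrapper is None:
        raise ValueError(f"no div with id {wrapper_id!r}")

    # First comment node inside the wrapper div.
    comment = wrapper.find(string=lambda s: isinstance(s, Comment))
    if comment is None:
        raise ValueError("wrapper div contains no comment")

    # Re-parse the comment text so its markup becomes real tags.
    table = BeautifulSoup(str(comment), "html.parser").find("table", id=table_id)
    if table is None:
        raise ValueError(f"no table with id {table_id!r} inside the comment")

    headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]

    # Collect th and td cells together so a leading th cell is not dropped.
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find("tbody").find_all("tr")
    ]
    return headers, rows


headers, rows = extract_commented_table(
    SAMPLE_HTML, "all_expanded_standings_overall", "expanded_standings_overall"
)
print(headers)
print(rows)
```

Restricting the row loop to tbody also keeps the header row out of the data rows, which table.find_all("tr") on the whole table would include.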

By the way, you will want to specify utf8 encoding when writing the file...

with open(out_file_name, "w", encoding="utf8") as out_file:
    writer = csv.writer(out_file)
    ...
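A runnable version of that write step is sketched below with hypothetical headers and rows. The newline="" argument is an addition that is not in the answer above; the csv module documentation recommends it for files passed to csv.writer so that no extra blank lines appear on some platforms.

```python
import csv
import os
import tempfile

# Hypothetical data; the non-ASCII character exercises the encoding.
headers = ["Rk", "Tm", "W"]
rows = [["1", "Team Ä", "10"]]

# Write into a temporary directory so the example leaves no stray files behind.
out_file_name = os.path.join(tempfile.mkdtemp(), "example.csv")

with open(out_file_name, "w", encoding="utf8", newline="") as out_file:
    writer = csv.writer(out_file)
    writer.writerow(headers)
    writer.writerows(rows)

# Read the file back with the same encoding to confirm the round trip.
with open(out_file_name, encoding="utf8", newline="") as in_file:
    read_back = list(csv.reader(in_file))

print(read_back)
```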

Now, this really is "quick and dirty". I would dig deeper into the HTML and into what the JavaScript is actually doing before extending this to other pages.

Solution 1
1 accepted, 2020-05-12 21:17:17
