Extracting a data table with Python BeautifulSoup from a page that has multiple tables with the same name

Question

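I am very new to Python and BeautifulSoup. I wrote the code below to call the website https://www.baseball-reference.com/leagues/MLB-standings.shtml, with the goal of scraping the table at the bottom named "MLB Detailed Standings" and exporting it to a CSV file. My code successfully creates a CSV file, but it pulls the wrong table and is missing the first column, which contains the team names. It pulls the "East Division" table at the top of the page (without the first column) instead of my target, the full "MLB Detailed Standings" table at the bottom.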

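Is there a simple way to pull the MLB Detailed Standings table at the bottom? When I inspect the page, the ID of the specific table I am trying to extract is "expanded_standings_overall". Do I need to reference this in my code? Alternatively, any other guidance on reworking the code to pull the correct table would be greatly appreciated. Again, I am very new and doing my best to learn.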

import requests
import csv
import datetime
from bs4 import BeautifulSoup

# static urls
season = datetime.datetime.now().year
URL = "https://www.baseball-reference.com/leagues/MLB-standings.shtml".format(season=season)

# request the data
batting_html = requests.get(URL).text

def parse_array_from_fangraphs_html(input_html, out_file_name):
    """
    Take a HTML stats page from fangraphs and parse it out to a CSV file.
    """
    # parse input
    soup = BeautifulSoup(input_html, "lxml")
    table = soup.find("table", class_=["sortable,", "stats_table", "now_sortable"])

    # get headers
    headers_html = table.find("thead").find_all("th")
    headers = []
    for header in headers_html:
        headers.append(header.text)
    print(headers)

    # get rows
    rows = []
    rows_html = table.find_all("tr")
    for row in rows_html:
        row_data = []
        for cell in row.find_all("td"):
            row_data.append(cell.text)
        rows.append(row_data)

    # write to CSV file
    with open(out_file_name, "w") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(headers)
        writer.writerows(rows)

parse_array_from_fangraphs_html(batting_html, 'BBRefTest.csv')

Answer 1

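First, yes, it is better to reference the ID, since you would expect the developers to have made this ID unique to this table, whereas the classes are just style descriptors.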
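To illustrate the difference between matching on class and matching on ID, here is a minimal sketch. The markup in SAMPLE_HTML is made up for the example (only the ID "expanded_standings_overall" comes from the question), and it assumes the table is present in the parsed tree, which is not necessarily true of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: two tables share the same classes,
# but only one carries the id mentioned in the question.
SAMPLE_HTML = """
<table class="sortable stats_table" id="standings_E">
  <tr><td>East row</td></tr>
</table>
<table class="sortable stats_table" id="expanded_standings_overall">
  <tr><td>Overall row</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Matching on a shared class returns the first table in document order.
by_class = soup.find("table", class_="stats_table")

# Matching on the id returns the one table that carries it.
by_id = soup.find("table", id="expanded_standings_overall")

first_by_class = by_class.td.get_text(strip=True)
first_by_id = by_id.td.get_text(strip=True)
print(first_by_class)
print(first_by_id)
```

This is why the original find call, which matches on class, ends up with the first table on the page rather than the one at the bottom.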

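Now, the problem runs deeper. A quick look at the page source shows that the HTML defining the table is actually commented out, a few tags above it. I suspect a script "enables" this code on the client side (in your browser). requests.get only pulls the HTML and does not process any JavaScript, so it does not pick the table up (you can check the contents of batting_html to verify this).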
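One way to run that check is sketched below. SAMPLE_HTML is a made-up stand-in for batting_html that mimics a commented-out table; the real page's markup may differ. The point is that the ID can appear in the raw text while BeautifulSoup finds no such table element.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for batting_html: the table sits inside an HTML comment.
SAMPLE_HTML = """
<div id="all_expanded_standings_overall">
<!--
<table id="expanded_standings_overall">
  <tr><td>Overall row</td></tr>
</table>
-->
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The id is present in the raw text of the response...
id_in_raw_html = 'id="expanded_standings_overall"' in SAMPLE_HTML

# ...but the parser treats the comment as one opaque string, so no <table> element exists.
table_in_tree = soup.find("table", id="expanded_standings_overall")

print(id_in_raw_html)
print(table_in_tree)
```

If the same two checks on batting_html give a present ID and a missing element, the table is most likely hidden in a comment.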

A very quick and dirty fix is to grab the commented-out code and reparse it with BeautifulSoup:

from bs4 import Comment
...

# parse input
soup = BeautifulSoup(input_html, "lxml")
dynamic_content = soup.find("div", id="all_expanded_standings_overall")
comments = dynamic_content.find(string=lambda text: isinstance(text, Comment))
table = BeautifulSoup(comments, "lxml")

# get headers
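For a version that can be run without hitting the site, the following sketch applies the same steps to made-up markup. The column names and rows in SAMPLE_HTML are invented, and the choice to collect both th and td cells in each row is an assumption on my part, intended to keep a first column that is marked up as a row header.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup imitating a table that is shipped inside a comment.
SAMPLE_HTML = """
<div id="all_expanded_standings_overall">
<!--
<table id="expanded_standings_overall">
  <thead><tr><th>Rk</th><th>Tm</th><th>W</th></tr></thead>
  <tbody>
    <tr><th>1</th><td>Team A</td><td>10</td></tr>
    <tr><th>2</th><td>Team B</td><td>8</td></tr>
  </tbody>
</table>
-->
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Locate the wrapper div, then the first comment node inside it.
wrapper = soup.find("div", id="all_expanded_standings_overall")
comment = wrapper.find(string=lambda text: isinstance(text, Comment))

# Reparse the comment text as HTML and pick the table by id.
inner = BeautifulSoup(comment, "html.parser")
table = inner.find("table", id="expanded_standings_overall")

headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]

# Collect th and td cells so a row-header first column is not dropped (assumption).
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find("tbody").find_all("tr")
]

print(headers)
print(rows)
```

The sketch uses "html.parser" only to avoid an extra dependency; "lxml", as in the original code, should behave the same way here.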

By the way, you will want to specify utf8 encoding when writing the file...

with open(out_file_name, "w", encoding="utf8") as out_file:
    writer = csv.writer(out_file)
    ...
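A complete, runnable form of that write step might look like the sketch below. The headers, rows, and output path are placeholders for the example, and newline="" is an addition of mine: the csv module documentation recommends it so that the writer controls line endings itself.

```python
import csv
import os
import tempfile

# Placeholder data standing in for the scraped headers and rows.
headers = ["Rk", "Tm", "W"]
rows = [["1", "Team A", "10"], ["2", "Team B", "8"]]

# Write into a temporary directory so the example leaves the working directory alone.
out_path = os.path.join(tempfile.mkdtemp(), "standings_example.csv")

# encoding="utf8" handles non-ASCII names; newline="" avoids doubled line endings.
with open(out_path, "w", encoding="utf8", newline="") as out_file:
    writer = csv.writer(out_file)
    writer.writerow(headers)
    writer.writerows(rows)

# Read the file back to confirm the round trip.
with open(out_path, encoding="utf8", newline="") as in_file:
    read_back = list(csv.reader(in_file))

print(read_back)
```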

Now, this really is "quick and dirty"; before extending it to other pages, I would dig deeper into the HTML and the JavaScript to see what is really going on.

Solution 1
1 Accepted 2020-05-12 21:17:17
