Extracting data from a specific page using Python BeautifulSoup

Question

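I am very new to Python and BeautifulSoup. I wrote the code below to request a page (https://www.fangraphs.com/depthcharts.aspx?position=Team), scrape the data in its table, and export it to a CSV file. I have been able to write code that extracts data from other tables on the site, but not from this particular one. It keeps returning: AttributeError: 'NoneType' object has no attribute 'find'. I have been racking my brain trying to figure out what I am doing wrong. Do I have the wrong "class" name? Again, I am very new and trying to teach myself. I have been learning through trial and error and by reverse engineering other people's code. This one has me stumped. Any guidance?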

import requests
import csv
import datetime
from bs4 import BeautifulSoup

# static urls
season = datetime.datetime.now().year
URL = "https://www.fangraphs.com/depthcharts.aspx?position=Team".format(season=season)

# request the data
batting_html = requests.get(URL).text

def parse_array_from_fangraphs_html(input_html, out_file_name):
    """
    Take a HTML stats page from fangraphs and parse it out to a CSV file.
    """
    # parse input
    soup = BeautifulSoup(input_html, "lxml")
    table = soup.find("table", {"class": "tablesoreder, depth_chart tablesorter tablesorter-default"})

    # get headers
    headers_html = table.find("thead").find_all("th")
    headers = []
    for header in headers_html:
        headers.append(header.text)
    print(headers)

    # get rows
    rows = []
    rows_html = table.find("tbody").find_all("tr")
    for row in rows_html:
        row_data = []
        for cell in row.find_all("td"):
            row_data.append(cell.text)
        rows.append(row_data)

    # write to CSV file
    with open(out_file_name, "w") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(headers)
        writer.writerows(rows)

parse_array_from_fangraphs_html(batting_html, 'Team War Totals.csv')

Answer 1

The traceback looks like this:

AttributeError                            Traceback (most recent call last)
<ipython-input-4-ee944e08f675> in <module>()
     41         writer.writerows(rows)
     42 
---> 43 parse_array_from_fangraphs_html(batting_html, 'Team War Totals.csv')

<ipython-input-4-ee944e08f675> in parse_array_from_fangraphs_html(input_html, out_file_name)
     20 
     21     # get headers
---> 22     headers_html = table.find("thead").find_all("th")
     23     headers = []
     24     for header in headers_html:

AttributeError: 'NoneType' object has no attribute 'find'

So yes, the problem is in this statement, where soup.find returns None:

table = soup.find("table", {"class": "tablesoreder, depth_chart tablesorter tablesorter-default"})

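You can change it so that the class attribute is split on whitespace into a list, as the other answer suggests. However, the script will then fail again, because the parsed table has no tbody.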
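To see both failures in isolation, here is a minimal sketch run against a small made-up HTML snippet. The markup, the placeholder values, and the variable names are assumptions for illustration, not the actual FanGraphs page. A single string containing several class names only matches when it equals the class attribute exactly, a list matches a tag carrying any one of the listed classes, and a CSS selector requires all listed classes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page; it has no <tbody> on purpose.
SAMPLE_HTML = """
<table class="depth_chart tablesorter">
  <thead><tr><th>Team</th><th>WAR</th></tr></thead>
  <tr><td>Team A</td><td>1.0</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A multi-class string must equal the class attribute exactly,
# so the string from the question finds nothing and find() returns None.
exact = soup.find(
    "table",
    {"class": "tablesoreder, depth_chart tablesorter tablesorter-default"},
)

# A list matches a tag that carries any one of the listed classes.
any_of = soup.find(
    "table",
    class_=["tablesoreder,", "depth_chart", "tablesorter", "tablesorter-default"],
)

# A CSS selector requires every listed class to be present.
all_of = soup.select_one("table.depth_chart.tablesorter")

# The sample table has no <tbody>, so this lookup also returns None.
body = any_of.find("tbody")

print(exact is None, any_of is not None, all_of is not None, body is None)
```

Because find() returns None instead of raising, the error only surfaces on the next attribute access, which is why the traceback points at the line after the lookup.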

The fixed script looks like this:

import requests
import csv
import datetime
from bs4 import BeautifulSoup

# static urls
season = datetime.datetime.now().year
URL = "https://www.fangraphs.com/depthcharts.aspx?position=Team".format(season=season)

# request the data
batting_html = requests.get(URL).text

def parse_array_from_fangraphs_html(input_html, out_file_name):
    """
    Take a HTML stats page from fangraphs and parse it out to a CSV file.
    """
    # parse input
    soup = BeautifulSoup(input_html, "lxml")
    table = soup.find("table", class_=["tablesoreder,", "depth_chart", "tablesorter", "tablesorter-default"])

    # get headers
    headers_html = table.find("thead").find_all("th")
    headers = []
    for header in headers_html:
        headers.append(header.text)
    print(headers)

    # get rows
    rows = []
    rows_html = table.find_all("tr")
    for row in rows_html:
        row_data = []
        for cell in row.find_all("td"):
            row_data.append(cell.text)
        rows.append(row_data)

    # write to CSV file (newline="" avoids blank lines between rows on Windows)
    with open(out_file_name, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(headers)
        writer.writerows(rows)

parse_array_from_fangraphs_html(batting_html, 'Team War Totals.csv')
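One detail that may be worth checking in a script like this: table.find_all("tr") also returns the header row, which contains no td cells and therefore yields an empty list. The sketch below, again on made-up markup with placeholder values, filters such rows out and writes the CSV to an in-memory buffer so the result can be inspected. The helper names table_to_rows and write_csv are hypothetical.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup; the team names and numbers are placeholders.
SAMPLE_HTML = """
<table class="depth_chart">
  <thead><tr><th>Team</th><th>WAR</th></tr></thead>
  <tr><td>Team A</td><td>1.0</td></tr>
  <tr><td>Team B</td><td>2.0</td></tr>
</table>
"""


def table_to_rows(table):
    """Return (headers, rows) for a parsed <table> tag."""
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # header rows hold only <th> cells, so they come back empty
            rows.append(cells)
    return headers, rows


def write_csv(headers, rows, out_file):
    """Write one header line followed by the data rows."""
    writer = csv.writer(out_file)
    writer.writerow(headers)
    writer.writerows(rows)


table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table", class_="depth_chart")
headers, rows = table_to_rows(table)

buffer = io.StringIO()
write_csv(headers, rows, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of a buffer, the csv module documentation recommends opening it with newline="".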

Answer 2

Replace your table statement with:

table = soup.find("table", attrs={"class": ["tablesoreder,", "depth_chart", "tablesorter", "tablesorter-default"]})

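Also, once you have fixed that, your header extraction will not work as expected, because the table's thead contains a tr, which in turn holds the header cells. So you have to replace that statement with: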

headers_html = table.find("thead").find("tr").find_all("th")
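As a rough illustration of what the extra find("tr") step changes, the sketch below compares the two lookups on a made-up table whose thead holds two rows. find_all searches all descendants by default, so calling it on the thead collects the th cells of every header row, while going through the first tr restricts the result to that row. The helper header_texts and its first_row_only parameter are hypothetical, and the None check is an added guard, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two header rows inside <thead>.
SAMPLE_HTML = """
<table class="depth_chart">
  <thead>
    <tr><th>Team</th><th>WAR</th></tr>
    <tr><th>(name)</th><th>(wins)</th></tr>
  </thead>
  <tr><td>Team A</td><td>1.0</td></tr>
</table>
"""


def header_texts(table, first_row_only=True):
    """Return header cell texts, failing loudly if the table lookup returned None."""
    if table is None:
        raise ValueError("table not found; check the class names passed to find()")
    thead = table.find("thead")
    if thead is None:
        return []
    # Either narrow the search to the first header row or search the whole thead.
    scope = thead.find("tr") if first_row_only else thead
    if scope is None:
        return []
    return [th.get_text(strip=True) for th in scope.find_all("th")]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", class_="depth_chart")

first_row = header_texts(table)
all_rows = header_texts(table, first_row_only=False)
print(first_row)
print(all_rows)
```

On a table with a single header row the two lookups return the same cells, so the explicit tr step mainly matters when the thead has more than one row.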
