How do I parse out specific dataframes when using pandas to web scrape data from a page with multiple dataframes?

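I'm working on a project that scrapes data from https://www.pro-football-reference.com/years/2021/opp.htm. When you visit this page, you'll see that it contains multiple tables. When I run the following code, I get the first table without any problem: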

import pandas as pd
year=2021
defense_url = 'https://www.pro-football-reference.com/years/{}/opp.htm'.format(year)
df5 = pd.read_html(defense_url, header=1)[0]
df5.head()

However, when I try to get data from the other tables by changing the index, I get either a table with no headers or an error. For example, df5 = pd.read_html(defense_url, header=1)[1] creates a dataframe with no headers.

In addition, df5 = pd.read_html(defense_url, header=1)[2] raises an IndexError (shown below):

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Input In [46], in <cell line: 1>()
----> 1 df5 = pd.read_html(defense_url, header=1)[2]
      2 df5.head()

IndexError: list index out of range

Does anyone know what I might be doing wrong here?

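The IndexError occurs because pd.read_html finds only 2 <table> tags in the HTML response. So when you ask for the dataframe at index position [2] (the third table), there is no such entry in the list of dataframes.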

The other tables are actually present in the HTML response, but inside HTML comments. So there are two ways to get them:

  1. Use BeautifulSoup to extract the comments and parse them.
  2. Simply remove/replace the strings that mark the HTML as a comment.
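Before choosing either option, it can help to confirm that the missing tables really are inside comments. The following is only a minimal sketch using the standard library's html.parser on a made-up HTML snippet; the TableCounter class and the sample markup (including the ids "a" and "b") are illustrative assumptions, not part of the original answer or of the real page.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count <table> tags in the live markup and inside HTML comments."""

    def __init__(self):
        super().__init__()
        self.visible = 0
        self.commented = 0

    def handle_starttag(self, tag, attrs):
        # Called only for tags that are part of the live document
        if tag == "table":
            self.visible += 1

    def handle_comment(self, data):
        # Comment bodies arrive as raw text, so count the tags textually
        self.commented += data.lower().count("<table")


# Made-up page: one live table and one table wrapped in a comment
sample_html = """
<div><table id="a"><tr><td>1</td></tr></table></div>
<div><!--
<table id="b"><tr><td>2</td></tr></table>
--></div>
"""

counter = TableCounter()
counter.feed(sample_html)
print(counter.visible, counter.commented)
```

To check the real page, the same counter could be fed the text returned by requests.get(defense_url); a non-zero commented count would suggest that the remaining tables are hidden in comments.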

Here is the code for both:

1. bs4 Comments:

import requests
from io import StringIO
from bs4 import BeautifulSoup, Comment
import pandas as pd

year = 2021
defense_url = 'https://www.pro-football-reference.com/years/{}/opp.htm'.format(year)

# Tables that are not commented out
df5 = pd.read_html(defense_url, header=1)

result = requests.get(defense_url).text
data = BeautifulSoup(result, 'html.parser')

# Collect every HTML comment node in the page
comments = data.find_all(string=lambda text: isinstance(text, Comment))

for each in comments:
    if 'table' in str(each):
        try:
            # Use the second header row when the table has a two-level header
            level = 0
            if isinstance(pd.read_html(StringIO(str(each)))[0].columns, pd.MultiIndex):
                level = 1
            df5.append(pd.read_html(StringIO(str(each)), header=level)[0])
        except Exception:
            # Skip comments that mention 'table' but do not parse as one
            continue

2. Remove/replace the HTML strings that denote comments:

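import requests
from io import StringIO
import pandas as pd

year = 2021
defense_url = 'https://www.pro-football-reference.com/years/{}/opp.htm'.format(year)

# Get the data and strip the comment markers so every table becomes visible
response = requests.get(defense_url)
html = response.text.replace('<!--', '').replace('-->', '')

# Wrap the text in StringIO; newer pandas deprecates passing literal HTML strings
df5 = pd.read_html(StringIO(html))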

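The only difference between the two is that with the second option you need one more step: find which dataframes have MultiIndex columns and handle them if needed.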
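One way that extra step might look is sketched below. It assumes that keeping only the last header level is what you want, which mirrors the effect of header=1 in option 1 for a two-row header. The helper name flatten_columns and the small sample frame are hypothetical and stand in for a scraped table.

```python
import pandas as pd


def flatten_columns(df):
    """Return a copy whose columns keep only the last level of a MultiIndex."""
    if isinstance(df.columns, pd.MultiIndex):
        df = df.copy()
        df.columns = df.columns.get_level_values(-1)
    return df


# Hypothetical stand-in for one scraped table with a two-row header
sample = pd.DataFrame(
    [[1, "Buffalo Bills", 297], [2, "New England Patriots", 319]],
    columns=pd.MultiIndex.from_tuples(
        [
            ("Unnamed: 0_level_0", "Rk"),
            ("Unnamed: 1_level_0", "Tm"),
            ("Passing", "Cmp"),
        ]
    ),
)

flat = flatten_columns(sample)
print(list(flat.columns))
```

Applied to the list from option 2, this could be written as df5 = [flatten_columns(t) for t in df5]. Tables with a single header row pass through unchanged.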

Output of option 1:

print(df5[2])
      Rk                        Tm     G      Cmp  ...  Sk%  NY/A  ANY/A     EXP
0    1.0             Buffalo Bills  17.0    297.0  ...  7.3  4.80    3.8   20.04
1    2.0      New England Patriots  17.0    319.0  ...  6.3  5.50    4.5   -0.86
2    3.0             Chicago Bears  17.0    314.0  ...  9.3  6.20    6.7  -78.80
3    4.0         Carolina Panthers  17.0    337.0  ...  7.0  5.90    6.1  -47.59
4    5.0          Cleveland Browns  17.0    367.0  ...  6.9  5.60    5.5  -51.01
5    6.0       San Francisco 49ers  17.0    372.0  ...  8.1  5.90    6.1  -91.78
6    7.0         Arizona Cardinals  17.0    367.0  ...  6.8  6.10    6.1  -49.69
7    8.0            Denver Broncos  17.0    341.0  ...  6.0  6.10    5.9  -43.39
8    9.0       Pittsburgh Steelers  17.0    355.0  ...  8.9  5.90    5.7  -44.60
9   10.0         Green Bay Packers  17.0    379.0  ...  6.1  5.80    5.5  -33.40
10  11.0       Philadelphia Eagles  17.0    409.0  ...  4.7  6.10    6.1  -81.44
11  12.0      Los Angeles Chargers  17.0    357.0  ...  5.9  6.30    6.4  -89.74
12  13.0         Las Vegas Raiders  17.0    400.0  ...  5.5  5.90    6.4 -159.12
13  14.0        New Orleans Saints  17.0    369.0  ...  7.2  6.00    5.3   -1.74
14  15.0           New York Giants  17.0    402.0  ...  5.3  6.00    5.7  -56.19
15  16.0            Miami Dolphins  17.0    373.0  ...  7.3  5.90    5.6  -26.14
16  17.0      Jacksonville Jaguars  17.0    377.0  ...  5.6  6.70    7.0 -137.16
17  18.0           Atlanta Falcons  17.0    391.0  ...  3.0  6.60    6.8 -119.11
18  19.0        Indianapolis Colts  17.0    390.0  ...  5.2  6.30    6.0  -73.90
19  20.0            Dallas Cowboys  17.0    364.0  ...  6.3  6.20    5.1   13.82
20  21.0      Tampa Bay Buccaneers  17.0    445.0  ...  6.5  5.60    5.3  -30.09
21  22.0          Los Angeles Rams  17.0    416.0  ...  7.4  6.10    5.3  -41.65
22  23.0            Houston Texans  17.0    363.0  ...  5.5  7.10    6.7 -131.46
23  24.0             Detroit Lions  17.0    359.0  ...  5.2  7.20    7.5 -161.97
24  25.0          Tennessee Titans  17.0    395.0  ...  6.4  6.20    5.9  -59.34
25  26.0        Cincinnati Bengals  17.0    420.0  ...  6.3  6.30    6.2  -67.23
26  27.0        Kansas City Chiefs  17.0    401.0  ...  4.8  6.70    6.5 -110.25
27  28.0         Minnesota Vikings  17.0    401.0  ...  7.5  6.40    6.1  -57.45
28  29.0  Washington Football Team  17.0    400.0  ...  6.0  6.80    7.1 -142.08
29  30.0             New York Jets  17.0    401.0  ...  5.3  7.10    7.5 -198.18
30  31.0          Seattle Seahawks  17.0    443.0  ...  4.9  6.50    6.5 -126.34
31  32.0          Baltimore Ravens  17.0    397.0  ...  5.2  7.20    7.6 -166.26
32   NaN                  Avg Team   NaN    378.8  ...  6.2  6.22    6.1  -76.40
33   NaN              League Total   NaN  12121.0  ...  6.2  6.22    6.1     NaN
34   NaN                  Avg Tm/G   NaN     22.3  ...  6.2  6.22    6.1     NaN

[35 rows x 25 columns]
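Because the position of a table in the list can change if the page layout changes, it may be more robust to pick a table by its column names than by index. The sketch below is one possible approach, not something from the original answer: find_table is a hypothetical helper, and the two small frames are stand-ins built from a few values in the output above. It also drops the summary rows (Avg Team, League Total, Avg Tm/G), which have no rank.

```python
import pandas as pd


def find_table(tables, required_columns):
    """Return the first DataFrame whose columns include every required name."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise LookupError(f"no table has columns {sorted(required)}")


# Stand-in for the list returned by read_html (values taken from the output above)
tables = [
    pd.DataFrame({"Rk": [1.0], "Tm": ["Buffalo Bills"], "G": [17.0]}),
    pd.DataFrame(
        {
            "Rk": [1.0, 2.0, None],
            "Tm": ["Buffalo Bills", "New England Patriots", "Avg Team"],
            "Sk%": [7.3, 6.3, 6.2],
            "NY/A": [4.80, 5.50, 6.22],
        }
    ),
]

# Select the table by its columns instead of by position
passing = find_table(tables, ["Sk%", "NY/A"])

# Keep only rows that have a rank, i.e. the actual teams
teams = passing[passing["Rk"].notna()]
print(teams)
```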
