How do I parse out specific dataframes when using pandas to web scrape data from a page with multiple dataframes?
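I'm working on a project that scrapes data from https://www.pro-football-reference.com/years/2021/opp.htm. When you visit this page, you'll see that it contains multiple tables. When I run the following code, I get the first table without any problem: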
import pandas as pd
year=2021
defense_url = 'https://www.pro-football-reference.com/years/{}/opp.htm'.format(year)
df5 = pd.read_html(defense_url, header=1)[0]
df5.head()
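However, when I try to get data from the other tables by changing the index, I get either a table with no headers or an error. For example, df5 = pd.read_html(defense_url, header=1)[1]
creates a dataframe with no header row.
In addition, df5 = pd.read_html(defense_url, header=1)[2]
raises an IndexError (shown below):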
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Input In [46], in <cell line: 1>()
----> 1 df5 = pd.read_html(defense_url, header=1)[2]
2 df5.head()
IndexError: list index out of range
Does anyone know what I might be doing wrong here?
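The IndexError occurs because the html response contains only 2 <table> tags. pd.read_html therefore returns a list of two dataframes, so when you ask for index position [2] (the third table), there is nothing at that position in the list.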
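The effect can be reproduced without touching the site. The sketch below uses a hand-made HTML snippet (not the real page) and the standard library's html.parser; the names SAMPLE_HTML, TableCounter and count_tables are made up for illustration. It shows that a table wrapped in an HTML comment is reported as a comment rather than as a table tag, and becomes a normal tag once the comment markers are stripped.

```python
from html.parser import HTMLParser

# Hand-made stand-in for the page: one live table, one commented-out table
SAMPLE_HTML = """
<table id="live"><tr><td>1</td></tr></table>
<!--
<table id="hidden"><tr><td>2</td></tr></table>
-->
"""


class TableCounter(HTMLParser):
    """Counts <table> start tags and comments that contain a table."""

    def __init__(self):
        super().__init__()
        self.live_tables = 0
        self.commented_tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.live_tables += 1

    def handle_comment(self, data):
        if "<table" in data:
            self.commented_tables += 1


def count_tables(html):
    """Return (live tables, tables hidden inside comments)."""
    counter = TableCounter()
    counter.feed(html)
    counter.close()
    return counter.live_tables, counter.commented_tables


before = count_tables(SAMPLE_HTML)
after = count_tables(SAMPLE_HTML.replace("<!--", "").replace("-->", ""))
print(before, after)  # (1, 1) (2, 0)
```

A parser-based reader such as pd.read_html only sees the live tables, which is consistent with the short list you get back.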
The other tables do exist in the html response, but inside comments. So there are two ways to get them.
Code for both is below:
1. Pull the tables out of the bs4 Comment nodes:
from io import StringIO

import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd

year = 2021
defense_url = 'https://www.pro-football-reference.com/years/{}/opp.htm'.format(year)

# Tables that are present as real <table> tags
df5 = pd.read_html(defense_url, header=1)

result = requests.get(defense_url).text
data = BeautifulSoup(result, 'html.parser')

# Collect every html comment on the page
comments = data.find_all(string=lambda text: isinstance(text, Comment))

for each in comments:
    if 'table' in str(each):
        try:
            # Use the second header row when the table has a two-level header
            level = 0
            if isinstance(pd.read_html(StringIO(str(each)))[0].columns, pd.MultiIndex):
                level = 1
            df5.append(pd.read_html(StringIO(str(each)), header=level)[0])
        except Exception:
            # Skip comments that mention 'table' but hold no parsable table
            continue
2. Remove/replace the html strings that mark the html as a comment:
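from io import StringIO

import requests
import pandas as pd

year = 2021
defense_url = 'https://www.pro-football-reference.com/years/{}/opp.htm'.format(year)

# Get the data
response = requests.get(defense_url)

# Strip the comment markers so the hidden tables become ordinary <table> tags
html = response.text.replace('<!--', '').replace('-->', '')

# Wrap the string in StringIO; newer pandas versions warn on literal html strings
df5 = pd.read_html(StringIO(html))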
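The only difference between the two is that with the second option you need one more step to find which dataframes have MultiIndex columns and handle those if needed.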
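That extra step could look something like the sketch below. flatten_header is a hypothetical helper, not part of pandas, and the sample frame is made up to mimic a two-row header; it simply keeps the lowest header level, which may or may not be what you want for every table on the page.

```python
import pandas as pd


def flatten_header(df):
    """Return df with only the lowest level of a MultiIndex header.

    Frames that already have a single-level header are returned unchanged.
    """
    if isinstance(df.columns, pd.MultiIndex):
        df = df.copy()
        df.columns = df.columns.get_level_values(-1)
    return df


# Made-up frame that mimics a table with a two-row header
two_level = pd.DataFrame(
    [[1, "Buffalo Bills", 297]],
    columns=pd.MultiIndex.from_tuples(
        [("Unnamed: 0_level_0", "Rk"), ("Unnamed: 1_level_0", "Tm"), ("Passing", "Cmp")]
    ),
)

flat = flatten_header(two_level)
print(list(flat.columns))  # ['Rk', 'Tm', 'Cmp']
```

Applied to the list from option 2, this would be df5 = [flatten_header(df) for df in df5].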
Output from option 1:
print(df5[2])
      Rk                        Tm     G      Cmp  ...  Sk%  NY/A  ANY/A      EXP
0    1.0             Buffalo Bills  17.0    297.0  ...  7.3  4.80    3.8    20.04
1    2.0      New England Patriots  17.0    319.0  ...  6.3  5.50    4.5    -0.86
2    3.0             Chicago Bears  17.0    314.0  ...  9.3  6.20    6.7   -78.80
3    4.0         Carolina Panthers  17.0    337.0  ...  7.0  5.90    6.1   -47.59
4    5.0          Cleveland Browns  17.0    367.0  ...  6.9  5.60    5.5   -51.01
5    6.0       San Francisco 49ers  17.0    372.0  ...  8.1  5.90    6.1   -91.78
6    7.0         Arizona Cardinals  17.0    367.0  ...  6.8  6.10    6.1   -49.69
7    8.0            Denver Broncos  17.0    341.0  ...  6.0  6.10    5.9   -43.39
8    9.0       Pittsburgh Steelers  17.0    355.0  ...  8.9  5.90    5.7   -44.60
9   10.0         Green Bay Packers  17.0    379.0  ...  6.1  5.80    5.5   -33.40
10  11.0       Philadelphia Eagles  17.0    409.0  ...  4.7  6.10    6.1   -81.44
11  12.0      Los Angeles Chargers  17.0    357.0  ...  5.9  6.30    6.4   -89.74
12  13.0         Las Vegas Raiders  17.0    400.0  ...  5.5  5.90    6.4  -159.12
13  14.0        New Orleans Saints  17.0    369.0  ...  7.2  6.00    5.3    -1.74
14  15.0           New York Giants  17.0    402.0  ...  5.3  6.00    5.7   -56.19
15  16.0            Miami Dolphins  17.0    373.0  ...  7.3  5.90    5.6   -26.14
16  17.0      Jacksonville Jaguars  17.0    377.0  ...  5.6  6.70    7.0  -137.16
17  18.0           Atlanta Falcons  17.0    391.0  ...  3.0  6.60    6.8  -119.11
18  19.0        Indianapolis Colts  17.0    390.0  ...  5.2  6.30    6.0   -73.90
19  20.0            Dallas Cowboys  17.0    364.0  ...  6.3  6.20    5.1    13.82
20  21.0      Tampa Bay Buccaneers  17.0    445.0  ...  6.5  5.60    5.3   -30.09
21  22.0          Los Angeles Rams  17.0    416.0  ...  7.4  6.10    5.3   -41.65
22  23.0            Houston Texans  17.0    363.0  ...  5.5  7.10    6.7  -131.46
23  24.0             Detroit Lions  17.0    359.0  ...  5.2  7.20    7.5  -161.97
24  25.0          Tennessee Titans  17.0    395.0  ...  6.4  6.20    5.9   -59.34
25  26.0        Cincinnati Bengals  17.0    420.0  ...  6.3  6.30    6.2   -67.23
26  27.0        Kansas City Chiefs  17.0    401.0  ...  4.8  6.70    6.5  -110.25
27  28.0         Minnesota Vikings  17.0    401.0  ...  7.5  6.40    6.1   -57.45
28  29.0  Washington Football Team  17.0    400.0  ...  6.0  6.80    7.1  -142.08
29  30.0             New York Jets  17.0    401.0  ...  5.3  7.10    7.5  -198.18
30  31.0          Seattle Seahawks  17.0    443.0  ...  4.9  6.50    6.5  -126.34
31  32.0          Baltimore Ravens  17.0    397.0  ...  5.2  7.20    7.6  -166.26
32   NaN                  Avg Team   NaN    378.8  ...  6.2  6.22    6.1   -76.40
33   NaN              League Total   NaN  12121.0  ...  6.2  6.22    6.1      NaN
34   NaN                  Avg Tm/G   NaN     22.3  ...  6.2  6.22    6.1      NaN
[35 rows x 25 columns]
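Because a table's position in the list depends on the order it appears in the page, it may be more robust to pick a dataframe by the columns it contains than by a fixed index. A minimal sketch, assuming the headers have already been reduced to a single level; find_table is a hypothetical helper and the sample frames are made up:

```python
import pandas as pd


def find_table(frames, required_columns):
    """Return the first frame whose columns include every required name, else None."""
    wanted = set(required_columns)
    for frame in frames:
        if wanted.issubset(set(frame.columns)):
            return frame
    return None


# Made-up frames standing in for the list returned by pd.read_html
frames = [
    pd.DataFrame({"Rk": [1], "Tm": ["Buffalo Bills"], "PF": [289]}),
    pd.DataFrame({"Rk": [1], "Tm": ["Buffalo Bills"], "Cmp": [297], "NY/A": [4.8]}),
]

passing = find_table(frames, ["Tm", "Cmp", "NY/A"])
print(list(passing.columns))  # ['Rk', 'Tm', 'Cmp', 'NY/A']
```

The helper returns None when nothing matches, so the caller can tell a missing table apart from an IndexError.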