Python Pandas read_html function missing some tables from Pro-Football-Reference
I am trying to read a lot of tables from a specific webpage in Python, and am struggling a bit. My first attempt used Pandas read_html due to its simplicity; so for example, I will be using this website:

https://www.pro-football-reference.com/years/2019/

For read_html, I tried the following:
import pandas as pd
url = 'https://www.pro-football-reference.com/years/2019'
allDfs = pd.read_html(url, header=0)
print(len(allDfs))
Which yields a count of 2 tables. However, if you follow that URL, you will see that there are many more than 2 tables on the page, and they aren't being caught by the read_html function.
Next, I tried using requests and BeautifulSoup, with the following code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.pro-football-reference.com/years/2019'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
allTables = soup.find_all('table')
print(len(allTables))
This also outputs only 2 tables. To take this a step further, I tried to inspect one of the tables further down the page, which exists in the raw HTML but is not being found; in this example, I will use the "Team Offense" table, which has a table tag and an id of "team_stats". However, this code returns 0 tables found:
from bs4 import BeautifulSoup
import requests
url = 'https://www.pro-football-reference.com/years/2019'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
allTables = soup.find_all('table', attrs={"id":"team_stats"})
print(len(allTables))
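(For context: this attempt fails because, on this site, most of the table markup is wrapped inside HTML comments, and BeautifulSoup parses a comment as a single Comment text node rather than as tags. Below is a minimal, self-contained sketch — using a toy HTML string rather than the live page, so the structure is only an approximation — of pulling a commented-out table by its id:)

```python
import pandas as pd
from bs4 import BeautifulSoup, Comment
from io import StringIO

# Toy HTML mimicking the page's structure: one ordinary table plus one
# wrapped in an HTML comment (which is how this site ships most tables).
html = """
<table id="visible"><tr><th>A</th></tr><tr><td>1</td></tr></table>
<!--
<table id="team_stats">
  <tr><th>Tm</th><th>PF</th></tr>
  <tr><td>Ravens</td><td>531</td></tr>
</table>
-->
"""

soup = BeautifulSoup(html, 'html.parser')
print(len(soup.find_all('table')))  # 1 -- the commented-out table is invisible

# A comment's contents are a single text node, so re-parse each comment
# as HTML, then search for the wanted table id inside it.
for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
    inner = BeautifulSoup(comment, 'html.parser')
    table = inner.find('table', id='team_stats')
    if table is not None:
        df = pd.read_html(StringIO(str(table)))[0]
        print(df)
```

On the real page you would run the same loop over the soup built from r.content.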
Finally, I came to Stack Overflow and found the following question/response:

Pandas read_html missing some tables

So following those directions, I used urllib.request in conjunction with BeautifulSoup and Pandas, and should be able to get the result... except I still only get 2 tables back:
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request
html_text = urllib.request.urlopen("https://www.pro-football-reference.com/years/2019/#all_team_stats")
bs_obj = BeautifulSoup(html_text,features='lxml')
tables = bs_obj.findAll('table')
dfs = list()
for table in tables:
    df = pd.read_html(str(table))[0]
    dfs.append(df)
print(len(dfs))
Can anyone here help me figure out why none of these methods are working? You can clearly see that there are many more than 2 tables on this page, but none of these methods are able to find them.
These tables are "hidden" within comments in the HTML, but, as stated by other users, they appear to be loaded dynamically. Anyhow, here's how to get them all:
import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment
response = requests.get('https://www.pro-football-reference.com/years/2019/')
soup = BeautifulSoup(response.text, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
tables = []
for each in comments:
    if 'table' in each:
        try:
            tables.append(pd.read_html(each)[0])
        except ValueError:
            # the comment mentions "table" but contains no parseable table
            continue

print(tables[-1].loc[1:])
This, for example, prints the last table, Drive Averages:
Unnamed: 0_level_0 Unnamed: 1_level_0 ... Average Drive
Rk Tm ... Time Pts
1 2.0 Carolina Panthers ... 2:25 1.74
2 3.0 New England Patriots ... 2:40 1.97
3 4.0 New York Jets ... 2:27 1.23
4 5.0 Seattle Seahawks ... 2:40 2.07
5 6.0 Philadelphia Eagles ... 2:50 1.96
6 7.0 Pittsburgh Steelers ... 2:28 1.40
7 8.0 Buffalo Bills ... 2:35 1.63
8 9.0 New York Giants ... 2:29 1.75
9 10.0 Tennessee Titans ... 2:28 2.02
10 11.0 Los Angeles Rams ... 2:31 1.98
11 12.0 Detroit Lions ... 2:33 1.77
12 13.0 San Francisco 49ers ... 2:46 2.44
13 14.0 Cleveland Browns ... 2:36 1.83
14 15.0 Miami Dolphins ... 2:32 1.61
15 16.0 Cincinnati Bengals ... 2:37 1.49
16 17.0 Arizona Cardinals ... 2:29 2.02
17 18.0 Green Bay Packers ... 2:51 2.11
18 19.0 Jacksonville Jaguars ... 2:45 1.63
19 20.0 Washington Redskins ... 2:28 1.50
20 21.0 Dallas Cowboys ... 2:42 2.43
21 22.0 Chicago Bears ... 2:46 1.51
22 23.0 Atlanta Falcons ... 2:51 2.12
23 24.0 New Orleans Saints ... 2:59 2.50
24 25.0 Minnesota Vikings ... 2:40 2.29
25 26.0 Denver Broncos ... 2:43 1.60
26 27.0 Houston Texans ... 2:51 2.18
27 28.0 Indianapolis Colts ... 2:52 2.02
28 29.0 Baltimore Ravens ... 3:21 2.95
29 30.0 Oakland Raiders ... 3:00 1.83
30 31.0 Kansas City Chiefs ... 2:52 2.59
31 32.0 Los Angeles Chargers ... 3:05 2.06
32 NaN League Total ... 2:41 1.94
[32 rows x 12 columns]
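A simpler variant (my own sketch, not part of the approach above): since the hidden tables are ordinary markup wrapped in <!-- ... -->, you can strip the comment delimiters from the raw HTML and let read_html parse everything in one pass. Demonstrated here on a toy string; on the live page you would apply the same substitution to requests.get(url).text before calling read_html:

```python
import re
import pandas as pd
from io import StringIO

# Toy page: one visible table and one hidden inside an HTML comment,
# mimicking how pro-football-reference serves most of its tables.
html = """
<table><tr><th>x</th></tr><tr><td>1</td></tr></table>
<!-- <table><tr><th>y</th></tr><tr><td>2</td></tr></table> -->
"""

print(len(pd.read_html(StringIO(html))))  # 1: the commented table is skipped

# Delete the comment markers so every table becomes visible markup
uncommented = re.sub(r'<!--|-->', '', html)
print(len(pd.read_html(StringIO(uncommented))))  # 2
```

This avoids looping over individual Comment nodes at the cost of parsing the whole page at once.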