I have been trying to figure out how to scrape a table from a website using BeautifulSoup (bs4), and I keep seeing the same type of code around this forum.
import requests
import pandas as pd
from bs4 import BeautifulSoup, NavigableString
url="https://www.basketball-reference.com/leagues/NBA_2020.html"
res = requests.get(url)
id="all_misc_stats"
html = BeautifulSoup(res.content, 'html.parser')
data=pd.read_html(html.find_all(string=lambda x: isinstance(x, NavigableString) and id in x))
pace=pd.read_html(data)[0]
I'm trying to get the Miscellaneous stats table, but it keeps telling me it is either out of range or cannot parse. What should I do?
The table you are looking for is placed inside an HTML comment, so one possible solution is to parse each comment as HTML and stop at the one that contains a table with the matching id.
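from io import StringIO
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup, Comment

url = "https://www.basketball-reference.com/leagues/NBA_2020.html"
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

# Collect every HTML comment on the page
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for c in comments:
    # Re-parse the comment body as HTML in its own right
    ele = BeautifulSoup(c.strip(), 'html.parser')
    if tbl := ele.find("table"):
        if tbl.get("id") == "misc_stats":
            # StringIO: newer pandas versions deprecate passing a literal HTML string
            pace = pd.read_html(StringIO(str(tbl)), header=1)[0]
            print(pace.head())
            break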
Output:
Rk Team Age W L PW ... TOV%.1 DRB% FT/FGA.1 Arena Attend. Attend./G
0 1.0 Milwaukee Bucks* 29.2 56.0 17.0 57 ... 12.0 81.6 0.178 Fiserv Forum 549036 17711
1 2.0 Los Angeles Clippers* 27.4 49.0 23.0 50 ... 12.2 77.6 0.206 STAPLES Center 610176 19068
2 3.0 Los Angeles Lakers* 29.5 52.0 19.0 48 ... 14.1 78.8 0.205 STAPLES Center 588907 18997
3 4.0 Toronto Raptors* 26.6 53.0 19.0 50 ... 14.6 76.7 0.202 Scotiabank Arena 633456 19796
4 5.0 Boston Celtics* 25.3 48.0 24.0 50 ... 13.5 77.4 0.215 TD Garden 610864 19090
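If you want to try the comment-parsing idea without hitting the site, here is a minimal offline sketch. The helper name find_commented_table and the SAMPLE_HTML page are made up for illustration (they are not part of bs4 or of the real page), and the sample only assumes the table sits inside a comment the way the answer describes.

```python
from bs4 import BeautifulSoup, Comment


def find_commented_table(page_html, table_id):
    """Return the <table> with the given id found inside an HTML comment, or None.

    Hypothetical helper for illustration; not part of bs4.
    """
    soup = BeautifulSoup(page_html, "html.parser")
    # Comments are special strings in the parse tree; walk all of them.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        if table_id not in comment:
            continue  # cheap text check before re-parsing
        # Re-parse the comment text as HTML in its own right.
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None


# Made-up sample page: the table exists only inside a comment.
SAMPLE_HTML = """
<div id="all_misc_stats">
<!--
<table id="misc_stats">
  <tr><th>Team</th><th>Pace</th></tr>
  <tr><td>Team A</td><td>100.0</td></tr>
  <tr><td>Team B</td><td>98.5</td></tr>
</table>
-->
</div>
"""

table = find_commented_table(SAMPLE_HTML, "misc_stats")
# Flatten the table into a list of rows, one list of cell texts per <tr>.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
print(rows)
```

The helper returns a single table element (or None) rather than a list. That is the main difference from the code in the question, where find_all returns a list of strings and that list is passed to pd.read_html instead of one piece of HTML. On the real page you could pass str(table), wrapped in StringIO, to pd.read_html as in the answer above.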