
How to scrape a table with bs4 in Python

I have been trying to figure out how to scrape a table from a website using BeautifulSoup (bs4), and I keep seeing the same kind of code on this forum.

from bs4 import NavigableString

url="https://www.basketball-reference.com/leagues/NBA_2020.html"
res = requests.get(url)
id="all_misc_stats"
html = BeautifulSoup(res.content, 'html.parser')
data=pd.read_html(html.find_all(string=lambda x: isinstance(x, NavigableString) and id in x))

pace=pd.read_html(data)[0]

I'm trying to get the Miscellaneous Stats table, but I keep getting either an index-out-of-range error or a parse error. What should I do?

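The table you are looking for sits inside an HTML comment, so the parser never sees it as a regular <table> element. In your code, find_all also returns a list of matching strings rather than HTML markup, so it cannot be passed to pd.read_html directly. One possible solution is to collect the comment nodes, parse each one as its own HTML fragment, and stop at the table whose id matches.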

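from io import StringIO
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup, Comment

url = "https://www.basketball-reference.com/leagues/NBA_2020.html"
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

# The table is wrapped in an HTML comment, so collect every comment node
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
pace = None
for c in comments:
    # Parse the comment body as its own HTML fragment
    ele = BeautifulSoup(c.strip(), 'html.parser')
    if tbl := ele.find("table"):
        if tbl.get("id") == "misc_stats":
            # header=1 takes the column names from the second header row;
            # StringIO avoids the literal-HTML deprecation warning in newer pandas
            pace = pd.read_html(StringIO(str(tbl)), header=1)[0]
            break

print(pace.head())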

Output:

    Rk                   Team   Age     W     L  PW  ...  TOV%.1  DRB%  FT/FGA.1             Arena  Attend.  Attend./G
0  1.0       Milwaukee Bucks*  29.2  56.0  17.0  57  ...    12.0  81.6     0.178      Fiserv Forum   549036      17711
1  2.0  Los Angeles Clippers*  27.4  49.0  23.0  50  ...    12.2  77.6     0.206    STAPLES Center   610176      19068
2  3.0    Los Angeles Lakers*  29.5  52.0  19.0  48  ...    14.1  78.8     0.205    STAPLES Center   588907      18997
3  4.0       Toronto Raptors*  26.6  53.0  19.0  50  ...    14.6  76.7     0.202  Scotiabank Arena   633456      19796
4  5.0        Boston Celtics*  25.3  48.0  24.0  50  ...    13.5  77.4     0.215         TD Garden   610864      19090
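
To check the approach without hitting the site, here is a minimal offline sketch. SAMPLE_HTML is a made-up page, and find_commented_table and table_to_rows are hypothetical helper names used only for illustration; none of them come from the original answer. The sketch shows that a plain find("table") returns None while the table is commented out, and that parsing the comment body recovers it.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample page (an assumption for illustration): the table is
# wrapped in an HTML comment, as the answer describes for the real page.
SAMPLE_HTML = """
<div id="all_misc_stats">
<!--
<table id="misc_stats">
  <tr><th>Team</th><th>Age</th></tr>
  <tr><td>Milwaukee Bucks*</td><td>29.2</td></tr>
  <tr><td>Toronto Raptors*</td><td>26.6</td></tr>
</table>
-->
</div>
"""


def find_commented_table(html, table_id):
    """Return the <table> tag with the given id hidden in a comment, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # Re-parse the comment text as a standalone HTML fragment
        fragment = BeautifulSoup(str(comment), "html.parser")
        table = fragment.find("table", id=table_id)
        if table is not None:
            return table
    return None


def table_to_rows(table):
    """Convert a table tag into a list of rows, each a list of cell strings."""
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


# A direct search misses the table because it lives inside a comment
direct_hit = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table")
print(direct_hit)  # None

table = find_commented_table(SAMPLE_HTML, "misc_stats")
rows = table_to_rows(table)
print(rows)
```

Returning None when nothing matches lets the caller handle a missing table explicitly instead of hitting an undefined variable after the loop.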


 