
Python: how to scrape a table with bs4

I have been trying to figure out how to scrape a table from a website using BS4 and HTML, and I found the same kind of code on this forum.

from bs4 import NavigableString

url="https://www.basketball-reference.com/leagues/NBA_2020.html"
res = requests.get(url)
id="all_misc_stats"
html = BeautifulSoup(res.content, 'html.parser')
data=pd.read_html(html.find_all(string=lambda x: isinstance(x, NavigableString) and id in x))

pace=pd.read_html(data)[0]

I am trying to get the Miscellaneous stats table, but it keeps telling me that it is out of range or cannot be parsed. What should I do?

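The table data you are looking for is placed inside an HTML comment, so one possible solution is to parse the comment elements and return the table once a matching id is found.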

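from io import StringIO
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup, Comment  # Comment is the node type for HTML comments

url = "https://www.basketball-reference.com/leagues/NBA_2020.html"
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

pace = None  # stays None if the table is not found
# Collect every HTML comment node in the page
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for c in comments:
    # Re-parse the comment body as HTML and look for the target table
    ele = BeautifulSoup(c.strip(), 'html.parser')
    tbl = ele.find("table", id="misc_stats")
    if tbl is not None:
        # Wrap the markup in StringIO: newer pandas deprecates literal HTML strings.
        # header=1 uses the second header row as the column names.
        pace = pd.read_html(StringIO(str(tbl)), header=1)[0]
        break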

print(pace.head())

Output:

    Rk                   Team   Age     W     L  PW  ...  TOV%.1  DRB%  FT/FGA.1             Arena  Attend.  Attend./G
0  1.0       Milwaukee Bucks*  29.2  56.0  17.0  57  ...    12.0  81.6     0.178      Fiserv Forum   549036      17711
1  2.0  Los Angeles Clippers*  27.4  49.0  23.0  50  ...    12.2  77.6     0.206    STAPLES Center   610176      19068
2  3.0    Los Angeles Lakers*  29.5  52.0  19.0  48  ...    14.1  78.8     0.205    STAPLES Center   588907      18997
3  4.0       Toronto Raptors*  26.6  53.0  19.0  50  ...    14.6  76.7     0.202  Scotiabank Arena   633456      19796
4  5.0        Boston Celtics*  25.3  48.0  24.0  50  ...    13.5  77.4     0.215         TD Garden   610864      19090
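
The idea can also be checked offline, without requesting the real page. The following is only a minimal sketch: `SAMPLE_HTML`, `find_commented_table`, and `table_rows` are hypothetical names made up for illustration, and the tiny table stands in for the real markup, which is assumed to follow the same pattern (a table wrapped in an HTML comment). It shows that a plain `find` does not see the commented-out table, while re-parsing the comment text does.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical miniature page: the table is wrapped in an HTML comment,
# which is the situation described in the answer above.
SAMPLE_HTML = """
<div id="all_misc_stats">
<!--
<table id="misc_stats">
  <tr><th>Team</th><th>Pace</th></tr>
  <tr><td>Team A</td><td>100.5</td></tr>
  <tr><td>Team B</td><td>98.2</td></tr>
</table>
-->
</div>
"""


def find_commented_table(html, table_id):
    """Return the <table> with the given id hidden in an HTML comment, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # The comment body is plain text to the outer parser, so parse it again.
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None


def table_rows(table):
    """Flatten a <table> tag into a list of rows of cell text."""
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
visible = soup.find("table", id="misc_stats")  # None: the table is commented out
hidden = find_commented_table(SAMPLE_HTML, "misc_stats")
rows = table_rows(hidden)
print(visible)
print(rows)
```

Returning `None` when nothing matches lets the caller decide how to handle a missing table instead of failing later on an undefined variable.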

