How to scrape a table in Python with bs4

Question

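I have been trying to figure out how to scrape a table from a website using BS4 and the page's HTML. I based my attempt on the same kind of code I have seen elsewhere on this forum.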

from bs4 import NavigableString

url="https://www.basketball-reference.com/leagues/NBA_2020.html"
res = requests.get(url)
id="all_misc_stats"
html = BeautifulSoup(res.content, 'html.parser')
data=pd.read_html(html.find_all(string=lambda x: isinstance(x, NavigableString) and id in x))

pace=pd.read_html(data)[0]

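I am trying to get the Miscellaneous stats table, but it keeps telling me that the index is out of range or that the input cannot be parsed. What should I do?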

Answer 1

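The table data you are looking for is placed inside an HTML comment, so the parser never exposes it as a regular table element. One possible solution is to parse the comments themselves and return the table when a matching id is found.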

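from io import StringIO
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup, Comment  # Comment identifies HTML comment nodes

url = "https://www.basketball-reference.com/leagues/NBA_2020.html"
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

pace = None
# Collect every HTML comment on the page; the table sits inside one of them
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for c in comments:
    # Re-parse the comment body as HTML and look for the target table
    ele = BeautifulSoup(c.strip(), 'html.parser')
    tbl = ele.find("table", id="misc_stats")
    if tbl is not None:
        # header=1 uses the second header row as the column names;
        # StringIO avoids the literal-HTML deprecation in recent pandas
        pace = pd.read_html(StringIO(str(tbl)), header=1)[0]
        break

if pace is not None:
    print(pace.head())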

Output:

    Rk                   Team   Age     W     L  PW  ...  TOV%.1  DRB%  FT/FGA.1             Arena  Attend.  Attend./G
0  1.0       Milwaukee Bucks*  29.2  56.0  17.0  57  ...    12.0  81.6     0.178      Fiserv Forum   549036      17711
1  2.0  Los Angeles Clippers*  27.4  49.0  23.0  50  ...    12.2  77.6     0.206    STAPLES Center   610176      19068
2  3.0    Los Angeles Lakers*  29.5  52.0  19.0  48  ...    14.1  78.8     0.205    STAPLES Center   588907      18997
3  4.0       Toronto Raptors*  26.6  53.0  19.0  50  ...    14.6  76.7     0.202  Scotiabank Arena   633456      19796
4  5.0        Boston Celtics*  25.3  48.0  24.0  50  ...    13.5  77.4     0.215         TD Garden   610864      19090
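The comment-parsing step can also be tried offline. The following is a minimal sketch, not the site's actual markup: `SAMPLE_HTML` is a made-up page fragment that reuses a few values from the output above, and `find_commented_table` and `table_rows` are hypothetical helper names introduced only for illustration. It uses bs4 alone, so the behaviour can be checked without a network request.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample page (an assumption, not the real site markup):
# the table is wrapped in an HTML comment, so it is not part of the parsed tree.
SAMPLE_HTML = """
<div id="all_misc_stats">
<!--
<table id="misc_stats">
  <tr><th>Team</th><th>W</th><th>L</th></tr>
  <tr><td>Milwaukee Bucks*</td><td>56</td><td>17</td></tr>
  <tr><td>Toronto Raptors*</td><td>53</td><td>19</td></tr>
</table>
-->
</div>
"""


def find_commented_table(html, table_id):
    """Return the <table> with the given id hidden in an HTML comment, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # Cheap text check first, so unrelated comments are not re-parsed
        if table_id not in comment:
            continue
        table = BeautifulSoup(comment, "html.parser").find("table", id=table_id)
        if table is not None:
            return table
    return None


def table_rows(table):
    """Flatten a <table> element into a list of rows of cell text."""
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


# A plain search finds nothing, because the markup is only comment text
print(BeautifulSoup(SAMPLE_HTML, "html.parser").find("table"))

table = find_commented_table(SAMPLE_HTML, "misc_stats")
rows = table_rows(table)
print(rows)
```

Returning `None` when no comment contains the id lets the caller tell "table not found" apart from a parsing error, which is the situation the question ran into.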
