How to scrape a table in Python with bs4
I have been trying to figure out how to scrape a table from a website using BS4 and HTML, and I have seen the same kind of code on this forum.
from bs4 import NavigableString
url="https://www.basketball-reference.com/leagues/NBA_2020.html"
res = requests.get(url)
id="all_misc_stats"
html = BeautifulSoup(res.content, 'html.parser')
data=pd.read_html(html.find_all(string=lambda x: isinstance(x, NavigableString) and id in x))
pace=pd.read_html(data)[0]
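I am trying to get the Miscellaneous Stats table, but it keeps telling me that it is either out of range or cannot be parsed. What should I do?
The table data you are looking for is placed inside an HTML comment, so one possible solution is to parse the comment elements and stop when a table with the matching id is found.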
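from io import StringIO
from urllib.request import urlopen
from bs4 import BeautifulSoup, Comment  # import the Comment object
import pandas as pd

url = "https://www.basketball-reference.com/leagues/NBA_2020.html"
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

# Collect every HTML comment node in the page
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
pace = None
for c in comments:
    # Re-parse the comment body as HTML and look for a table inside it
    ele = BeautifulSoup(c.strip(), 'html.parser')
    if tbl := ele.find("table"):
        if tbl.get("id") == "misc_stats":
            # Wrap the markup in StringIO; newer pandas versions deprecate passing literal HTML strings
            pace = pd.read_html(StringIO(str(tbl)), header=1)[0]
            break

if pace is not None:
    print(pace.head())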
Output:
Rk Team Age W L PW ... TOV%.1 DRB% FT/FGA.1 Arena Attend. Attend./G
0 1.0 Milwaukee Bucks* 29.2 56.0 17.0 57 ... 12.0 81.6 0.178 Fiserv Forum 549036 17711
1 2.0 Los Angeles Clippers* 27.4 49.0 23.0 50 ... 12.2 77.6 0.206 STAPLES Center 610176 19068
2 3.0 Los Angeles Lakers* 29.5 52.0 19.0 48 ... 14.1 78.8 0.205 STAPLES Center 588907 18997
3 4.0 Toronto Raptors* 26.6 53.0 19.0 50 ... 14.6 76.7 0.202 Scotiabank Arena 633456 19796
4 5.0 Boston Celtics* 25.3 48.0 24.0 50 ... 13.5 77.4 0.215 TD Garden 610864 19090
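To try the comment-parsing idea without hitting the live site, here is a minimal offline sketch. The HTML snippet, the id "demo_stats", and the helper name table_rows_from_comments are made up for illustration and are not taken from basketball-reference.com. It extracts the rows with bs4 alone instead of pd.read_html, so it only shows the lookup step.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for a page that hides a table inside an HTML comment.
SAMPLE_HTML = """
<html><body>
<div id="all_demo_stats">
<!--
<table id="demo_stats">
  <tr><th>Team</th><th>W</th><th>L</th></tr>
  <tr><td>Alpha</td><td>10</td><td>2</td></tr>
  <tr><td>Beta</td><td>7</td><td>5</td></tr>
</table>
-->
</div>
</body></html>
"""


def table_rows_from_comments(html, table_id):
    """Return the rows of the commented-out table with the given id, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Comment is a NavigableString subclass, so filter on the Comment type.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        # Re-parse the comment body as its own HTML fragment.
        fragment = BeautifulSoup(comment, "html.parser")
        table = fragment.find("table", id=table_id)
        if table is not None:
            return [
                [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
                for row in table.find_all("tr")
            ]
    return None


rows = table_rows_from_comments(SAMPLE_HTML, "demo_stats")
print(rows)
```

Returning None when no comment contains the requested id makes a missing table easy to tell apart from an empty one, which helps if the site later renames the table.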