[英]Scraping Table With Python/BS4
Im trying to scrape the "Team Stats" table from http://www.pro-football-reference.com/boxscores/201602070den.htm with BS4 and Python 2.7. 我正在尝试使用BS4和Python 2.7从http://www.pro-football-reference.com/boxscores/201602070den.htm抓取“ Team Stats”表。 However Im unable to get anywhere close to it, 但是我无法接近任何地方,
url = 'http://www.pro-football-reference.com/boxscores/201602070den.htm'
page = requests.get(url)
soup = BeautifulSoup(page.text, "html5lib")
table=soup.findAll('table', {'id':"team_stats", "class":"stats_table"})
print table
I thought something like the above code would work but no luck. 我以为上面的代码可以工作,但是没有运气。
The problem in this case is that the "Team Stats" table is located inside a comment in the HTML source which you download with requests
. 在这种情况下,问题是“ Team Stats”表位于您通过requests
下载的HTML源代码中的注释内 。 Locate the comment and reparse it with BeautifulSoup
into a "soup" object: 找到注释,并使用BeautifulSoup
将其重新解析为“汤”对象:
import requests
from bs4 import BeautifulSoup, NavigableString
url = 'http://www.pro-football-reference.com/boxscores/201602070den.htm'
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'})
soup = BeautifulSoup(page.content, "html5lib")
comment = soup.find(text=lambda x: isinstance(x, NavigableString) and "team_stats" in x)
soup = BeautifulSoup(comment, "html5lib")
table = soup.find("table", id="team_stats")
print(table)
And/or, you can load the table into, for example, a pandas
dataframe which is very convenient to work with: 和/或,您可以将表加载到例如pandas
数据帧中 ,使用起来非常方便:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from bs4 import NavigableString
url = 'http://www.pro-football-reference.com/boxscores/201602070den.htm'
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'})
soup = BeautifulSoup(page.content, "html5lib")
comment = soup.find(text=lambda x: isinstance(x, NavigableString) and "team_stats" in x)
df = pd.read_html(comment)[0]
print(df)
Prints: 印刷品:
Unnamed: 0 DEN CAR
0 First Downs 11 21
1 Rush-Yds-TDs 28-90-1 27-118-1
2 Cmp-Att-Yd-TD-INT 13-23-141-0-1 18-41-265-0-1
3 Sacked-Yards 5-37 7-68
4 Net Pass Yards 104 197
5 Total Yards 194 315
6 Fumbles-Lost 3-1 4-3
7 Turnovers 2 4
8 Penalties-Yards 6-51 12-102
9 Third Down Conv. 1-14 3-15
10 Fourth Down Conv. 0-0 0-0
11 Time of Possession 27:13 32:47
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.