Scraping a table with Python / BS4

Question

I'm trying to scrape the "Team Stats" table from http://www.pro-football-reference.com/boxscores/201602070den.htm with BS4 and Python 2.7. However, I'm unable to get anywhere close to it:

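import requests
from bs4 import BeautifulSoup

url = 'http://www.pro-football-reference.com/boxscores/201602070den.htm'
page = requests.get(url)
soup = BeautifulSoup(page.text, "html5lib")
# Search for the table by id and class (this is the attempt that finds nothing)
table = soup.find_all('table', {'id': "team_stats", "class": "stats_table"})
print(table)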

I thought something like the above code would work, but no luck.

Answer 1

The problem here is that the "Team Stats" table sits inside a comment in the HTML source that you download with requests, so a direct search for the table tag finds nothing. Locate the comment and reparse it with BeautifulSoup into a new "soup" object:

import requests
from bs4 import BeautifulSoup, Comment

url = 'http://www.pro-football-reference.com/boxscores/201602070den.htm'
# Send a browser-like User-Agent header with the request.
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'})

soup = BeautifulSoup(page.content, "html5lib")
# The table markup is stored inside an HTML comment, so find that comment first.
comment = soup.find(string=lambda x: isinstance(x, Comment) and "team_stats" in x)

# Reparse the comment text so that its contents become real tags.
table_soup = BeautifulSoup(comment, "html5lib")
table = table_soup.find("table", id="team_stats")
print(table)
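
To see why the direct search fails, here is a minimal offline sketch. The HTML string is a made-up miniature of the page (an assumption, not the site's real markup), and it uses the built-in "html.parser" instead of html5lib so it runs without extra parsers:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical miniature page: the table exists only inside an HTML comment.
HTML = """
<div id="all_team_stats">
<!--
<table id="team_stats" class="stats_table">
<tr><th></th><th>DEN</th><th>CAR</th></tr>
<tr><th>First Downs</th><td>11</td><td>21</td></tr>
</table>
-->
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# A direct search returns None: comment contents are text, not parsed tags.
direct = soup.find("table", id="team_stats")

# Comment is a NavigableString subclass, so it can be matched by type.
comment = soup.find(string=lambda s: isinstance(s, Comment) and "team_stats" in s)

# Reparse the comment text to turn its contents into real tags.
table = BeautifulSoup(comment, "html.parser").find("table", id="team_stats")

# Collect the cell text row by row.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]

print(direct)
print(rows)
```

The first print shows None and the second shows the two rows of cell text, which is the same situation as on the real page: the table only becomes searchable after the comment is reparsed.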

Alternatively, you can load the table into a pandas DataFrame, which is very convenient to work with:

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment

url = 'http://www.pro-football-reference.com/boxscores/201602070den.htm'
# Send a browser-like User-Agent header with the request.
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'})

soup = BeautifulSoup(page.content, "html5lib")
# The table markup is stored inside an HTML comment, so find that comment first.
comment = soup.find(string=lambda x: isinstance(x, Comment) and "team_stats" in x)

# Wrap the markup in StringIO; newer pandas versions deprecate passing raw HTML strings.
df = pd.read_html(StringIO(comment))[0]
print(df)

Prints:

            Unnamed: 0            DEN            CAR
0          First Downs             11             21
1         Rush-Yds-TDs        28-90-1       27-118-1
2    Cmp-Att-Yd-TD-INT  13-23-141-0-1  18-41-265-0-1
3         Sacked-Yards           5-37           7-68
4       Net Pass Yards            104            197
5          Total Yards            194            315
6         Fumbles-Lost            3-1            4-3
7            Turnovers              2              4
8      Penalties-Yards           6-51         12-102
9     Third Down Conv.           1-14           3-15
10   Fourth Down Conv.            0-0            0-0
11  Time of Possession          27:13          32:47
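
Several rows in this output pack multiple numbers into one cell (for example "Rush-Yds-TDs" with "28-90-1"). As a rough sketch of one way to unpack them, the helper below splits label and value on hyphens when the piece counts match. The name split_stat is hypothetical, and the approach assumes no negative numbers appear in the cell, since a leading minus sign would be mistaken for a separator:

```python
def split_stat(label, value):
    """Split a compound stat such as 'Rush-Yds-TDs' / '28-90-1' into a dict.

    Hypothetical helper: falls back to {label: value} when the label is not
    compound, the piece counts differ, or a piece is not a plain integer.
    """
    names = label.split("-")
    parts = value.split("-")
    if (
        len(names) > 1
        and len(names) == len(parts)
        and all(part.isdigit() for part in parts)
    ):
        return {name: int(part) for name, part in zip(names, parts)}
    return {label: value}


rushing = split_stat("Rush-Yds-TDs", "28-90-1")
third_down = split_stat("Third Down Conv.", "1-14")
possession = split_stat("Time of Possession", "27:13")

print(rushing)
print(third_down)
print(possession)
```

Rows whose label has no hyphen, such as "Third Down Conv." or "Time of Possession", are returned unchanged, so they would need their own handling if you want numeric values from them.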
