Beautiful Soup: Extracting Data After href (not url)
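I'm new to BeautifulSoup and am trying to use it to pull some test data from NHL.com. This is my code so far, but I'm quite lost...
Here is a fragment of the HTML I want to extract data from: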
<tr>
<td rowspan="1" colspan="1"> … </td>
<td style="text-align: left;" rowspan="1" colspan="1">
<a href="/ice/player.htm?id=8474564">
Steven Stamkos
</a>
</td>
<td style="text-align: center;" rowspan="1" colspan="1">
<a href="javascript:void(0);" rel="TBL" onclick="loadTeamSpotlight(jQuery(this));" style="border-bottom:1px dotted;">
TBL
</a>
</td>
<td style="text-align: center;" rowspan="1" colspan="1">
C
</td>
<td style="center" rowspan="1" colspan="1">
16
</td>
<td style="center" rowspan="1" colspan="1">
14
</td>
<td style="center" rowspan="1" colspan="1">
9
</td>
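I want to extract the data in these fields for the whole page, which is about 30 different table rows. This is my Python code so far; I'm not sure where to go from here.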
from bs4 import BeautifulSoup
import requests
r = requests.get("http://www.nhl.com/ice/playerstats.htm?fetchKey=20142ALLSASAll&viewName=summary&sort=points&pg=1")
data = r.text
t_data=[]
soup = BeautifulSoup(data, "html.parser")
table = soup.find('table', {'class': 'data stats'})
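I know it isn't much, but I don't know how to proceed. Thanks for any help.
Edit: I solved the problem, and I hope this helps someone in the future. Here is my code: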
from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.nhl.com/ice/playerstats.htm?fetchKey=20142ALLSASAll&viewName=summary&sort=points&pg=1")
data = r.text

# one list per column of interest
player = []
team = []
goals = []
assists = []
points = []
i = 0

soup = BeautifulSoup(data, "html.parser")
table = soup.find('table', {'class': 'data stats'})
for row in table.find_all('tr'):
    cells = row.find_all('td')
    # keep only rows with 19 <td> cells; other rows are skipped
    if len(cells) == 19:
        player.append(cells[1].find(text=True))
        team.append(cells[2].find(text=True))
        goals.append(cells[5].find(text=True))
        assists.append(cells[6].find(text=True))
        points.append(cells[7].find(text=True))
        print(player[i], team[i], goals[i], assists[i], points[i])
        i = i + 1
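Since the live page may change or be unavailable, the same row-walking idea can be tried on an inline HTML string. The following is only a minimal sketch: `SAMPLE_HTML` and `extract_row` are hypothetical names, the sample row is a trimmed version of the fragment in the question (7 cells rather than the 19 on the real page), and `get_text(strip=True)` is used in place of `find(text=True)` so that whitespace around the text is dropped.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical sample based on the HTML fragment in the question.
SAMPLE_HTML = """
<table class="data stats">
  <tbody>
    <tr>
      <td>1</td>
      <td style="text-align: left;">
        <a href="/ice/player.htm?id=8474564">
          Steven Stamkos
        </a>
      </td>
      <td><a href="javascript:void(0);" rel="TBL">TBL</a></td>
      <td>C</td>
      <td>16</td>
      <td>14</td>
      <td>9</td>
    </tr>
  </tbody>
</table>
"""


def extract_row(tr):
    """Return the stripped text of every <td> in one table row."""
    return [cell.get_text(strip=True) for cell in tr.find_all("td")]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"class": "data stats"})

# One list of cell strings per table row.
parsed = [extract_row(tr) for tr in table.find_all("tr")]
print(parsed)
```

Working from a saved or inline sample like this makes it easier to check the column indexes before pointing the script at the real site.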
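I just wanted to post another approach, so you don't have to use 6 different lists to store data that belongs together. There is also a shorter, more elegant way to get all the expected rows.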
# getting data
#...
from bs4 import BeautifulSoup
from collections import namedtuple

soup = BeautifulSoup(data, "html.parser")

# that's where the data are collected
rows = list()

# named tuple to store the relevant data of one player
Player = namedtuple('Player', ['name', 'team', 'goals', 'assists', 'points'])

# getting every row of the tbody in the specified table
for tr in soup.select('table.data.stats tbody tr'):
    # put text-contents of the row in a list
    cellStrings = [cell.find(text=True) for cell in tr.find_all('td')]
    # add it to the list of rows
    rows.append(
        Player(
            name=cellStrings[1],
            team=cellStrings[2],
            goals=cellStrings[5],
            assists=cellStrings[6],
            points=cellStrings[7]
        )
    )

rows
The result looks like this:
[Player(name=u'Steven Stamkos', team=u'TBL', goals=u'14', assists=u'9', points=u'23'),
Player(name=u'Sidney Crosby', team=u'PIT', goals=u'8', assists=u'15', points=u'23'),
Player(name=u'Ryan Getzlaf', team=u'ANA', goals=u'10', assists=u'12', points=u'22'),
Player(name=u'Alexander Steen', team=u'STL', goals=u'14', assists=u'7', points=u'21'),
Player(name=u'Corey Perry', team=u'ANA', goals=u'11', assists=u'10', points=u'21'),
Player(name=u'Alex Ovechkin', team=u'WSH', goals=u'13', assists=u'7', points=u'20'),
....
and individual entries are accessed like this:
>>> rows[20].name
u'Bryan Little'
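Note that the scraped fields are strings, so comparisons such as "who has the most goals" need a conversion first. A minimal sketch, assuming the same `Player` tuple and using three of the rows shown above as hard-coded sample data; `to_numeric` is a hypothetical helper, not part of the original answer:

```python
from collections import namedtuple

Player = namedtuple('Player', ['name', 'team', 'goals', 'assists', 'points'])

# Sample rows copied from the output above; values are strings as scraped.
scraped = [
    Player('Steven Stamkos', 'TBL', '14', '9', '23'),
    Player('Sidney Crosby', 'PIT', '8', '15', '23'),
    Player('Ryan Getzlaf', 'ANA', '10', '12', '22'),
]


def to_numeric(player):
    """Return a copy of the tuple with the stat fields converted to int."""
    return player._replace(
        goals=int(player.goals),
        assists=int(player.assists),
        points=int(player.points),
    )


numeric = [to_numeric(p) for p in scraped]

# With integers, numeric comparisons behave as expected.
top_scorer = max(numeric, key=lambda p: p.goals)
print(top_scorer.name, top_scorer.goals)
```

`_replace` returns a new tuple rather than modifying the old one, which fits the immutable nature of named tuples.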
You didn't mention exactly which data you need, but you can proceed along these lines:
from bs4 import BeautifulSoup
...
table = soup.find('table', {'class': 'data stats'})
# find_all returns every row; find would return only the first <tr>
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    for col in cols:
        print(col.text)
        # cells that contain a link also expose the link's attributes
        link = col.find("a")
        if link:
            print(link.get("href"), link.get("rel"), link.get("onclick"), link.text)
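The `href` attribute of the player link in the question's HTML is `/ice/player.htm?id=8474564`, so the player id can be read from its query string once the attribute has been extracted. A minimal sketch using only the standard library; `player_id_from_href` is a hypothetical helper name, and the assumption that the id always lives in an `id` query parameter comes from that single sample link:

```python
from urllib.parse import urlparse, parse_qs


def player_id_from_href(href):
    """Return the value of the 'id' query parameter, or None if absent."""
    query = parse_qs(urlparse(href).query)
    ids = query.get("id")
    return ids[0] if ids else None


# href values as they appear in the HTML fragment from the question
player_id = player_id_from_href("/ice/player.htm?id=8474564")
team_link_id = player_id_from_href("javascript:void(0);")

print(player_id)
print(team_link_id)
```

The team link uses a `javascript:` pseudo-URL with no query string, so the helper returns `None` for it rather than raising an error.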