
Beautiful Soup Extracting Data After href (not url)

I'm new to BeautifulSoup and am trying to use it to scrape some test data from NHL.com. This is my code so far, but I'm pretty lost...

Here is a snippet of the HTML I want to extract data from:

<tr>
    <td rowspan="1" colspan="1"> … </td>
    <td style="text-align: left;" rowspan="1" colspan="1">
        <a href="/ice/player.htm?id=8474564">

            Steven Stamkos

        </a>
    </td>
    <td style="text-align: center;" rowspan="1" colspan="1">
        <a href="javascript:void(0);" rel="TBL" onclick="loadTeamSpotlight(jQuery(this));" style="border-bottom:1px dotted;">

            TBL

        </a>
    </td>
    <td style="text-align: center;" rowspan="1" colspan="1">

        C

    </td>
    <td style="center" rowspan="1" colspan="1">

        16

    </td>
    <td style="center" rowspan="1" colspan="1">

        14

    </td>
    <td style="center" rowspan="1" colspan="1">

        9

    </td>

I want to extract the data from these fields for the whole page, so roughly 30 different table rows. This is my Python code so far, and I'm not sure where to go from here.

from bs4 import BeautifulSoup
import requests

r  = requests.get("http://www.nhl.com/ice/playerstats.htm?fetchKey=20142ALLSASAll&viewName=summary&sort=points&pg=1")

data = r.text
t_data=[]
soup = BeautifulSoup(data)
table = soup.find('table', {'class': 'data stats'})

I know it isn't much, but I don't know what to do next. Thanks for any help.

EDIT: I solved the problem and hope this helps someone in the future. Here is my code:

from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.nhl.com/ice/playerstats.htm?fetchKey=20142ALLSASAll&viewName=summary&sort=points&pg=1")

player = []
team = []
goals = []
assists = []
points = []
i = 0
data = r.text
soup = BeautifulSoup(data)
table = soup.find('table', {'class': 'data stats'})
for rows in table.find_all('tr'):
    cells = rows.find_all('td')
    # only full stat rows have 19 cells; header and spacer rows are skipped
    if len(cells) == 19:
        player.append(cells[1].find(text=True))   # player name
        team.append(cells[2].find(text=True))     # team abbreviation
        goals.append(cells[5].find(text=True))    # goals
        assists.append(cells[6].find(text=True))  # assists
        points.append(cells[7].find(text=True))   # points
        print(player[i], team[i], goals[i], assists[i], points[i])
        i = i + 1
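
If you need more than the first page of results, the same approach can be extended by looping over the pg query parameter that is already visible in the URL above. The following is only a rough sketch, not part of the original solution; the page count of 3 is a placeholder you would need to adjust:

from bs4 import BeautifulSoup
import requests

base_url = ("http://www.nhl.com/ice/playerstats.htm"
            "?fetchKey=20142ALLSASAll&viewName=summary&sort=points&pg={}")

players = []
for page in range(1, 4):  # placeholder page count; adjust to the real number of pages
    r = requests.get(base_url.format(page))
    soup = BeautifulSoup(r.text)
    table = soup.find('table', {'class': 'data stats'})
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        if len(cells) == 19:  # same check as above: only full stat rows
            players.append((cells[1].find(text=True),   # player
                            cells[2].find(text=True),   # team
                            cells[5].find(text=True),   # goals
                            cells[6].find(text=True),   # assists
                            cells[7].find(text=True)))  # points

print(len(players), "players scraped")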

I just want to post another approach, so you don't have to use six different lists to store data that belongs together. There is also a shorter, more elegant way to get exactly the rows you want.

# getting data
#...
from bs4 import BeautifulSoup
from collections import namedtuple
soup = BeautifulSoup(data)
# this is where the data is collected
rows = list()
# named tuple to store the relevant data of one player
Player = namedtuple('Player', ['name', 'team', 'goals', 'assists', 'points'])
# getting every row of the tbody in the specified table
for tr in soup.select('table.data.stats tbody tr'):
    # put the text contents of the row in a list
    cellStrings = [cell.find(text=True) for cell in tr.findAll('td')]
    # add it to the list of rows
    rows.append(
        Player(
            name=cellStrings[1],
            team=cellStrings[2],
            goals=cellStrings[5],
            assists=cellStrings[6],
            points=cellStrings[7]
        )
    )

rows then looks like this:

[Player(name=u'Steven Stamkos', team=u'TBL', goals=u'14', assists=u'9', points=u'23'),
 Player(name=u'Sidney Crosby', team=u'PIT', goals=u'8', assists=u'15', points=u'23'),
 Player(name=u'Ryan Getzlaf', team=u'ANA', goals=u'10', assists=u'12', points=u'22'),
 Player(name=u'Alexander Steen', team=u'STL', goals=u'14', assists=u'7', points=u'21'),
 Player(name=u'Corey Perry', team=u'ANA', goals=u'11', assists=u'10', points=u'21'),
 Player(name=u'Alex Ovechkin', team=u'WSH', goals=u'13', assists=u'7', points=u'20'),
 ....

and can be accessed like this:

>>> rows[20].name
u'Bryan Little'
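
Because Player is a namedtuple, the collected rows can also be sorted or written to CSV without any extra bookkeeping. A small Python 3 illustration; the output filename and the int() conversion of the points column are just assumptions for the example:

import csv

# sort by points, highest first (the scraped values are strings, so convert)
top = sorted(rows, key=lambda p: int(p.points), reverse=True)

# write everything to a CSV file; the header comes straight from the namedtuple fields
with open('players.csv', 'w', newline='') as f:  # hypothetical output filename
    writer = csv.writer(f)
    writer.writerow(Player._fields)
    writer.writerows(top)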

You didn't mention exactly which data you need, but you can proceed along these lines:

from BeautifulSoup import BeautifulSoup
...
table = soup.find('table', {'class': 'data stats'})
rows = table.findAll('tr')  # findAll, not find: we want every row, not just the first
for row in rows:
    cols = row.findAll('td')
    for col in cols:
        print col.text
        link = col.find("a")
        if link:
            print link.get("href"), link.get("rel"), link.get("onclick"), link.text
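
Since the question title is specifically about the text that follows an href, a narrower variant of the same idea is to select only the player links and read both the href and the link text. This is a sketch using bs4, and it assumes the player links always point at /ice/player.htm as in the HTML snippet above:

from bs4 import BeautifulSoup

soup = BeautifulSoup(data)
# select only the anchors inside the stats table that link to a player page
for link in soup.select('table.data.stats a[href^="/ice/player.htm"]'):
    href = link.get('href')           # e.g. /ice/player.htm?id=8474564
    name = link.get_text(strip=True)  # the text after the href, e.g. Steven Stamkos
    print(href, name)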

