Beautiful Soup Extracting Data After href (not url)

I am new to BeautifulSoup and am trying to use it to grab some test data from NHL.com. I have some code so far (shown further down), but I am pretty lost.

Here is a snippet of the HTML code I want to extract data from:

<tr>
    <td rowspan="1" colspan="1"> … </td>
    <td style="text-align: left;" rowspan="1" colspan="1">
        <a href="/ice/player.htm?id=8474564">

            Steven Stamkos

        </a>
    </td>
    <td style="text-align: center;" rowspan="1" colspan="1">
        <a href="javascript:void(0);" rel="TBL" onclick="loadTeamSpotlight(jQuery(this));" style="border-bottom:1px dotted;">

            TBL

        </a>
    </td>
    <td style="text-align: center;" rowspan="1" colspan="1">

        C

    </td>
    <td style="center" rowspan="1" colspan="1">

        16

    </td>
    <td style="center" rowspan="1" colspan="1">

        14

    </td>
    <td style="center" rowspan="1" colspan="1">

        9

    </td>

I would like to extract data from these fields for the entire page, which has about 30 table rows. Here is my Python code so far; I'm not really sure where to go from here.

from bs4 import BeautifulSoup
import requests

r  = requests.get("http://www.nhl.com/ice/playerstats.htm?fetchKey=20142ALLSASAll&viewName=summary&sort=points&pg=1")

data = r.text
t_data=[]
soup = BeautifulSoup(data, "html.parser")
table = soup.find('table', {'class': 'data stats'})

I know it isn't much, but I have no idea how to go about this. Thanks for the help, everyone.

EDIT: I solved the problem, and hopefully this will help someone in the future. Here is my code:

from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.nhl.com/ice/playerstats.htm?fetchKey=20142ALLSASAll&viewName=summary&sort=points&pg=1")

# one list per column of interest
player = []
team = []
goals = []
assists = []
points = []
i = 0
data = r.text
# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(data, "html.parser")
table = soup.find('table', {'class': 'data stats'})
for row in table.find_all('tr'):
    cells = row.find_all('td')
    # player rows have exactly 19 cells; rows with any other cell count are skipped
    if len(cells) == 19:
        player.append(cells[1].find(text=True))
        team.append(cells[2].find(text=True))
        goals.append(cells[5].find(text=True))
        assists.append(cells[6].find(text=True))
        points.append(cells[7].find(text=True))
        print(player[i], team[i], goals[i], assists[i], points[i])
        i = i + 1
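
The loop above can be tried without hitting the network. The following is only a sketch under a few assumptions: the inline HTML is a trimmed copy of the row from the question (the first cell is left elided, and the points cell holding 23 is assumed from the column order used above), and the name `records` is made up for illustration. It uses `get_text(strip=True)` so that whitespace around the link text does not end up in the result.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the row from the question; the points cell (23) is an assumption.
HTML = """
<table class="data stats">
  <tbody>
    <tr>
      <td> … </td>
      <td style="text-align: left;">
        <a href="/ice/player.htm?id=8474564">
            Steven Stamkos
        </a>
      </td>
      <td style="text-align: center;">
        <a href="javascript:void(0);" rel="TBL">
            TBL
        </a>
      </td>
      <td> C </td>
      <td> 16 </td>
      <td> 14 </td>
      <td> 9 </td>
      <td> 23 </td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", class_="data stats")

records = []  # hypothetical name: one dict per player row
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) < 8:
        # not a player row in this trimmed layout
        continue
    # get_text(strip=True) drops the whitespace around nested link text
    text = [cell.get_text(strip=True) for cell in cells]
    records.append({
        "player": text[1],
        "team": text[2],
        "goals": int(text[5]),
        "assists": int(text[6]),
        "points": int(text[7]),
    })

print(records)
```

Keeping one dict per row means the five values for a player stay together, so there is no index `i` to keep in sync across lists.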

I just wanted to post another approach, so you don't have to use six different lists to store connected data. Additionally, there is a shorter and more elegant way of getting all the intended rows.

# getting data
#...
from bs4 import BeautifulSoup
from collections import namedtuple
soup = BeautifulSoup(data, "html.parser")
# this is where the data is collected
rows = []
# named tuple to store the relevant data of one player
Player = namedtuple('Player', ['name', 'team', 'goals', 'assists', 'points'])
# getting every row of the tbody in the specified table
for tr in soup.select('table.data.stats tbody tr'):
    # put the text contents of the row's cells in a list
    cell_strings = [cell.find(text=True) for cell in tr.find_all('td')]
    # add the player to the list of rows
    rows.append(
        Player(
            name=cell_strings[1],
            team=cell_strings[2],
            goals=cell_strings[5],
            assists=cell_strings[6],
            points=cell_strings[7]
        )
    )

rows then looks like this:

[Player(name=u'Steven Stamkos', team=u'TBL', goals=u'14', assists=u'9', points=u'23'),
 Player(name=u'Sidney Crosby', team=u'PIT', goals=u'8', assists=u'15', points=u'23'),
 Player(name=u'Ryan Getzlaf', team=u'ANA', goals=u'10', assists=u'12', points=u'22'),
 Player(name=u'Alexander Steen', team=u'STL', goals=u'14', assists=u'7', points=u'21'),
 Player(name=u'Corey Perry', team=u'ANA', goals=u'11', assists=u'10', points=u'21'),
 Player(name=u'Alex Ovechkin', team=u'WSH', goals=u'13', assists=u'7', points=u'20'),
 ....

Fields can then be accessed by name:

>>> rows[20].name
u'Bryan Little'
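
Note that every field in these tuples is still a string. If you want to sort or sum the values, one possible follow-up is to convert the counters to integers first. This is just a sketch using the six rows from the output above; `to_numbers`, `typed_rows`, `by_goals` and `total_goals` are names made up for illustration.

```python
from collections import namedtuple

Player = namedtuple('Player', ['name', 'team', 'goals', 'assists', 'points'])

# The rows from the output shown above, still holding strings
rows = [
    Player('Steven Stamkos', 'TBL', '14', '9', '23'),
    Player('Sidney Crosby', 'PIT', '8', '15', '23'),
    Player('Ryan Getzlaf', 'ANA', '10', '12', '22'),
    Player('Alexander Steen', 'STL', '14', '7', '21'),
    Player('Corey Perry', 'ANA', '11', '10', '21'),
    Player('Alex Ovechkin', 'WSH', '13', '7', '20'),
]


def to_numbers(player):
    """Return a copy of the record with the three counters as ints."""
    return player._replace(
        goals=int(player.goals),
        assists=int(player.assists),
        points=int(player.points),
    )


typed_rows = [to_numbers(p) for p in rows]

# sort by goals, highest first; ties keep their original order
by_goals = sorted(typed_rows, key=lambda p: p.goals, reverse=True)
total_goals = sum(p.goals for p in typed_rows)

for p in by_goals:
    print(p.name, p.team, p.goals)
print("total goals:", total_goals)
```

Since namedtuples are immutable, `_replace` builds a new record instead of modifying the old one, so the original string rows stay available.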

You have not mentioned exactly what data you need, but you can proceed along these lines:

from bs4 import BeautifulSoup
...
table = soup.find('table', {'class': 'data stats'})
# find_all returns every row; find would return only the first one
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    for col in cols:
        print(col.text)
        link = col.find("a")
        if link:
            # attributes missing from the tag come back as None
            print(link.get("href"), link.get("rel"), link.get("onclick"), link.text)
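
If the href itself is of interest, for example to get the player id out of a link like the one in the question, the standard library can take the URL apart. This is a rough sketch: `player_id_from_href` is a hypothetical helper, and it assumes the id always sits in the `id` query parameter, as it does in the snippet from the question.

```python
from urllib.parse import urlparse, parse_qs


def player_id_from_href(href):
    """Return the numeric id from a player link, or None if there is none."""
    if not href:
        return None
    # parse_qs maps each query parameter to a list of its values
    query = parse_qs(urlparse(href).query)
    values = query.get("id")
    if not values or not values[0].isdigit():
        return None
    return int(values[0])


# the two href values that appear in the question's HTML snippet
print(player_id_from_href("/ice/player.htm?id=8474564"))
print(player_id_from_href("javascript:void(0);"))
```

The team link in the snippet only carries a `javascript:` href with no query string, so the helper returns None for it and it can be skipped.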
