
Beautiful Soup Extracting Data After href (not url)

I'm new to BeautifulSoup and I'm trying to use it to pull some test data from NHL.com. My code so far is further down, but I'm pretty lost...

Here is a snippet of the HTML I want to extract data from:

<tr>
    <td rowspan="1" colspan="1"> … </td>
    <td style="text-align: left;" rowspan="1" colspan="1">
        <a href="/ice/player.htm?id=8474564">

            Steven Stamkos

        </a>
    </td>
    <td style="text-align: center;" rowspan="1" colspan="1">
        <a href="javascript:void(0);" rel="TBL" onclick="loadTeamSpotlight(jQuery(this));" style="border-bottom:1px dotted;">

            TBL

        </a>
    </td>
    <td style="text-align: center;" rowspan="1" colspan="1">

        C

    </td>
    <td style="center" rowspan="1" colspan="1">

        16

    </td>
    <td style="center" rowspan="1" colspan="1">

        14

    </td>
    <td style="center" rowspan="1" colspan="1">

        9

    </td>

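I want to extract the data in these fields for the whole page, which means about 30 different table rows. This is my Python code so far, and I'm not sure where to go from here.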

from bs4 import BeautifulSoup
import requests

r  = requests.get("http://www.nhl.com/ice/playerstats.htm?fetchKey=20142ALLSASAll&viewName=summary&sort=points&pg=1")

data = r.text
t_data=[]
soup = BeautifulSoup(data, "html.parser")
table = soup.find('table', {'class': 'data stats'})

I know it isn't much, but I don't know how to proceed. Thanks for any help.

Edit: I solved the problem, and I hope this helps someone in the future. Here is my code:

from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.nhl.com/ice/playerstats.htm?fetchKey=20142ALLSASAll&viewName=summary&sort=points&pg=1")

# one list per column of interest; index i in each list is the same player
player = []
team = []
goals = []
assists = []
points = []

# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(r.text, "html.parser")
table = soup.find('table', {'class': 'data stats'})
for row in table.find_all('tr'):
    cells = row.find_all('td')
    # player rows on this page have 19 cells; anything else is skipped
    if len(cells) == 19:
        # get_text(strip=True) drops the whitespace around each value
        player.append(cells[1].get_text(strip=True))
        team.append(cells[2].get_text(strip=True))
        goals.append(cells[5].get_text(strip=True))
        assists.append(cells[6].get_text(strip=True))
        points.append(cells[7].get_text(strip=True))
        print(player[-1], team[-1], goals[-1], assists[-1], points[-1])
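
For anyone who wants to try the cell indexing without hitting the site, here is a minimal sketch that runs against an inline copy of the row from the question. The `SAMPLE_HTML` string and the `extract_row` helper are my own illustration, not part of the original code. The sample has only 8 cells (the real rows have 19), the rank value `1` is a placeholder for the cell the question elides, and the points value is taken from the output quoted further down the thread.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modeled on the row in the question (assumption:
# trimmed to 8 cells, first cell value is a placeholder).
SAMPLE_HTML = """
<table class="data stats">
  <tbody>
    <tr>
      <td>1</td>
      <td style="text-align: left;">
        <a href="/ice/player.htm?id=8474564">

            Steven Stamkos

        </a>
      </td>
      <td style="text-align: center;">
        <a href="javascript:void(0);" rel="TBL">

            TBL

        </a>
      </td>
      <td>C</td>
      <td>16</td>
      <td>14</td>
      <td>9</td>
      <td>23</td>
    </tr>
  </tbody>
</table>
"""


def extract_row(tr):
    """Return (player, team, goals, assists, points) for one <tr>."""
    cells = tr.find_all("td")
    # get_text(strip=True) trims the whitespace around each text node
    texts = [cell.get_text(strip=True) for cell in cells]
    return texts[1], texts[2], texts[5], texts[6], texts[7]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"class": "data stats"})
first_row = extract_row(table.find("tr"))
print(first_row)
```

Using `get_text(strip=True)` matters here because the text inside each cell is surrounded by newlines and indentation, so the first text node of a cell can be pure whitespace.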

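I just wanted to post another approach, so you don't have to keep related data in several separate lists. There is also a shorter, more elegant way to get all the expected rows.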

# getting data
#...
from bs4 import BeautifulSoup
from collections import namedtuple

soup = BeautifulSoup(data, "html.parser")
# this is where the data is collected
rows = []
# named tuple to store the relevant data of one player
Player = namedtuple('Player', ['name', 'team', 'goals', 'assists', 'points'])
# get every row of the tbody in the specified table
for tr in soup.select('table.data.stats tbody tr'):
    # put the text contents of the row's cells in a list
    cell_strings = [cell.get_text(strip=True) for cell in tr.find_all('td')]
    # add the player to the list of rows
    rows.append(
        Player(
            name=cell_strings[1],
            team=cell_strings[2],
            goals=cell_strings[5],
            assists=cell_strings[6],
            points=cell_strings[7]
        )
    )

rows then looks like this:

[Player(name=u'Steven Stamkos', team=u'TBL', goals=u'14', assists=u'9', points=u'23'),
 Player(name=u'Sidney Crosby', team=u'PIT', goals=u'8', assists=u'15', points=u'23'),
 Player(name=u'Ryan Getzlaf', team=u'ANA', goals=u'10', assists=u'12', points=u'22'),
 Player(name=u'Alexander Steen', team=u'STL', goals=u'14', assists=u'7', points=u'21'),
 Player(name=u'Corey Perry', team=u'ANA', goals=u'11', assists=u'10', points=u'21'),
 Player(name=u'Alex Ovechkin', team=u'WSH', goals=u'13', assists=u'7', points=u'20'),
 ....

and individual fields are accessed like this:

>>> rows[20].name
u'Bryan Little'
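
The values come out of the page as strings, so sorting or summing them needs a conversion step first. As a rough sketch of one way to do that with the same `Player` tuple: the `to_numbers` helper below is a name I made up for illustration, and the three rows are copied from the output above.

```python
from collections import namedtuple

Player = namedtuple('Player', ['name', 'team', 'goals', 'assists', 'points'])

# First three rows from the output above, as plain strings
rows = [
    Player(name='Steven Stamkos', team='TBL', goals='14', assists='9', points='23'),
    Player(name='Sidney Crosby', team='PIT', goals='8', assists='15', points='23'),
    Player(name='Ryan Getzlaf', team='ANA', goals='10', assists='12', points='22'),
]


def to_numbers(player):
    """Return a copy of the row with the three stat fields converted to int."""
    # namedtuples are immutable, so _replace builds a new tuple
    return player._replace(
        goals=int(player.goals),
        assists=int(player.assists),
        points=int(player.points),
    )


numeric_rows = [to_numbers(p) for p in rows]

# Sort by goals, highest first
by_goals = sorted(numeric_rows, key=lambda p: p.goals, reverse=True)
top_scorer = by_goals[0]
print(top_scorer.name, top_scorer.goals)

# _asdict() gives a mapping, which is handy for csv.DictWriter or json.dumps
as_dict = numeric_rows[1]._asdict()
print(as_dict)
```

Without the conversion, sorting compares the strings character by character, so `'8'` would sort above `'14'`.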

You didn't say exactly which data you need, but you can proceed along these lines:

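from bs4 import BeautifulSoup
...
table = soup.find('table', {'class': 'data stats'})
# find_all returns every row; find would return only the first one
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    for col in cols:
        print(col.get_text(strip=True))
        link = col.find("a")
        if link:
            # the attributes that follow href, plus the link text
            print(link.get("href"), link.get("rel"), link.get("onclick"), link.get_text(strip=True))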
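
If what you are after is the player id inside the `href` value itself, the standard library can split it out once `link.get("href")` has given you the string. This is only a sketch: `player_id_from_href` is a hypothetical helper name, and it assumes the id always sits in an `id` query parameter, as it does in the snippet from the question.

```python
from urllib.parse import urlparse, parse_qs


def player_id_from_href(href):
    """Return the value of the `id` query parameter, or None if absent."""
    query = urlparse(href).query
    values = parse_qs(query).get("id")
    return values[0] if values else None


# The two href values that appear in the question's HTML snippet
print(player_id_from_href("/ice/player.htm?id=8474564"))
print(player_id_from_href("javascript:void(0);"))
```

The team link carries no query string, so the helper returns `None` for it instead of raising an error.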

