
Parsing tags using Beautiful Soup and Python

This is my code so far:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# URL of the page we will be scraping
url = "https://www.basketball-reference.com/leagues/NBA_2019_per_game.html"
# this is the HTML from the given URL
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
soup.find_all('tr', limit=10)

It returns:

<th aria-label="Personal Fouls Per Game" class=" poptip hide_non_quals center" data-stat="pf_per_g" data-tip="Personal Fouls Per Game" scope="col">PF</th>
 <th aria-label="Points Per Game" class=" poptip hide_non_quals center" data-stat="pts_per_g" data-tip="Points Per Game" scope="col">PTS</th>
 </tr>,
 <tr class="full_table"><th class="right " csk="1" data-stat="ranker" scope="row">1</th><td class="left " csk="Abrines,Álex" data-append-csv="abrinal01" data-stat="player"><a href="/players/a/abrinal01.html">Álex Abrines</a></td><td class="center " data-stat="pos">SG</td><td class="right " data-stat="age">25</td><td class="left " data-stat="team_id"><a href="/teams/OKC/2019.html">OKC</a></td><td class="right " data-stat="g">31</td><td class="right " data-stat="gs">2</td><td class="right non_qual" data-stat="mp_per_g">19.0</td><td class="right non_qual" data-stat="fg_per_g">1.8</td><td class="right non_qual" data-stat="fga_per_g">5.1</td><td class="right non_qual" data-stat="fg_pct">.357</td><td class="right non_qual" data-stat="fg3_per_g">1.3</td><td class="right non_qual" data-stat="fg3a_per_g">4.1</td><td class="right non_qual" data-stat="fg3_pct">.323</td><td class="right non_qual" data-stat="fg2_per_g">0.5</td><td class="right non_qual" data-stat="fg2a_per_g">1.0</td><td class="right non_qual" data-stat="fg2_pct">.500</td><td class="right non_qual" data-stat="efg_pct">.487</td><td class="right non_qual" data-stat="ft_per_g">0.4</td><td class="right non_qual" data-stat="fta_per_g">0.4</td><td class="right non_qual" data-stat="ft_pct">.923</td><td class="right non_qual" data-stat="orb_per_g">0.2</td><td class="right non_qual" data-stat="drb_per_g">1.4</td><td class="right non_qual" data-stat="trb_per_g">1.5</td><td class="right non_qual" data-stat="ast_per_g">0.6</td><td class="right non_qual" data-stat="stl_per_g">0.5</td><td class="right non_qual" data-stat="blk_per_g">0.2</td><td class="right non_qual" data-stat="tov_per_g">0.5</td><td class="right non_qual" data-stat="pf_per_g">1.7</td><td class="right non_qual" data-stat="pts_per_g">5.3</td></tr>,
 <tr class="full_table"><th class="right " csk="2" data-stat="ranker" scope="row">2</th><td class="left " csk="Acy,Quincy" data-append-csv="acyqu01" data-stat="player"><a href="/players/a/acyqu01.html">Quincy Acy</a></td><td class="center " data-stat="pos">PF</td><td class="right " data-stat="age">28</td><td class="left " data-stat="team_id"><a href="/teams/PHO/2019.html">PHO</a></td><td class="right " data-stat="g">10</td><td class="right iz" data-stat="gs">0</td><td class="right non_qual" data-stat="mp_per_g">12.3</td><td class="right non_qual" data-stat="fg_per_g">0.4</td><td class="right non_qual" data-stat="fga_per_g">1.8</td><td class="right non_qual" data-stat="fg_pct">.222</td><td class="right non_qual" data-stat="fg3_per_g">0.2</td><td class="right non_qual" data-stat="fg3a_per_g">1.5</td><td class="right non_qual" data-stat="fg3_pct">.133</td><td class="right non_qual" data-stat="fg2_per_g">0.2</td><td class="right non_qual" data-stat="fg2a_per_g">0.3</td><td class="right non_qual" data-stat="fg2_pct">.667</td><td class="right non_qual" data-stat="efg_pct">.278</td><td class="right non_qual" data-stat="ft_per_g">0.7</td><td class="right non_qual" data-stat="fta_per_g">1.0</td><td class="right non_qual" data-stat="ft_pct">.700</td><td class="right non_qual" data-stat="orb_per_g">0.3</td><td class="right non_qual" data-stat="drb_per_g">2.2</td><td class="right non_qual" data-stat="trb_per_g">2.5</td><td class="right non_qual" data-stat="ast_per_g">0.8</td><td class="right non_qual" data-stat="stl_per_g">0.1</td><td class="right non_qual" data-stat="blk_per_g">0.4</td><td class="right non_qual" data-stat="tov_per_g">0.4</td><td class="right non_qual" data-stat="pf_per_g">2.4</td><td class="right non_qual" data-stat="pts_per_g">1.7</td></tr>,
 <tr class="full_table"><th class="right " csk="3" data-stat="ranker" scope="row">3</th><td class="left " csk="Adams,Jaylen" data-append-csv="adamsja01" data-stat="player"><a href="/players/a/adamsja01.html">Jaylen Adams</a></td><td class="center " data-stat="pos">PG</td><td class="right " data-stat="age">22</td><td class="left " data-stat="team_id"><a href="/teams/ATL/2019.html">ATL</a></td><td class="right " data-stat="g">34</td><td class="right " data-stat="gs">1</td><td class="right non_qual" data-stat="mp_per_g">12.6</td><td class="right non_qual" data-stat="fg_per_g">1.1</td><td class="right non_qual" data-stat="fga_per_g">3.2</td><td class="right non_qual" data-stat="fg_pct">.345</td><td class="right non_qual" data-stat="fg3_per_g">0.7</td><td class="right non_qual" data-stat="fg3a_per_g">2.2</td><td class="right non_qual" data-stat="fg3_pct">.338</td><td class="right non_qual" data-stat="fg2_per_g">0.4</td><td class="right non_qual" data-stat="fg2a_per_g">1.1</td><td class="right non_qual" data-stat="fg2_pct">.361</td><td class="right non_qual" data-stat="efg_pct">.459</td><td class="right non_qual" data-stat="ft_per_g">0.2</td><td class="right non_qual" data-stat="fta_per_g">0.3</td><td class="right non_qual" data-stat="ft_pct">.778</td><td class="right non_qual" data-stat="orb_per_g">0.3</td><td class="right non_qual" data-stat="drb_per_g">1.4</td><td class="right non_qual" data-stat="trb_per_g">1.8</td><td class="right non_qual" data-stat="ast_per_g">1.9</td><td class="right non_qual" data-stat="stl_per_g">0.4</td><td class="right non_qual" data-stat="blk_per_g">0.1</td><td class="right non_qual" data-stat="tov_per_g">0.8</td><td class="right non_qual" data-stat="pf_per_g">1.3</td><td class="right non_qual" data-stat="pts_per_g">3.2</td></tr>,
 <tr class="full_table"><th class="right " csk="4" data-stat="ranker" scope="row">4</th><td class="left " csk="Adams,Steven" data-append-csv="adamsst01"

I would like to know how to get the a href and the data-append-csv for each tr class. For example, for the first tr class, the data-append-csv is abrinal01.

For a quick solution, you can try the following:

import re

# soup is the BeautifulSoup object built in the question
tags = soup.find_all('tr', limit=10)

for tag in tags:
    # search anywhere in the row's HTML for the attribute value
    m = re.search(r'data-append-csv="([^"]+)"', str(tag))
    if m:
        print(m.group(1))

The same approach works for href. For a general, reusable solution, you need more precise soup parsing or a more precise regular expression.
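As a rough sketch of the same idea applied to href, the pattern below pulls both values out of one row's HTML in a single pass. The `row_html` string is a shortened copy of the first row from the question, and `PLAYER_CELL` and `extract_player` are names made up for this illustration. It assumes the link immediately follows the opening `<td>` that carries `data-append-csv`, as in the question's output, so it shares the fragility of any regex-based approach.

```python
import re

# Shortened copy of the first row from the question (assumption: real rows look like this).
row_html = (
    '<tr class="full_table"><th class="right " csk="1" data-stat="ranker" scope="row">1</th>'
    '<td class="left " csk="Abrines,Álex" data-append-csv="abrinal01" data-stat="player">'
    '<a href="/players/a/abrinal01.html">Álex Abrines</a></td>'
    '<td class="left " data-stat="team_id"><a href="/teams/OKC/2019.html">OKC</a></td></tr>'
)

# Hypothetical pattern: the data-append-csv value, then the href of the link
# that opens right after the same <td> start tag.
PLAYER_CELL = re.compile(r'data-append-csv="([^"]+)"[^>]*><a href="([^"]+)"')


def extract_player(html_text):
    """Return (data-append-csv, href) for the first player cell, or None if absent."""
    m = PLAYER_CELL.search(html_text)
    return m.groups() if m else None


print(extract_player(row_html))
```

Rows without a player cell, such as the header row, simply return None.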

If you only want the data-append-csv and href values, you can use my code. I use list comprehensions together with find.

Code

from bs4 import BeautifulSoup
import requests

txt = '''
<th aria-label="Personal Fouls Per Game" class=" poptip hide_non_quals center" data-stat="pf_per_g" data-tip="Personal Fouls Per Game" scope="col">PF</th>
 <th aria-label="Points Per Game" class=" poptip hide_non_quals center" data-stat="pts_per_g" data-tip="Points Per Game" scope="col">PTS</th>
 </tr>,
 <tr class="full_table"><th class="right " csk="1" data-stat="ranker" scope="row">1</th><td class="left " csk="Abrines,Álex" data-append-csv="abrinal01" data-stat="player"><a href="/players/a/abrinal01.html">Álex Abrines</a></td><td class="center " data-stat="pos">SG</td><td class="right " data-stat="age">25</td><td class="left " data-stat="team_id"><a href="/teams/OKC/2019.html">OKC</a></td><td class="right " data-stat="g">31</td><td class="right " data-stat="gs">2</td><td class="right non_qual" data-stat="mp_per_g">19.0</td><td class="right non_qual" data-stat="fg_per_g">1.8</td><td class="right non_qual" data-stat="fga_per_g">5.1</td><td class="right non_qual" data-stat="fg_pct">.357</td><td class="right non_qual" data-stat="fg3_per_g">1.3</td><td class="right non_qual" data-stat="fg3a_per_g">4.1</td><td class="right non_qual" data-stat="fg3_pct">.323</td><td class="right non_qual" data-stat="fg2_per_g">0.5</td><td class="right non_qual" data-stat="fg2a_per_g">1.0</td><td class="right non_qual" data-stat="fg2_pct">.500</td><td class="right non_qual" data-stat="efg_pct">.487</td><td class="right non_qual" data-stat="ft_per_g">0.4</td><td class="right non_qual" data-stat="fta_per_g">0.4</td><td class="right non_qual" data-stat="ft_pct">.923</td><td class="right non_qual" data-stat="orb_per_g">0.2</td><td class="right non_qual" data-stat="drb_per_g">1.4</td><td class="right non_qual" data-stat="trb_per_g">1.5</td><td class="right non_qual" data-stat="ast_per_g">0.6</td><td class="right non_qual" data-stat="stl_per_g">0.5</td><td class="right non_qual" data-stat="blk_per_g">0.2</td><td class="right non_qual" data-stat="tov_per_g">0.5</td><td class="right non_qual" data-stat="pf_per_g">1.7</td><td class="right non_qual" data-stat="pts_per_g">5.3</td></tr>,
 <tr class="full_table"><th class="right " csk="2" data-stat="ranker" scope="row">2</th><td class="left " csk="Acy,Quincy" data-append-csv="acyqu01" data-stat="player"><a href="/players/a/acyqu01.html">Quincy Acy</a></td><td class="center " data-stat="pos">PF</td><td class="right " data-stat="age">28</td><td class="left " data-stat="team_id"><a href="/teams/PHO/2019.html">PHO</a></td><td class="right " data-stat="g">10</td><td class="right iz" data-stat="gs">0</td><td class="right non_qual" data-stat="mp_per_g">12.3</td><td class="right non_qual" data-stat="fg_per_g">0.4</td><td class="right non_qual" data-stat="fga_per_g">1.8</td><td class="right non_qual" data-stat="fg_pct">.222</td><td class="right non_qual" data-stat="fg3_per_g">0.2</td><td class="right non_qual" data-stat="fg3a_per_g">1.5</td><td class="right non_qual" data-stat="fg3_pct">.133</td><td class="right non_qual" data-stat="fg2_per_g">0.2</td><td class="right non_qual" data-stat="fg2a_per_g">0.3</td><td class="right non_qual" data-stat="fg2_pct">.667</td><td class="right non_qual" data-stat="efg_pct">.278</td><td class="right non_qual" data-stat="ft_per_g">0.7</td><td class="right non_qual" data-stat="fta_per_g">1.0</td><td class="right non_qual" data-stat="ft_pct">.700</td><td class="right non_qual" data-stat="orb_per_g">0.3</td><td class="right non_qual" data-stat="drb_per_g">2.2</td><td class="right non_qual" data-stat="trb_per_g">2.5</td><td class="right non_qual" data-stat="ast_per_g">0.8</td><td class="right non_qual" data-stat="stl_per_g">0.1</td><td class="right non_qual" data-stat="blk_per_g">0.4</td><td class="right non_qual" data-stat="tov_per_g">0.4</td><td class="right non_qual" data-stat="pf_per_g">2.4</td><td class="right non_qual" data-stat="pts_per_g">1.7</td></tr>,
 <tr class="full_table"><th class="right " csk="3" data-stat="ranker" scope="row">3</th><td class="left " csk="Adams,Jaylen" data-append-csv="adamsja01" data-stat="player"><a href="/players/a/adamsja01.html">Jaylen Adams</a></td><td class="center " data-stat="pos">PG</td><td class="right " data-stat="age">22</td><td class="left " data-stat="team_id"><a href="/teams/ATL/2019.html">ATL</a></td><td class="right " data-stat="g">34</td><td class="right " data-stat="gs">1</td><td class="right non_qual" data-stat="mp_per_g">12.6</td><td class="right non_qual" data-stat="fg_per_g">1.1</td><td class="right non_qual" data-stat="fga_per_g">3.2</td><td class="right non_qual" data-stat="fg_pct">.345</td><td class="right non_qual" data-stat="fg3_per_g">0.7</td><td class="right non_qual" data-stat="fg3a_per_g">2.2</td><td class="right non_qual" data-stat="fg3_pct">.338</td><td class="right non_qual" data-stat="fg2_per_g">0.4</td><td class="right non_qual" data-stat="fg2a_per_g">1.1</td><td class="right non_qual" data-stat="fg2_pct">.361</td><td class="right non_qual" data-stat="efg_pct">.459</td><td class="right non_qual" data-stat="ft_per_g">0.2</td><td class="right non_qual" data-stat="fta_per_g">0.3</td><td class="right non_qual" data-stat="ft_pct">.778</td><td class="right non_qual" data-stat="orb_per_g">0.3</td><td class="right non_qual" data-stat="drb_per_g">1.4</td><td class="right non_qual" data-stat="trb_per_g">1.8</td><td class="right non_qual" data-stat="ast_per_g">1.9</td><td class="right non_qual" data-stat="stl_per_g">0.4</td><td class="right non_qual" data-stat="blk_per_g">0.1</td><td class="right non_qual" data-stat="tov_per_g">0.8</td><td class="right non_qual" data-stat="pf_per_g">1.3</td><td class="right non_qual" data-stat="pts_per_g">3.2</td></tr>,
 <tr class="full_table"><th class="right " csk="4" data-stat="ranker" scope="row">4</th><td class="left " csk="Adams,Steven" data-append-csv="adamsst01"
 '''

#main scrape
bs = BeautifulSoup(txt, 'lxml')

#you may uncomment the following three lines to scrape directly from your url, the print results will be different
#url = 'https://www.basketball-reference.com/leagues/NBA_2019_per_game.html'
#html = requests.get(url)
#bs = BeautifulSoup(html.content, 'lxml')

tr = bs.find_all('tr')

#data-append-csv is an attribute of <td class='left', ..., data-append-csv=...>
cells = [row.find('td', {'class':'left'}) for row in tr]
dacsv = [cell['data-append-csv'] if cell is not None else None for cell in cells]

#href is an attribute of the first <a href=...> in each row
links = [row.find('a') for row in tr]
href = [link['href'] if link is not None else None for link in links]

print(list(zip(dacsv, href)))

#[('abrinal01', '/players/a/abrinal01.html'), ('acyqu01', '/players/a/acyqu01.html'), ('adamsja01', '/players/a/adamsja01.html'), ('adamsst01', None)]

Note: if you want to see all the attributes of a tag, you can do the following (and then pick out the attribute you want):

temp = [row.find('td', {'class':'left'}).attrs if row.find('td', {'class':'left'}) is not None else None for row in tr]

print(temp)
#[{'class': ['left'], 'csk': 'Abrines,Álex', 'data-append-csv': 'abrinal01', 'data-stat': 'player'}, {'class': ['left'], 'csk': 'Acy,Quincy', 'data-append-csv': 'acyqu01', 'data-stat': 'player'}, {'class': ['left'], 'csk': 'Adams,Jaylen', 'data-append-csv': 'adamsja01', 'data-stat': 'player'}, {'class': ['left'], 'csk': 'Adams,Steven', 'data-append-csv': 'adamsst01'}]
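
Since `.attrs` is a plain dict, `.get()` can stand in for the `[...]` lookup when an attribute may be missing. A minimal sketch, assuming a shortened two-row table made up for this illustration and the built-in `html.parser` in place of `lxml`:

```python
from bs4 import BeautifulSoup

# Made-up fragment: one header row and one player row, shortened from the question's output.
snippet = '''
<table>
 <tr><th data-stat="pf_per_g">PF</th></tr>
 <tr class="full_table"><td class="left " csk="Acy,Quincy" data-append-csv="acyqu01" data-stat="player"><a href="/players/a/acyqu01.html">Quincy Acy</a></td></tr>
</table>
'''

soup = BeautifulSoup(snippet, 'html.parser')

ids = []
for row in soup.find_all('tr'):
    cell = row.find('td')
    # Tag.get() returns None for a missing attribute; a row with no <td> also yields None.
    ids.append(cell.get('data-append-csv') if cell is not None else None)

print(ids)
```

The header row contributes None here instead of raising, which keeps the result aligned with the rows.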

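If you just want to extract the data-append-csv and href values, you can zip the two matching lists and extract them in a loop. I will look into whether the .left class selector can be dropped when working with the full HTML.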

from bs4 import BeautifulSoup as bs

html = '''
<html>
 <head></head>
 <body>
  <table> 
   <tbody>
    <tr> 
     <th aria-label="Personal Fouls Per Game" class=" poptip hide_non_quals center" data-stat="pf_per_g" data-tip="Personal Fouls Per Game" scope="col">PF</th> 
     <th aria-label="Points Per Game" class=" poptip hide_non_quals center" data-stat="pts_per_g" data-tip="Points Per Game" scope="col">PTS</th> 
    </tr> 
    <tr class="full_table"> 
     <th class="right " csk="1" data-stat="ranker" scope="row">1</th> 
     <td class="left " csk="Abrines,Álex" data-append-csv="abrinal01" data-stat="player"><a href="/players/a/abrinal01.html">Álex Abrines</a></td> 
     <td class="center " data-stat="pos">SG</td> 
     <td class="right " data-stat="age">25</td> 
     <td class="left " data-stat="team_id"><a href="/teams/OKC/2019.html">OKC</a></td> 
     <td class="right " data-stat="g">31</td> 
     <td class="right " data-stat="gs">2</td> 
     <td class="right non_qual" data-stat="mp_per_g">19.0</td> 
     <td class="right non_qual" data-stat="fg_per_g">1.8</td> 
     <td class="right non_qual" data-stat="fga_per_g">5.1</td> 
     <td class="right non_qual" data-stat="fg_pct">.357</td> 
     <td class="right non_qual" data-stat="fg3_per_g">1.3</td> 
     <td class="right non_qual" data-stat="fg3a_per_g">4.1</td> 
     <td class="right non_qual" data-stat="fg3_pct">.323</td> 
     <td class="right non_qual" data-stat="fg2_per_g">0.5</td> 
     <td class="right non_qual" data-stat="fg2a_per_g">1.0</td> 
     <td class="right non_qual" data-stat="fg2_pct">.500</td> 
     <td class="right non_qual" data-stat="efg_pct">.487</td> 
     <td class="right non_qual" data-stat="ft_per_g">0.4</td> 
     <td class="right non_qual" data-stat="fta_per_g">0.4</td> 
     <td class="right non_qual" data-stat="ft_pct">.923</td> 
     <td class="right non_qual" data-stat="orb_per_g">0.2</td> 
     <td class="right non_qual" data-stat="drb_per_g">1.4</td> 
     <td class="right non_qual" data-stat="trb_per_g">1.5</td> 
     <td class="right non_qual" data-stat="ast_per_g">0.6</td> 
     <td class="right non_qual" data-stat="stl_per_g">0.5</td> 
     <td class="right non_qual" data-stat="blk_per_g">0.2</td> 
     <td class="right non_qual" data-stat="tov_per_g">0.5</td> 
     <td class="right non_qual" data-stat="pf_per_g">1.7</td> 
     <td class="right non_qual" data-stat="pts_per_g">5.3</td> 
    </tr> 
    <tr class="full_table"> 
     <th class="right " csk="2" data-stat="ranker" scope="row">2</th> 
     <td class="left " csk="Acy,Quincy" data-append-csv="acyqu01" data-stat="player"><a href="/players/a/acyqu01.html">Quincy Acy</a></td> 
     <td class="center " data-stat="pos">PF</td> 
     <td class="right " data-stat="age">28</td> 
     <td class="left " data-stat="team_id"><a href="/teams/PHO/2019.html">PHO</a></td> 
     <td class="right " data-stat="g">10</td> 
     <td class="right iz" data-stat="gs">0</td> 
     <td class="right non_qual" data-stat="mp_per_g">12.3</td> 
     <td class="right non_qual" data-stat="fg_per_g">0.4</td> 
     <td class="right non_qual" data-stat="fga_per_g">1.8</td> 
     <td class="right non_qual" data-stat="fg_pct">.222</td> 
     <td class="right non_qual" data-stat="fg3_per_g">0.2</td> 
     <td class="right non_qual" data-stat="fg3a_per_g">1.5</td> 
     <td class="right non_qual" data-stat="fg3_pct">.133</td> 
     <td class="right non_qual" data-stat="fg2_per_g">0.2</td> 
     <td class="right non_qual" data-stat="fg2a_per_g">0.3</td> 
     <td class="right non_qual" data-stat="fg2_pct">.667</td> 
     <td class="right non_qual" data-stat="efg_pct">.278</td> 
     <td class="right non_qual" data-stat="ft_per_g">0.7</td> 
     <td class="right non_qual" data-stat="fta_per_g">1.0</td> 
     <td class="right non_qual" data-stat="ft_pct">.700</td> 
     <td class="right non_qual" data-stat="orb_per_g">0.3</td> 
     <td class="right non_qual" data-stat="drb_per_g">2.2</td> 
     <td class="right non_qual" data-stat="trb_per_g">2.5</td> 
     <td class="right non_qual" data-stat="ast_per_g">0.8</td> 
     <td class="right non_qual" data-stat="stl_per_g">0.1</td> 
     <td class="right non_qual" data-stat="blk_per_g">0.4</td> 
     <td class="right non_qual" data-stat="tov_per_g">0.4</td> 
     <td class="right non_qual" data-stat="pf_per_g">2.4</td> 
     <td class="right non_qual" data-stat="pts_per_g">1.7</td> 
    </tr> 
    <tr class="full_table"> 
     <th class="right " csk="3" data-stat="ranker" scope="row">3</th> 
     <td class="left " csk="Adams,Jaylen" data-append-csv="adamsja01" data-stat="player"><a href="/players/a/adamsja01.html">Jaylen Adams</a></td> 
     <td class="center " data-stat="pos">PG</td> 
     <td class="right " data-stat="age">22</td> 
     <td class="left " data-stat="team_id"><a href="/teams/ATL/2019.html">ATL</a></td> 
     <td class="right " data-stat="g">34</td> 
     <td class="right " data-stat="gs">1</td> 
     <td class="right non_qual" data-stat="mp_per_g">12.6</td> 
     <td class="right non_qual" data-stat="fg_per_g">1.1</td> 
     <td class="right non_qual" data-stat="fga_per_g">3.2</td> 
     <td class="right non_qual" data-stat="fg_pct">.345</td> 
     <td class="right non_qual" data-stat="fg3_per_g">0.7</td> 
     <td class="right non_qual" data-stat="fg3a_per_g">2.2</td> 
     <td class="right non_qual" data-stat="fg3_pct">.338</td> 
     <td class="right non_qual" data-stat="fg2_per_g">0.4</td> 
     <td class="right non_qual" data-stat="fg2a_per_g">1.1</td> 
     <td class="right non_qual" data-stat="fg2_pct">.361</td> 
     <td class="right non_qual" data-stat="efg_pct">.459</td> 
     <td class="right non_qual" data-stat="ft_per_g">0.2</td> 
     <td class="right non_qual" data-stat="fta_per_g">0.3</td> 
     <td class="right non_qual" data-stat="ft_pct">.778</td> 
     <td class="right non_qual" data-stat="orb_per_g">0.3</td> 
     <td class="right non_qual" data-stat="drb_per_g">1.4</td> 
     <td class="right non_qual" data-stat="trb_per_g">1.8</td> 
     <td class="right non_qual" data-stat="ast_per_g">1.9</td> 
     <td class="right non_qual" data-stat="stl_per_g">0.4</td> 
     <td class="right non_qual" data-stat="blk_per_g">0.1</td> 
     <td class="right non_qual" data-stat="tov_per_g">0.8</td> 
     <td class="right non_qual" data-stat="pf_per_g">1.3</td> 
     <td class="right non_qual" data-stat="pts_per_g">3.2</td> 
    </tr>
   </tbody>
  </table>
 </body>
</html>
'''
soup = bs(html, 'lxml')

for name, link in zip(soup.select('[data-append-csv].left'), soup.select('[data-append-csv].left a')):  # you may wish to prefix both selectors with td
    print(name['data-append-csv'], link['href'])
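
The zip above relies on both selector lists lining up one to one. A possible alternative, sketched under the assumption that every player cell carries `data-append-csv` and contains at most one link, is to select the cells once and look up the link inside each. The `sample` fragment and the `players` name are made up for this illustration. The selector here drops `.left`, which works on this fragment but has not been checked against the full page.

```python
from bs4 import BeautifulSoup

# Made-up fragment shortened from the rows above.
sample = '''
<table><tbody>
 <tr class="full_table">
  <td class="left " data-append-csv="abrinal01" data-stat="player"><a href="/players/a/abrinal01.html">Álex Abrines</a></td>
  <td class="left " data-stat="team_id"><a href="/teams/OKC/2019.html">OKC</a></td>
 </tr>
 <tr class="full_table">
  <td class="left " data-append-csv="acyqu01" data-stat="player"><a href="/players/a/acyqu01.html">Quincy Acy</a></td>
  <td class="left " data-stat="team_id"><a href="/teams/PHO/2019.html">PHO</a></td>
 </tr>
</tbody></table>
'''

soup = BeautifulSoup(sample, 'html.parser')

players = []
for cell in soup.select('td[data-append-csv]'):
    link = cell.select_one('a')  # None if the cell has no link
    href = link['href'] if link is not None else None
    players.append((cell['data-append-csv'], href))

print(players)
```

Because each link is looked up inside its own cell, a cell without a link cannot shift the pairing for the rows after it.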

