
Scraping a specific table from a web page using python3 (web page has multiple tables)

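I am trying to extract data from a specific table on a web page. The page contains multiple tables, so I am trying to use the table ID to extract only the one I need.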

url: https://basketball.realgm.com/player/Luke-Nelson/Summary/50483

My code so far is below.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import ssl


# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

#URL input
url = 'https://basketball.realgm.com/player/Luke-Nelson/Summary/50483'
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")

table = soup.find('table', id='table-1696')
print(table)

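I assumed the print statement would print the table's HTML (this worked before on a page with only one table), but when I run the program it produces the following output:

Terminal Output

Ultimately my goal is to recreate the table in Python and export it to Excel, but I can't get over this first hurdle!

Here is the HTML of the table within the web page: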

 <table class="tablesaw compact tablesaw-swipe tablesaw-sortable" data.tablesaw-mode="swipe" data.tablesaw-mode-switch="" data.tablesaw-mode-exclude="columntoggle" data.tablesaw-sortable="" data.tablesaw-sortable-switch="" id="table-1696" style=""> <thead><tr class="per_game per_48 per_40 per_36 per_minute minute_per total"> <th data.tablesaw-sortable-col="" data.tablesaw-priority="persist" data.tablesaw-sortable-default-col="" class="tablesaw-cell-persist tablesaw-sortable-head tablesaw-sortable-ascending"><button class="tablesaw-sortable-btn">Season</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">Team</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">League</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">GP</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">GS</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">MIN</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">FGM</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">FGA</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">FG%</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">3PM</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">3PA</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">3P%</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button 
class="tablesaw-sortable-btn">FTM</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">FTA</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">FT%</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">OFF</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">DEF</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">TRB</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">AST</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">STL</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head tablesaw-cell-hidden"><button class="tablesaw-sortable-btn">BLK</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head tablesaw-cell-hidden"><button class="tablesaw-sortable-btn">PF</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head tablesaw-cell-hidden"><button class="tablesaw-sortable-btn">TOV</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head tablesaw-cell-hidden"><button class="tablesaw-sortable-btn">PTS</button></th> </tr></thead><tbody><tr class="per_game"> <td class="tablesaw-cell-persist">2012-13</td> <td id="teamLineinternational_reg_Per_Game_1"><a href="/international/league/47/adidas-Next-Generation-Tournament/team/1304/Team-England-U18-Men">Team England U18 Men</a></td> <td><a href="/international/league/47/adidas-Next-Generation-Tournament">ANGT</a></td> <td>3</td> <td>3</td> <td>33.3</td> <td>6.00</td> <td>16.33</td> <td>.367</td> <td>1.33</td> <td>4.33</td> <td>.308</td> <td>2.33</td> <td>2.67</td> <td>.875</td> <td>0.00</td> 
<td>3.33</td> <td>3.33</td> <td>5.67</td> <td>2.00</td> <td class="tablesaw-cell-hidden">0.33</td> <td class="tablesaw-cell-hidden">3.00</td> <td class="tablesaw-cell-hidden">3.67</td> <td class="tablesaw-cell-hidden">15.67</td> </tr> <tr class="per_game"> <td class="tablesaw-cell-persist">2017-18</td> <td id="teamLineinternational_reg_Per_Game_2"><a href="/international/league/4/Spanish-ACB/team/212/Coosur-Real-Betis">Coosur Real Betis</a></td> <td><a href="/international/league/4/Spanish-ACB">ACB</a></td> <td>34</td> <td>28</td> <td>23.2</td> <td>2.97</td> <td>6.74</td> <td>.441</td> <td>1.47</td> <td>3.59</td> <td>.410</td> <td>0.79</td> <td>1.03</td> <td>.771</td> <td>0.24</td> <td>1.91</td> <td>2.15</td> <td>1.68</td> <td>1.06</td> <td class="tablesaw-cell-hidden">0.03</td> <td class="tablesaw-cell-hidden">3.00</td> <td class="tablesaw-cell-hidden">1.82</td> <td class="tablesaw-cell-hidden">8.21</td> </tr> <tr class="per_game"> <td class="tablesaw-cell-persist">2019-20 *</td> <td id="teamLineinternational_reg_Per_Game_3">All Teams</td> <td>All Leagues</td> <td>17</td> <td>5</td> <td>16.7</td> <td>2.82</td> <td>7.29</td> <td>.387</td> <td>1.35</td> <td>3.88</td> <td>.348</td> <td>1.35</td> <td>1.59</td> <td>.852</td> <td>0.24</td> <td>0.94</td> <td>1.18</td> <td>2.47</td> <td>0.71</td> <td class="tablesaw-cell-hidden">0.18</td> <td class="tablesaw-cell-hidden">2.24</td> <td class="tablesaw-cell-hidden">1.59</td> <td class="tablesaw-cell-hidden">8.35</td> </tr> <tr class="per_game multiple-teams-highlight"> <td class="tablesaw-cell-persist">2019-20 *</td> <td id="teamLineinternational_reg_Per_Game_4"><a href="/international/league/4/Spanish-ACB/team/473/ICL-Manresa">ICL Manresa</a></td> <td><a href="/international/league/4/Spanish-ACB">ACB</a></td> <td>9</td> <td>1</td> <td>13.6</td> <td>1.78</td> <td>5.56</td> <td>.320</td> <td>0.56</td> <td>2.89</td> <td>.192</td> <td>1.56</td> <td>1.67</td> <td>.933</td> <td>0.33</td> <td>0.78</td> <td>1.11</td> <td>1.89</td> 
<td>0.22</td> <td class="tablesaw-cell-hidden">0.00</td> <td class="tablesaw-cell-hidden">1.89</td> <td class="tablesaw-cell-hidden">1.56</td> <td class="tablesaw-cell-hidden">5.67</td> </tr> <tr class="per_game multiple-teams-highlight"> <td class="tablesaw-cell-persist">2019-20 *</td> <td id="teamLineinternational_reg_Per_Game_5"><a href="/international/league/106/Basketball-Champions-League-Europe/team/473/ICL-Manresa">ICL Manresa</a></td> <td><a href="/international/league/106/Basketball-Champions-League-Europe">BCL-Eu</a></td> <td>8</td> <td>4</td> <td>20.3</td> <td>4.00</td> <td>9.25</td> <td>.432</td> <td>2.25</td> <td>5.00</td> <td>.450</td> <td>1.12</td> <td>1.50</td> <td>.750</td> <td>0.12</td> <td>1.12</td> <td>1.25</td> <td>3.12</td> <td>1.25</td> <td class="tablesaw-cell-hidden">0.38</td> <td class="tablesaw-cell-hidden">2.62</td> <td class="tablesaw-cell-hidden">1.62</td> <td class="tablesaw-cell-hidden">11.38</td> </tr> </tbody> <tfoot></tfoot> </table>

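Thank you for taking the time to read my question. I hope I have explained it well enough. I am new to coding/programming (I started a few weeks ago), so please bear that in mind when replying. Thanks again!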

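You can use pandas:

import pandas as pd

url = 'https://basketball.realgm.com/player/Luke-Nelson/Summary/50483'
tables = pd.read_html(url)  # returns a list of DataFrames, one per table found

print(len(tables))  # 29 when this answer was written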

You can then pick the table you want from that list.
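Picking a table by position (for example `tables[3]`) can break if the page adds or removes a table, so one option is to select it by the column names it should contain. The following is only a sketch: `pick_table` is a hypothetical helper, not part of pandas, and the small list of DataFrames stands in for the result of `pd.read_html(url)` (the second one reuses a few values from the question's HTML, the first is a made-up placeholder).

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include every required name."""
    wanted = set(required_columns)
    for candidate in tables:
        # Column labels are not always strings, so compare their string form.
        if wanted.issubset({str(col) for col in candidate.columns}):
            return candidate
    raise LookupError(f"no table has all of the columns {sorted(wanted)}")


# Stand-in for the list that pd.read_html(url) would return.
tables = [
    pd.DataFrame({"Award": ["(placeholder)"]}),  # hypothetical unrelated table
    pd.DataFrame(
        {
            "Season": ["2012-13", "2017-18"],
            "Team": ["Team England U18 Men", "Coosur Real Betis"],
            "GP": [3, 34],
            "PTS": [15.67, 8.21],
        }
    ),
]

stats = pick_table(tables, ["Season", "GP", "PTS"])
print(stats)
```

If several tables share the same columns (the page has per-game and totals variants), this returns the first match, so a more specific check may be needed.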

The table IDs are assigned dynamically, so I suggest accessing your table another way. Suppose you want the table for NBA Summer League Stats - Totals; try:

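import re

table_heading = 'NBA Summer League Stats - Totals'
# Find the heading text, move up to its enclosing tag, then take the next sibling element.
# The parentheses are required so the chained calls can span several lines.
table = (
    soup.find(string=re.compile(table_heading))
        .find_parent()
        .find_next_sibling()
)
print(table)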

You can change table_heading to any of the other table headings on the page. Let me know if this helps.
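To see the heading-based lookup work without fetching the live page, here is a self-contained sketch on a made-up snippet. The HTML layout (an `<h2>` directly followed by its table), the first heading, the id `table-2001`, and the helper name `table_after_heading` are all assumptions for illustration; the real page may be structured differently. It uses `find_next("table")` rather than `find_parent().find_next_sibling()`, which is a slightly more forgiving variant, and it also prints the ids of every table in the parsed HTML, which can help check whether the id you expect is actually present.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical, simplified page: each heading is followed by its table.
html = """
<h2>International Regular Season Stats - Per Game</h2>
<table id="table-1696"><tr><th>Season</th></tr><tr><td>2017-18</td></tr></table>
<h2>NBA Summer League Stats - Totals</h2>
<table id="table-2001"><tr><th>Season</th></tr><tr><td>2018-19</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")


def table_after_heading(soup, heading_text):
    """Return the first <table> after the text node containing heading_text."""
    # re.escape keeps characters such as '-' or '(' from being read as regex syntax.
    heading = soup.find(string=re.compile(re.escape(heading_text)))
    if heading is None:
        return None  # heading not present in this HTML
    return heading.find_next("table")


table = table_after_heading(soup, "NBA Summer League Stats - Totals")
print(table["id"])

# List the ids of all tables found in the parsed HTML.
print([t.get("id") for t in soup.find_all("table")])
```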

Use pandas to fetch the table tags, and use the id attribute to select the one you want:

import pandas as pd

url = 'https://basketball.realgm.com/player/Luke-Nelson/Summary/50483'
df = pd.read_html(url, attrs={'id':'table-1696'})[0]
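Since the stated goal is to recreate the table in Python and export it to Excel, here is a rough sketch of building a DataFrame by hand from a table element located with BeautifulSoup. The HTML is a trimmed copy of the question's table (first four columns of two rows, with attributes dropped and link targets replaced by `#`), and `table_to_dataframe` is a hypothetical helper name. It assumes the table has a `<thead>` and a `<tbody>`, as the one in the question does.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Trimmed copy of the question's table: four columns, two rows.
html = """
<table id="table-1696">
<thead><tr>
<th><button>Season</button></th><th><button>Team</button></th>
<th><button>League</button></th><th><button>GP</button></th>
</tr></thead>
<tbody>
<tr><td>2012-13</td><td><a href="#">Team England U18 Men</a></td>
<td><a href="#">ANGT</a></td><td>3</td></tr>
<tr><td>2017-18</td><td><a href="#">Coosur Real Betis</a></td>
<td><a href="#">ACB</a></td><td>34</td></tr>
</tbody>
</table>
"""


def table_to_dataframe(table):
    """Build a DataFrame from a <table> tag that has <thead> and <tbody>."""
    headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find("tbody").find_all("tr")
    ]
    return pd.DataFrame(rows, columns=headers)


soup = BeautifulSoup(html, "html.parser")
df = table_to_dataframe(soup.find("table", id="table-1696"))
df["GP"] = pd.to_numeric(df["GP"])  # cell text arrives as strings
print(df)
```

From there, `df.to_excel("stats.xlsx", index=False)` should write the spreadsheet; the file name is just an example, and pandas needs an Excel writer engine such as openpyxl installed for that call to work.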
