Scraping a specific table from a web page using python3 (web page has multiple tables)

I am trying to extract the data from a specific table on a web page. There are multiple tables on the page, so I am trying to use the table ID to extract only the required table.

url: https://basketball.realgm.com/player/Luke-Nelson/Summary/50483

The code I have so far is the following.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import ssl


# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

#URL input
url = 'https://basketball.realgm.com/player/Luke-Nelson/Summary/50483'
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")

table = soup.find('table', id='table-1696')
print(table)

I assumed the print statement would print the table's HTML (this has previously worked on a page with just one table), but when I run the program I get the following output:

Terminal Output

Ultimately I'm aiming to re-create the table in Python and export it to Excel, but I can't get over this first hurdle!

Here is the HTML for the table within the webpage

 <table class="tablesaw compact tablesaw-swipe tablesaw-sortable" data.tablesaw-mode="swipe" data.tablesaw-mode-switch="" data.tablesaw-mode-exclude="columntoggle" data.tablesaw-sortable="" data.tablesaw-sortable-switch="" id="table-1696" style=""> <thead><tr class="per_game per_48 per_40 per_36 per_minute minute_per total"> <th data.tablesaw-sortable-col="" data.tablesaw-priority="persist" data.tablesaw-sortable-default-col="" class="tablesaw-cell-persist tablesaw-sortable-head tablesaw-sortable-ascending"><button class="tablesaw-sortable-btn">Season</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">Team</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">League</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">GP</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">GS</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">MIN</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">FGM</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">FGA</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">FG%</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">3PM</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">3PA</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">3P%</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button 
class="tablesaw-sortable-btn">FTM</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">FTA</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">FT%</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">OFF</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">DEF</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">TRB</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">AST</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">STL</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head tablesaw-cell-hidden"><button class="tablesaw-sortable-btn">BLK</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head tablesaw-cell-hidden"><button class="tablesaw-sortable-btn">PF</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head tablesaw-cell-hidden"><button class="tablesaw-sortable-btn">TOV</button></th> <th data.tablesaw-sortable-col="" class="tablesaw-sortable-head tablesaw-cell-hidden"><button class="tablesaw-sortable-btn">PTS</button></th> </tr></thead><tbody><tr class="per_game"> <td class="tablesaw-cell-persist">2012-13</td> <td id="teamLineinternational_reg_Per_Game_1"><a href="/international/league/47/adidas-Next-Generation-Tournament/team/1304/Team-England-U18-Men">Team England U18 Men</a></td> <td><a href="/international/league/47/adidas-Next-Generation-Tournament">ANGT</a></td> <td>3</td> <td>3</td> <td>33.3</td> <td>6.00</td> <td>16.33</td> <td>.367</td> <td>1.33</td> <td>4.33</td> <td>.308</td> <td>2.33</td> <td>2.67</td> <td>.875</td> <td>0.00</td> 
<td>3.33</td> <td>3.33</td> <td>5.67</td> <td>2.00</td> <td class="tablesaw-cell-hidden">0.33</td> <td class="tablesaw-cell-hidden">3.00</td> <td class="tablesaw-cell-hidden">3.67</td> <td class="tablesaw-cell-hidden">15.67</td> </tr> <tr class="per_game"> <td class="tablesaw-cell-persist">2017-18</td> <td id="teamLineinternational_reg_Per_Game_2"><a href="/international/league/4/Spanish-ACB/team/212/Coosur-Real-Betis">Coosur Real Betis</a></td> <td><a href="/international/league/4/Spanish-ACB">ACB</a></td> <td>34</td> <td>28</td> <td>23.2</td> <td>2.97</td> <td>6.74</td> <td>.441</td> <td>1.47</td> <td>3.59</td> <td>.410</td> <td>0.79</td> <td>1.03</td> <td>.771</td> <td>0.24</td> <td>1.91</td> <td>2.15</td> <td>1.68</td> <td>1.06</td> <td class="tablesaw-cell-hidden">0.03</td> <td class="tablesaw-cell-hidden">3.00</td> <td class="tablesaw-cell-hidden">1.82</td> <td class="tablesaw-cell-hidden">8.21</td> </tr> <tr class="per_game"> <td class="tablesaw-cell-persist">2019-20 *</td> <td id="teamLineinternational_reg_Per_Game_3">All Teams</td> <td>All Leagues</td> <td>17</td> <td>5</td> <td>16.7</td> <td>2.82</td> <td>7.29</td> <td>.387</td> <td>1.35</td> <td>3.88</td> <td>.348</td> <td>1.35</td> <td>1.59</td> <td>.852</td> <td>0.24</td> <td>0.94</td> <td>1.18</td> <td>2.47</td> <td>0.71</td> <td class="tablesaw-cell-hidden">0.18</td> <td class="tablesaw-cell-hidden">2.24</td> <td class="tablesaw-cell-hidden">1.59</td> <td class="tablesaw-cell-hidden">8.35</td> </tr> <tr class="per_game multiple-teams-highlight"> <td class="tablesaw-cell-persist">2019-20 *</td> <td id="teamLineinternational_reg_Per_Game_4"><a href="/international/league/4/Spanish-ACB/team/473/ICL-Manresa">ICL Manresa</a></td> <td><a href="/international/league/4/Spanish-ACB">ACB</a></td> <td>9</td> <td>1</td> <td>13.6</td> <td>1.78</td> <td>5.56</td> <td>.320</td> <td>0.56</td> <td>2.89</td> <td>.192</td> <td>1.56</td> <td>1.67</td> <td>.933</td> <td>0.33</td> <td>0.78</td> <td>1.11</td> <td>1.89</td> 
<td>0.22</td> <td class="tablesaw-cell-hidden">0.00</td> <td class="tablesaw-cell-hidden">1.89</td> <td class="tablesaw-cell-hidden">1.56</td> <td class="tablesaw-cell-hidden">5.67</td> </tr> <tr class="per_game multiple-teams-highlight"> <td class="tablesaw-cell-persist">2019-20 *</td> <td id="teamLineinternational_reg_Per_Game_5"><a href="/international/league/106/Basketball-Champions-League-Europe/team/473/ICL-Manresa">ICL Manresa</a></td> <td><a href="/international/league/106/Basketball-Champions-League-Europe">BCL-Eu</a></td> <td>8</td> <td>4</td> <td>20.3</td> <td>4.00</td> <td>9.25</td> <td>.432</td> <td>2.25</td> <td>5.00</td> <td>.450</td> <td>1.12</td> <td>1.50</td> <td>.750</td> <td>0.12</td> <td>1.12</td> <td>1.25</td> <td>3.12</td> <td>1.25</td> <td class="tablesaw-cell-hidden">0.38</td> <td class="tablesaw-cell-hidden">2.62</td> <td class="tablesaw-cell-hidden">1.62</td> <td class="tablesaw-cell-hidden">11.38</td> </tr> </tbody> <tfoot></tfoot> </table>

Thank you for taking the time to read my question; hopefully I have explained it fully. I am very new to coding/programming (I started a couple of weeks ago), so please keep this in mind with any responses. Thanks again!

You can use pandas:

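import pandas as pd

url = 'https://basketball.realgm.com/player/Luke-Nelson/Summary/50483'
tables = pd.read_html(url)  # returns a list of DataFrames, one per table on the page

print(len(tables))  # 29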

You can then pick the table you want from the list by its index.
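
Because pd.read_html returns a plain list of DataFrames, one way to choose among them without depending on position is to look for the column names you expect. The following is only a minimal sketch: `pick_table` is a hypothetical helper (not part of pandas), the column names are taken from the table HTML in the question, and the two small DataFrames are stand-ins for the list that `pd.read_html(url)` would return.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        # Column labels are not always strings, so normalise before comparing.
        if required.issubset({str(col) for col in table.columns}):
            return table
    return None


# Stand-ins for the list returned by pd.read_html(url) (assumed data).
tables = [
    pd.DataFrame({"Date": ["2019-10-01"], "Event": ["Signed"]}),
    pd.DataFrame({"Season": ["2017-18"], "Team": ["Coosur Real Betis"], "PTS": [8.21]}),
]

stats = pick_table(tables, ["Season", "Team", "PTS"])
print(stats)
```

Several tables on a stats page may share the same columns, in which case this returns only the first match; checking a value in a known row, or using the heading-based lookup, would be needed to tell them apart.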

The table IDs are assigned dynamically, so I would suggest an alternative way to get to your table. Say you want the table for NBA Summer League Stats - Totals; try:

import re

table_heading = 'NBA Summer League Stats - Totals'
# Find the heading text, step up to its tag, then take the element that follows it
table = (
    soup.find(string=re.compile(table_heading))
    .find_parent()
    .find_next_sibling()
)
print(table)

You can change table_heading to any of the other headings on the page. Let me know if that helps.
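
To see the heading-based lookup work without hitting the site, here is a small self-contained sketch. The HTML snippet is made up for illustration: it assumes each table directly follows an `<h2>` heading, the second heading text and the id `table-1234` are hypothetical, and `table_after_heading` is a helper name introduced here. The real page layout may differ, so treat this as a starting point rather than a guaranteed recipe.

```python
import re

from bs4 import BeautifulSoup

# Made-up page fragment: each heading is immediately followed by its table.
html = """
<div>
  <h2>NBA Summer League Stats - Totals</h2>
  <table id="table-1234"><tr><td>2019</td></tr></table>
  <h2>International Regular Season Stats - Per Game</h2>
  <table id="table-1696"><tr><td>2017-18</td></tr></table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def table_after_heading(soup, heading):
    """Return the first <table> sibling after the tag containing `heading`."""
    # re.escape keeps characters such as "-" or "(" in the heading literal.
    text_node = soup.find(string=re.compile(re.escape(heading)))
    if text_node is None:
        return None
    # Step up from the text to its tag, then forward to the next table.
    return text_node.find_parent().find_next_sibling("table")


table = table_after_heading(soup, "International Regular Season Stats - Per Game")
print(table["id"])
```

Passing "table" to find_next_sibling skips any non-table elements that sit between the heading and the table, and the None check avoids an AttributeError when the heading is not on the page.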

Use pandas to read the tables, and pass the id attribute to select the one you want:

import pandas as pd

url = 'https://basketball.realgm.com/player/Luke-Nelson/Summary/50483'
df = pd.read_html(url, attrs={'id':'table-1696'})[0]
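
If the id turns out to change between page loads, another route to the asker's end goal (a table in Python that can be exported to Excel) is to build the DataFrame from the BeautifulSoup table yourself. The sketch below runs on a trimmed-down copy of the table from the question, with only three columns and two rows kept; `table_to_dataframe` is a hypothetical helper name, and it assumes the table has a `<thead>` and a `<tbody>` as in the question's HTML.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Trimmed-down version of the table shown in the question.
html = """
<table id="table-1696">
  <thead><tr>
    <th><button>Season</button></th>
    <th><button>Team</button></th>
    <th><button>PTS</button></th>
  </tr></thead>
  <tbody>
    <tr><td>2012-13</td><td><a href="#">Team England U18 Men</a></td><td>15.67</td></tr>
    <tr><td>2017-18</td><td><a href="#">Coosur Real Betis</a></td><td>8.21</td></tr>
  </tbody>
</table>
"""


def table_to_dataframe(table):
    """Convert a bs4 <table> tag with <thead>/<tbody> into a DataFrame."""
    headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find("tbody").find_all("tr")
    ]
    return pd.DataFrame(rows, columns=headers)


soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="table-1696")
df = table_to_dataframe(table)
# Cell text is read as strings, so convert numeric columns explicitly.
df["PTS"] = pd.to_numeric(df["PTS"])
print(df)
```

From there, `df.to_excel("stats.xlsx", index=False)` should write the spreadsheet; the file name is just an example, and pandas needs an Excel writer such as openpyxl installed for that call to work.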
