HTML parsing a crowded webpage with BeautifulSoup

I'm having some trouble parsing basketball-reference. The webpage I'm looking at ( https://www.basketball-reference.com/contracts/IND.html ) seems very bloated, with tons of ad trackers and extraneous menus. I'm trying to extract the data table called "payroll," which has the following HTML source (buried within a bunch of other markup that looks like junk to me).

<table class="suppress_glossary sortable stats_table" id="contracts" data-cols-to-freeze=1><caption>Payroll Table</caption>
   <colgroup><col><col><col><col><col><col><col><col><col><col></colgroup>
   <thead>

      <tr class="over_header">
         <th aria-label="" data-stat="&nbsp;" colspan="2" class=" over_header center" >&nbsp;</th>
         <th aria-label="" data-stat="header_salary" colspan="6" class=" over_header center" >Salary</th>
         <th aria-label="" data-stat="&nbsp;" colspan="2" class=" over_header center" >&nbsp;</th>
      </tr>



      <tr>
         <th aria-label="Player" data-stat="player" scope="col" class=" poptip sort_default_asc center" >Player</th>
         <th aria-label="Age" data-stat="age_today" scope="col" class=" poptip center" >Age</th>
         <th aria-label="2019-20" data-stat="y1" scope="col" class=" poptip center" data-over-header="Salary" >2019-20</th>
         <th aria-label="2020-21" data-stat="y2" scope="col" class=" poptip center" data-over-header="Salary" >2020-21</th>
         <th aria-label="2021-22" data-stat="y3" scope="col" class=" poptip center" data-over-header="Salary" >2021-22</th>
         <th aria-label="2022-23" data-stat="y4" scope="col" class=" poptip center" data-over-header="Salary" >2022-23</th>
         <th aria-label="2023-24" data-stat="y5" scope="col" class=" poptip center" data-over-header="Salary" >2023-24</th>
         <th aria-label="2024-25" data-stat="y6" scope="col" class=" poptip center" data-over-header="Salary" >2024-25</th>
         <th aria-label="Signed Using" data-stat="signed_using" scope="col" class=" poptip sort_default_asc center" >Signed Using</th>
         <th aria-label="The amount of a player's remaining salary that is guaranteed." data-stat="remain_gtd" scope="col" class=" poptip center" data-tip="The amount of a player's remaining salary that is guaranteed." >Guaranteed</th>
      </tr>

   </thead>
   <tbody>
<tr ><th scope="row" class="left " data-append-csv="oladivi01" data-stat="player" csk="oladivi01" ><a href="/players/o/oladivi01.html">Victor Oladipo</a></th><td class="center " data-stat="age_today" >27</td><td class="right " data-stat="y1" csk="21000000" >$21,000,000</td><td class="right " data-stat="y2" csk="21000000" >$21,000,000</td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >1st Round Pick</td><td class="right " data-stat="remain_gtd" csk="42000000" >$42,000,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="brogdma01" data-stat="player" csk="brogdma01" ><a href="/players/b/brogdma01.html">Malcolm Brogdon</a></th><td class="center " data-stat="age_today" >26</td><td class="right " data-stat="y1" csk="20000000" >$20,000,000</td><td class="right " data-stat="y2" csk="20700000" >$20,700,000</td><td class="right " data-stat="y3" csk="21700000" >$21,700,000</td><td class="right " data-stat="y4" csk="22600000" >$22,600,000</td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left iz" data-stat="signed_using" ></td><td class="right " data-stat="remain_gtd" csk="85000000" >$85,000,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="turnemy01" data-stat="player" csk="turnemy01" ><a href="/players/t/turnemy01.html">Myles Turner</a></th><td class="center " data-stat="age_today" >23</td><td class="right " data-stat="y1" csk="18000000" >$18,000,000</td><td class="right " data-stat="y2" csk="18000000" >$18,000,000</td><td class="right " data-stat="y3" csk="18000000" >$18,000,000</td><td class="right " data-stat="y4" csk="18000000" >$18,000,000</td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >1st round pick</td><td class="right " data-stat="remain_gtd" csk="72000000" >$72,000,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="warretj01" data-stat="player" csk="warretj01" ><a href="/players/w/warretj01.html">T.J. Warren</a></th><td class="center " data-stat="age_today" >26</td><td class="right " data-stat="y1" csk="10810000" >$10,810,000</td><td class="right " data-stat="y2" csk="11750000" >$11,750,000</td><td class="right " data-stat="y3" csk="12690000" >$12,690,000</td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >1st Round Pick</td><td class="right " data-stat="remain_gtd" csk="35250000" >$35,250,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="lambje01" data-stat="player" csk="lambje01" ><a href="/players/l/lambje01.html">Jeremy Lamb</a></th><td class="center " data-stat="age_today" >27</td><td class="right " data-stat="y1" csk="10500000" >$10,500,000</td><td class="right " data-stat="y2" csk="10500000" >$10,500,000</td><td class="right " data-stat="y3" csk="10500000" >$10,500,000</td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left iz" data-stat="signed_using" ></td><td class="right " data-stat="remain_gtd" csk="31500000" >$31,500,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="mcderdo01" data-stat="player" csk="mcderdo01" ><a href="/players/m/mcderdo01.html">Doug McDermott</a></th><td class="center " data-stat="age_today" >27</td><td class="right " data-stat="y1" csk="7333334" >$7,333,334</td><td class="right " data-stat="y2" csk="7333333" >$7,333,333</td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left iz" data-stat="signed_using" ></td><td class="right " data-stat="remain_gtd" csk="14666667" >$14,666,667</td></tr>
<tr ><th scope="row" class="left " data-append-csv="holidju01" data-stat="player" csk="holidju01" ><a href="/players/h/holidju01.html">Justin Holiday</a></th><td class="center " data-stat="age_today" >30</td><td class="right " data-stat="y1" csk="4767000" >$4,767,000</td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Room Exception</td><td class="right " data-stat="remain_gtd" csk="4767000" >$4,767,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="sabondo01" data-stat="player" csk="sabondo01" ><a href="/players/s/sabondo01.html">Domantas Sabonis</a></th><td class="center " data-stat="age_today" >23</td><td class="right " data-stat="y1" csk="3529555" >$3,529,555</td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >1st Round pick</td><td class="right " data-stat="remain_gtd" csk="3529555" >$3,529,555</td></tr>
<tr ><th scope="row" class="left " data-append-csv="mccontj01" data-stat="player" csk="mccontj01" ><a href="/players/m/mccontj01.html">T.J. McConnell</a></th><td class="center " data-stat="age_today" >27</td><td class="right " data-stat="y1" csk="3500000" >$3,500,000</td><td class="right " data-stat="y2" csk="3500000" ><em>$3,500,000</em></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Cap Space</td><td class="right " data-stat="remain_gtd" csk="4500000" >$4,500,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="bitadgo01" data-stat="player" csk="bitadgo01" ><a href="/players/b/bitadgo01.html">Goga Bitadze</a></th><td class="center " data-stat="age_today" >20</td><td class="right " data-stat="y1" csk="2816760" >$2,816,760</td><td class="right " data-stat="y2" csk="2957520" >$2,957,520</td><td class="right salary-tm" data-stat="y3" csk="3098400" >$3,098,400</td><td class="right salary-tm" data-stat="y4" csk="4765339" >$4,765,339</td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >1st Round Pick</td><td class="right " data-stat="remain_gtd" csk="5774280" >$5,774,280</td></tr>
<tr ><th scope="row" class="left " data-append-csv="leaftj01" data-stat="player" csk="leaftj01" ><a href="/players/l/leaftj01.html">T.J. Leaf</a></th><td class="center " data-stat="age_today" >22</td><td class="right " data-stat="y1" csk="2813280" >$2,813,280</td><td class="right salary-tm" data-stat="y2" csk="4326825" >$4,326,825</td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >1st Round Pick</td><td class="right " data-stat="remain_gtd" csk="2813280" >$2,813,280</td></tr>
<tr ><th scope="row" class="left " data-append-csv="holidaa01" data-stat="player" csk="holidaa01" ><a href="/players/h/holidaa01.html">Aaron Holiday</a></th><td class="center " data-stat="age_today" >23</td><td class="right " data-stat="y1" csk="2239200" >$2,239,200</td><td class="right salary-tm" data-stat="y2" csk="2345640" >$2,345,640</td><td class="right salary-tm" data-stat="y3" csk="3980551" >$3,980,551</td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >1st Round Pick</td><td class="right " data-stat="remain_gtd" csk="2239200" >$2,239,200</td></tr>
<tr ><th scope="row" class="left " data-append-csv="sumneed01" data-stat="player" csk="sumneed01" ><a href="/players/s/sumneed01.html">Edmond Sumner</a></th><td class="center " data-stat="age_today" >23</td><td class="right " data-stat="y1" csk="2000000" >$2,000,000</td><td class="right " data-stat="y2" csk="2160000" >$2,160,000</td><td class="right salary-tm" data-stat="y3" csk="2320000" >$2,320,000</td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left iz" data-stat="signed_using" ></td><td class="right " data-stat="remain_gtd" csk="4160000" >$4,160,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="sampsja02" data-stat="player" csk="sampsja02" ><a href="/players/s/sampsja02.html">JaKarr Sampson</a></th><td class="center " data-stat="age_today" >26</td><td class="right " data-stat="y1" csk="1737145" >$1,737,145</td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Minimum Salary</td><td class="right " data-stat="remain_gtd" csk="1737145" >$1,737,145</td></tr>
<tr ><th scope="row" class="left " data-append-csv="johnsal02" data-stat="player" csk="johnsal02" ><a href="/players/j/johnsal02.html">Alize Johnson</a></th><td class="center " data-stat="age_today" >23</td><td class="right " data-stat="y1" csk="1416852" >$1,416,852</td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Minimum Salary</td><td class="right " data-stat="remain_gtd" csk="1416852" >$1,416,852</td></tr>
<tr ><th scope="row" class="left " data-append-csv="mitrona01" data-stat="player" csk="mitrona01" ><a href="/players/m/mitrona01.html">Naz Mitrou-Long</a></th><td class="center " data-stat="age_today" >26</td><td class="right " data-stat="y1" >&nbsp;</td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Two-Way Contract</td><td class="right " data-stat="remain_gtd" >&nbsp;</td></tr>
<tr ><th scope="row" class="left " data-append-csv="wilcocj01" data-stat="player" csk="wilcocj01" ><a href="/players/w/wilcocj01.html">C.J. Wilcox</a></th><td class="center " data-stat="age_today" >28</td><td class="right iz" data-stat="y1" ></td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Minimum Salary</td><td class="right iz" data-stat="remain_gtd" ></td></tr>
<tr ><th scope="row" class="left " data-append-csv="brimaam01" data-stat="player" csk="brimaam01" ><a href="/players/b/brimaam01.html">Amida Brimah</a></th><td class="center " data-stat="age_today" >25</td><td class="right iz" data-stat="y1" ></td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Minimum Salary</td><td class="right iz" data-stat="remain_gtd" ></td></tr>
<tr ><th scope="row" class="left " data-append-csv="gantja01" data-stat="player" csk="gantja01" ><a href="/players/g/gantja01.html">Jakeenan Gant</a></th><td class="center " data-stat="age_today" >23</td><td class="right iz" data-stat="y1" ></td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Minimum Salary</td><td class="right iz" data-stat="remain_gtd" ></td></tr>
<tr ><th scope="row" class="left " data-append-csv="bowenbr02" data-stat="player" csk="bowenbr02" ><a href="/players/b/bowenbr02.html">Brian Bowen</a></th><td class="center " data-stat="age_today" >21</td><td class="right " data-stat="y1" >&nbsp;</td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Two-Way Contract</td><td class="right " data-stat="remain_gtd" >&nbsp;</td></tr>
<tr class='thead'><td colspan='10'></td></tr>
<tr class="partial_table" ><th scope="row" class="left " data-append-csv="ellismo01" data-stat="player" csk="ellismo01" ><a href="/players/e/ellismo01.html"><em>Monta Ellis</em></a></th><td class="center " data-stat="age_today" >33</td><td class="right " data-stat="y1" csk="2245400" >$2,245,400</td><td class="right " data-stat="y2" csk="2245400" >$2,245,400</td><td class="right " data-stat="y3" csk="2245400" >$2,245,400</td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left iz" data-stat="signed_using" ></td><td class="right " data-stat="remain_gtd" csk="6736200" >$6,736,200</td></tr>

   </tbody>
   <tfoot><tr ><th scope="row" class="left " data-stat="player" >Team Totals</th><td class="center iz" data-stat="age_today" ></td><td class="right " data-stat="y1" >$114,708,526</td><td class="right " data-stat="y2" >$106,818,718</td><td class="right " data-stat="y3" >$74,534,351</td><td class="right " data-stat="y4" >$45,365,339</td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left iz" data-stat="signed_using" ></td><td class="right " data-stat="remain_gtd" >$318,090,179</td></tr>

   </tfoot>

</table>

When I run the following Python code, the variable l is an empty list.

#import beautiful soup, requests, time, pandas
from bs4 import BeautifulSoup
import requests

#assign the URL for contract scraping
url = 'https://www.basketball-reference.com/teams/IND.html'

#pull html from page
page = requests.get(url)

#format html using BS
soup = BeautifulSoup(page.text, "html.parser")

#take only table rows
l = soup.find_all('a',{'class':'left'})

print(l)

I am wondering if I don't have the correct argument for class. Or is there another reason that print(l) is returning []?

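The left class you are after is not set on the anchor tags; it is set on the th and td cells that contain them, which is why find_all('a', {'class': 'left'}) matches nothing. Note also that your code requests /teams/IND.html rather than the /contracts/IND.html page you describe. Try the code below.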

from bs4 import BeautifulSoup
import requests
r=requests.get("https://www.basketball-reference.com/contracts/IND.html")
soup=BeautifulSoup(r.text,'html.parser')
l=soup.select('.left > a')
print(l)
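
The difference between the two lookups can be checked offline. The following is a minimal sketch that uses a trimmed copy of one row from the table in the question instead of the live page, so the inline HTML string and the variable names are illustrative assumptions:

```python
from bs4 import BeautifulSoup

# Trimmed copy of one row from the question's table (not the live page).
html = """
<table id="contracts"><tbody>
<tr><th scope="row" class="left " data-stat="player"><a href="/players/o/oladivi01.html">Victor Oladipo</a></th>
<td class="left " data-stat="signed_using">1st Round Pick</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")

# The original lookup: <a> tags that themselves carry class "left".
anchors_with_class = soup.find_all("a", {"class": "left"})

# The class is actually on the surrounding <th>/<td> cells.
cells_with_class = soup.find_all(["th", "td"], {"class": "left"})

# CSS child combinator: <a> tags whose direct parent has class "left".
anchors_in_left = soup.select(".left > a")

print(len(anchors_with_class), len(cells_with_class), len(anchors_in_left))
```

The first lookup finds nothing because no anchor has a class attribute at all, while the child selector finds the anchor through its parent cell.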

If you only want the names of the players:

from bs4 import BeautifulSoup
import requests
r=requests.get("https://www.basketball-reference.com/contracts/IND.html")
soup=BeautifulSoup(r.text,'html.parser')
l=[item.text for item in soup.select('.left > a')]
print(l)

Output :

['Victor Oladipo', 'Malcolm Brogdon', 'Myles Turner', 'T.J. Warren', 'Jeremy Lamb', 'Doug McDermott', 'Justin Holiday', 'Domantas Sabonis', 'T.J. McConnell', 'Goga Bitadze', 'T.J. Leaf', 'Aaron Holiday', 'Edmond Sumner', 'JaKarr Sampson', 'Alize Johnson', 'Brian Bowen', 'Naz Mitrou-Long', 'C.J. Wilcox', 'Amida Brimah', 'Jakeenan Gant', 'Monta Ellis']
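
If you need the salary figures as well as the names, the data-stat and csk attributes visible in the question's HTML can be used as column keys and numeric values. This is only a sketch under that assumption: it runs on a trimmed inline copy of two rows rather than the live page, and the record layout is a hypothetical choice, not something the site defines.

```python
from bs4 import BeautifulSoup

# Trimmed copy of two player rows and the separator row from the question.
html = """
<table id="contracts"><tbody>
<tr><th scope="row" class="left " data-stat="player" csk="oladivi01"><a href="/players/o/oladivi01.html">Victor Oladipo</a></th><td class="center " data-stat="age_today">27</td><td class="right " data-stat="y1" csk="21000000">$21,000,000</td><td class="right iz" data-stat="y3"></td><td class="left " data-stat="signed_using">1st Round Pick</td></tr>
<tr class='thead'><td colspan='10'></td></tr>
<tr class="partial_table"><th scope="row" class="left " data-stat="player" csk="ellismo01"><a href="/players/e/ellismo01.html"><em>Monta Ellis</em></a></th><td class="center " data-stat="age_today">33</td><td class="right " data-stat="y1" csk="2245400">$2,245,400</td><td class="right iz" data-stat="y3"></td><td class="left iz" data-stat="signed_using"></td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="contracts")

rows = []
for tr in table.select("tbody tr"):
    player_cell = tr.find("th", {"data-stat": "player"})
    if player_cell is None:
        # Separator rows such as <tr class='thead'> have no player cell.
        continue
    record = {"player": player_cell.get_text(strip=True)}
    for td in tr.find_all("td"):
        csk = td.get("csk")
        if csk and csk.isdigit():
            # csk holds the unformatted number, e.g. "21000000".
            record[td["data-stat"]] = int(csk)
        else:
            # Fall back to the cell text; empty cells become None.
            record[td["data-stat"]] = td.get_text(strip=True) or None
    rows.append(record)

for record in rows:
    print(record)
```

Reading csk avoids having to strip the dollar sign and commas from the displayed text.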

You say you want the payroll table. You can use pandas read_html for this; it returns a list of DataFrames, one per table it finds, and [0] takes the first:

import pandas as pd

table = pd.read_html('https://www.basketball-reference.com/contracts/IND.html')[0]
print(table)
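
Relying on position [0] can break if another table appears first, and the two header rows in the question's HTML produce two-level column labels. Below is a hedged sketch of one way to handle both, assuming the table keeps id="contracts" as shown in the question. It parses a small inline HTML string (an illustrative stand-in for the page) and assumes a parser backend for read_html such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Illustrative stand-in for the page: an unrelated table plus a cut-down payroll table.
html = """
<table id="other"><tr><th>x</th></tr><tr><td>1</td></tr></table>
<table id="contracts">
  <thead>
    <tr><th colspan="2"></th><th colspan="2">Salary</th></tr>
    <tr><th>Player</th><th>Age</th><th>2019-20</th><th>2020-21</th></tr>
  </thead>
  <tbody>
    <tr><th>Victor Oladipo</th><td>27</td><td>$21,000,000</td><td>$21,000,000</td></tr>
    <tr><th>Justin Holiday</th><td>30</td><td>$4,767,000</td><td></td></tr>
  </tbody>
</table>
"""

# attrs restricts the match to the table with the given id.
payroll = pd.read_html(StringIO(html), attrs={"id": "contracts"})[0]

# Keep only the bottom header row ("Player", "Age", "2019-20", ...).
payroll.columns = payroll.columns.get_level_values(-1)

# Turn "$21,000,000" into 21000000; empty cells stay missing (NaN).
for col in ["2019-20", "2020-21"]:
    payroll[col] = pd.to_numeric(payroll[col].str.replace(r"[$,]", "", regex=True))

print(payroll)
```

With the real page you would pass the URL (or the downloaded HTML wrapped in StringIO) in place of the inline string.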
