简体   繁体   English

HTML 使用 BeautifulSoup 解析拥挤的网页

[英]HTML parsing a crowded webpage with BeautifulSoup

I'm having some trouble parsing basketball-reference.我在解析篮球参考时遇到了一些问题。 The webpage I'm looking at ( https://www.basketball-reference.com/contracts/IND.html ) seems very bloated, with tons of ad trackers and extraneous menus.我正在查看的网页( https://www.basketball-reference.com/contracts/IND.html )似乎非常臃肿,有大量的广告跟踪器和无关的菜单。 I'm trying to extract the data table called "payroll," which has the following html source code (burried within a bunch of other junk -- or at least it looks like junk to me).我正在尝试提取名为“payroll”的数据表,它具有以下 html 源代码(埋在一堆其他垃圾中——或者至少在我看来它看起来像垃圾)。

<table class="suppress_glossary sortable stats_table" id="contracts" data-cols-to-freeze=1><caption>Payroll Table</caption>
   <colgroup><col><col><col><col><col><col><col><col><col><col></colgroup>
   <thead>

      <tr class="over_header">
         <th aria-label="" data-stat="&nbsp;" colspan="2" class=" over_header center" >&nbsp;</th>
         <th aria-label="" data-stat="header_salary" colspan="6" class=" over_header center" >Salary</th>
         <th aria-label="" data-stat="&nbsp;" colspan="2" class=" over_header center" >&nbsp;</th>
      </tr>



      <tr>
         <th aria-label="Player" data-stat="player" scope="col" class=" poptip sort_default_asc center" >Player</th>
         <th aria-label="Age" data-stat="age_today" scope="col" class=" poptip center" >Age</th>
         <th aria-label="2019-20" data-stat="y1" scope="col" class=" poptip center" data-over-header="Salary" >2019-20</th>
         <th aria-label="2020-21" data-stat="y2" scope="col" class=" poptip center" data-over-header="Salary" >2020-21</th>
         <th aria-label="2021-22" data-stat="y3" scope="col" class=" poptip center" data-over-header="Salary" >2021-22</th>
         <th aria-label="2022-23" data-stat="y4" scope="col" class=" poptip center" data-over-header="Salary" >2022-23</th>
         <th aria-label="2023-24" data-stat="y5" scope="col" class=" poptip center" data-over-header="Salary" >2023-24</th>
         <th aria-label="2024-25" data-stat="y6" scope="col" class=" poptip center" data-over-header="Salary" >2024-25</th>
         <th aria-label="Signed Using" data-stat="signed_using" scope="col" class=" poptip sort_default_asc center" >Signed Using</th>
         <th aria-label="The amount of a player's remaining salary that is guaranteed." data-stat="remain_gtd" scope="col" class=" poptip center" data-tip="The amount of a player's remaining salary that is guaranteed." >Guaranteed</th>
      </tr>

   </thead>
   <tbody>
<tr ><th scope="row" class="left " data-append-csv="oladivi01" data-stat="player" csk="oladivi01" ><a href="/players/o/oladivi01.html">Victor Oladipo</a></th><td class="center " data-stat="age_today" >27</td><td class="right " data-stat="y1" csk="21000000" >$21,000,000</td><td class="right " data-stat="y2" csk="21000000" >$21,000,000</td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >1st Round Pick</td><td class="right " data-stat="remain_gtd" csk="42000000" >$42,000,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="brogdma01" data-stat="player" csk="brogdma01" ><a href="/players/b/brogdma01.html">Malcolm Brogdon</a></th><td class="center " data-stat="age_today" >26</td><td class="right " data-stat="y1" csk="20000000" >$20,000,000</td><td class="right " data-stat="y2" csk="20700000" >$20,700,000</td><td class="right " data-stat="y3" csk="21700000" >$21,700,000</td><td class="right " data-stat="y4" csk="22600000" >$22,600,000</td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left iz" data-stat="signed_using" ></td><td class="right " data-stat="remain_gtd" csk="85000000" >$85,000,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="turnemy01" data-stat="player" csk="turnemy01" ><a href="/players/t/turnemy01.html">Myles Turner</a></th><td class="center " data-stat="age_today" >23</td><td class="right " data-stat="y1" csk="18000000" >$18,000,000</td><td class="right " data-stat="y2" csk="18000000" >$18,000,000</td><td class="right " data-stat="y3" csk="18000000" >$18,000,000</td><td class="right " data-stat="y4" csk="18000000" >$18,000,000</td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >1st round pick</td><td class="right " data-stat="remain_gtd" csk="72000000" >$72,000,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="warretj01" data-stat="player" csk="warretj01" ><a href="/players/w/warretj01.html">T.J. Warren</a></th><td class="center " data-stat="age_today" >26</td><td class="right " data-stat="y1" csk="10810000" >$10,810,000</td><td class="right " data-stat="y2" csk="11750000" >$11,750,000</td><td class="right " data-stat="y3" csk="12690000" >$12,690,000</td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >1st Round Pick</td><td class="right " data-stat="remain_gtd" csk="35250000" >$35,250,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="lambje01" data-stat="player" csk="lambje01" ><a href="/players/l/lambje01.html">Jeremy Lamb</a></th><td class="center " data-stat="age_today" >27</td><td class="right " data-stat="y1" csk="10500000" >$10,500,000</td><td class="right " data-stat="y2" csk="10500000" >$10,500,000</td><td class="right " data-stat="y3" csk="10500000" >$10,500,000</td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left iz" data-stat="signed_using" ></td><td class="right " data-stat="remain_gtd" csk="31500000" >$31,500,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="mcderdo01" data-stat="player" csk="mcderdo01" ><a href="/players/m/mcderdo01.html">Doug McDermott</a></th><td class="center " data-stat="age_today" >27</td><td class="right " data-stat="y1" csk="7333334" >$7,333,334</td><td class="right " data-stat="y2" csk="7333333" >$7,333,333</td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left iz" data-stat="signed_using" ></td><td class="right " data-stat="remain_gtd" csk="14666667" >$14,666,667</td></tr>
<tr ><th scope="row" class="left " data-append-csv="holidju01" data-stat="player" csk="holidju01" ><a href="/players/h/holidju01.html">Justin Holiday</a></th><td class="center " data-stat="age_today" >30</td><td class="right " data-stat="y1" csk="4767000" >$4,767,000</td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Room Exception</td><td class="right " data-stat="remain_gtd" csk="4767000" >$4,767,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="sabondo01" data-stat="player" csk="sabondo01" ><a href="/players/s/sabondo01.html">Domantas Sabonis</a></th><td class="center " data-stat="age_today" >23</td><td class="right " data-stat="y1" csk="3529555" >$3,529,555</td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >1st Round pick</td><td class="right " data-stat="remain_gtd" csk="3529555" >$3,529,555</td></tr>
<tr ><th scope="row" class="left " data-append-csv="mccontj01" data-stat="player" csk="mccontj01" ><a href="/players/m/mccontj01.html">T.J. McConnell</a></th><td class="center " data-stat="age_today" >27</td><td class="right " data-stat="y1" csk="3500000" >$3,500,000</td><td class="right " data-stat="y2" csk="3500000" ><em>$3,500,000</em></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Cap Space</td><td class="right " data-stat="remain_gtd" csk="4500000" >$4,500,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="bitadgo01" data-stat="player" csk="bitadgo01" ><a href="/players/b/bitadgo01.html">Goga Bitadze</a></th><td class="center " data-stat="age_today" >20</td><td class="right " data-stat="y1" csk="2816760" >$2,816,760</td><td class="right " data-stat="y2" csk="2957520" >$2,957,520</td><td class="right salary-tm" data-stat="y3" csk="3098400" >$3,098,400</td><td class="right salary-tm" data-stat="y4" csk="4765339" >$4,765,339</td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >1st Round Pick</td><td class="right " data-stat="remain_gtd" csk="5774280" >$5,774,280</td></tr>
<tr ><th scope="row" class="left " data-append-csv="leaftj01" data-stat="player" csk="leaftj01" ><a href="/players/l/leaftj01.html">T.J. Leaf</a></th><td class="center " data-stat="age_today" >22</td><td class="right " data-stat="y1" csk="2813280" >$2,813,280</td><td class="right salary-tm" data-stat="y2" csk="4326825" >$4,326,825</td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >1st Round Pick</td><td class="right " data-stat="remain_gtd" csk="2813280" >$2,813,280</td></tr>
<tr ><th scope="row" class="left " data-append-csv="holidaa01" data-stat="player" csk="holidaa01" ><a href="/players/h/holidaa01.html">Aaron Holiday</a></th><td class="center " data-stat="age_today" >23</td><td class="right " data-stat="y1" csk="2239200" >$2,239,200</td><td class="right salary-tm" data-stat="y2" csk="2345640" >$2,345,640</td><td class="right salary-tm" data-stat="y3" csk="3980551" >$3,980,551</td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >1st Round Pick</td><td class="right " data-stat="remain_gtd" csk="2239200" >$2,239,200</td></tr>
<tr ><th scope="row" class="left " data-append-csv="sumneed01" data-stat="player" csk="sumneed01" ><a href="/players/s/sumneed01.html">Edmond Sumner</a></th><td class="center " data-stat="age_today" >23</td><td class="right " data-stat="y1" csk="2000000" >$2,000,000</td><td class="right " data-stat="y2" csk="2160000" >$2,160,000</td><td class="right salary-tm" data-stat="y3" csk="2320000" >$2,320,000</td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left iz" data-stat="signed_using" ></td><td class="right " data-stat="remain_gtd" csk="4160000" >$4,160,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="sampsja02" data-stat="player" csk="sampsja02" ><a href="/players/s/sampsja02.html">JaKarr Sampson</a></th><td class="center " data-stat="age_today" >26</td><td class="right " data-stat="y1" csk="1737145" >$1,737,145</td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Minimum Salary</td><td class="right " data-stat="remain_gtd" csk="1737145" >$1,737,145</td></tr>
<tr ><th scope="row" class="left " data-append-csv="johnsal02" data-stat="player" csk="johnsal02" ><a href="/players/j/johnsal02.html">Alize Johnson</a></th><td class="center " data-stat="age_today" >23</td><td class="right " data-stat="y1" csk="1416852" >$1,416,852</td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Minimum Salary</td><td class="right " data-stat="remain_gtd" csk="1416852" >$1,416,852</td></tr>
<tr ><th scope="row" class="left " data-append-csv="mitrona01" data-stat="player" csk="mitrona01" ><a href="/players/m/mitrona01.html">Naz Mitrou-Long</a></th><td class="center " data-stat="age_today" >26</td><td class="right " data-stat="y1" >&nbsp;</td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Two-Way Contract</td><td class="right " data-stat="remain_gtd" >&nbsp;</td></tr>
<tr ><th scope="row" class="left " data-append-csv="wilcocj01" data-stat="player" csk="wilcocj01" ><a href="/players/w/wilcocj01.html">C.J. Wilcox</a></th><td class="center " data-stat="age_today" >28</td><td class="right iz" data-stat="y1" ></td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Minimum Salary</td><td class="right iz" data-stat="remain_gtd" ></td></tr>
<tr ><th scope="row" class="left " data-append-csv="brimaam01" data-stat="player" csk="brimaam01" ><a href="/players/b/brimaam01.html">Amida Brimah</a></th><td class="center " data-stat="age_today" >25</td><td class="right iz" data-stat="y1" ></td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Minimum Salary</td><td class="right iz" data-stat="remain_gtd" ></td></tr>
<tr ><th scope="row" class="left " data-append-csv="gantja01" data-stat="player" csk="gantja01" ><a href="/players/g/gantja01.html">Jakeenan Gant</a></th><td class="center " data-stat="age_today" >23</td><td class="right iz" data-stat="y1" ></td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Minimum Salary</td><td class="right iz" data-stat="remain_gtd" ></td></tr>
<tr ><th scope="row" class="left " data-append-csv="bowenbr02" data-stat="player" csk="bowenbr02" ><a href="/players/b/bowenbr02.html">Brian Bowen</a></th><td class="center " data-stat="age_today" >21</td><td class="right " data-stat="y1" >&nbsp;</td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Two-Way Contract</td><td class="right " data-stat="remain_gtd" >&nbsp;</td></tr>
<tr class='thead'><td colspan='10'></td></tr>
<tr class="partial_table" ><th scope="row" class="left " data-append-csv="ellismo01" data-stat="player" csk="ellismo01" ><a href="/players/e/ellismo01.html"><em>Monta Ellis</em></a></th><td class="center " data-stat="age_today" >33</td><td class="right " data-stat="y1" csk="2245400" >$2,245,400</td><td class="right " data-stat="y2" csk="2245400" >$2,245,400</td><td class="right " data-stat="y3" csk="2245400" >$2,245,400</td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left iz" data-stat="signed_using" ></td><td class="right " data-stat="remain_gtd" csk="6736200" >$6,736,200</td></tr>

   </tbody>
   <tfoot><tr ><th scope="row" class="left " data-stat="player" >Team Totals</th><td class="center iz" data-stat="age_today" ></td><td class="right " data-stat="y1" >$114,708,526</td><td class="right " data-stat="y2" >$106,818,718</td><td class="right " data-stat="y3" >$74,534,351</td><td class="right " data-stat="y4" >$45,365,339</td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left iz" data-stat="signed_using" ></td><td class="right " data-stat="remain_gtd" >$318,090,179</td></tr>

   </tfoot>

</table>

When I run the following python code, the variable l is null.当我运行以下 python 代码时,变量 l 为 null。

#import beautiful soup, requests, time, pandas
from bs4 import BeautifulSoup
import requests

#assign the URL for contract scraping
url = 'https://www.basketball-reference.com/teams/IND.html'

#pull html from page
page = requests.get(url)

#format html using BS
soup = BeautifulSoup(page.text, "html.parser")

#take only table rows
l = soup.find_all('a',{'class':'left'})

print(l)

I am wondering if I don't have the correct argument for class.我想知道我是否没有正确的 class 参数。 Or is there another reason that print(l) is returning []?还是 print(l) 返回 [] 的另一个原因?

The left class you are after is not associated with anchor tag that is why you are getting zero record.Try below code.您所追求的左侧 class 与锚标签无关,这就是您获得零记录的原因。试试下面的代码。

from bs4 import BeautifulSoup
import requests
r=requests.get("https://www.basketball-reference.com/contracts/IND.html")
soup=BeautifulSoup(r.text,'html.parser')
l=soup.select('.left > a')
print(l)

If you want to fetch the name of the players.如果你想获取玩家的名字。

from bs4 import BeautifulSoup
import requests
r=requests.get("https://www.basketball-reference.com/contracts/IND.html")
soup=BeautifulSoup(r.text,'html.parser')
l=[item.text for item in soup.select('.left > a')]
print(l)

Output : Output

['Victor Oladipo', 'Malcolm Brogdon', 'Myles Turner', 'T.J. Warren', 'Jeremy Lamb', 'Doug McDermott', 'Justin Holiday', 'Domantas Sabonis', 'T.J. McConnell', 'Goga Bitadze', 'T.J. Leaf', 'Aaron Holiday', 'Edmond Sumner', 'JaKarr Sampson', 'Alize Johnson', 'Brian Bowen', 'Naz Mitrou-Long', 'C.J. Wilcox', 'Amida Brimah', 'Jakeenan Gant', 'Monta Ellis']

You say you want the payroll table.你说你想要工资表。 You can use pandas read_html for this您可以为此使用 pandas read_html

import pandas as pd

table = pd.read_html('https://www.basketball-reference.com/contracts/IND.html')[0]
print(table)

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM