Beautiful Soup Parsing Stats

I'm trying to extract stats from the HTML below and turn them into something like awayPlayers = [['Carmelo Anthony', '30', '19', '5', '3'], ['Kristaps Porzingis', ...]] so that I can display the data in my own format and work with it.

I have the basics of BeautifulSoup down, but I'm a bit lost on this project because the stats I want are all wrapped in plain td tags. Any help is much appreciated!

 <div class="standings"> 
     <h3 class="standings-title">NYK</h3> 
     <div class="awayTeam-boxscore"> 
      <table> 
       <tbody>
        <tr class="table-header"> 
         <td>Name</td> 
         <td>MIN</td> 
         <td>PTS</td> 
         <td>REB</td> 
         <td>AST</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/carmelo_anthony/index.html?locale=en_US">C.Anthony</a></td> 
         <td>30</td> 
         <td>19</td> 
         <td>5</td> 
         <td>3</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/kristaps_porzingis/index.html?locale=en_US">K.Porzingis</a></td> 
         <td>33</td> 
         <td>16</td> 
         <td>7</td> 
         <td>0</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/joakim_noah/index.html?locale=en_US">J.Noah</a></td> 
         <td>20</td> 
         <td>0</td> 
         <td>6</td> 
         <td>3</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/courtney_lee/index.html?locale=en_US">C.Lee</a></td> 
         <td>20</td> 
         <td>0</td> 
         <td>3</td> 
         <td>0</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/derrick_rose/index.html?locale=en_US">D.Rose</a></td> 
         <td>30</td> 
         <td>17</td> 
         <td>3</td> 
         <td>1</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/brandon_jennings/index.html?locale=en_US">B.Jennings</a></td> 
         <td>21</td> 
         <td>7</td> 
         <td>3</td> 
         <td>5</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/kyle_oquinn/index.html?locale=en_US">K.O'Quinn</a></td> 
         <td>15</td> 
         <td>2</td> 
         <td>5</td> 
         <td>1</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/lance_thomas/index.html?locale=en_US">L.Thomas</a></td> 
         <td>17</td> 
         <td>2</td> 
         <td>1</td> 
         <td>1</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/justin_holiday/index.html?locale=en_US">J.Holiday</a></td> 
         <td>26</td> 
         <td>8</td> 
         <td>6</td> 
         <td>2</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/willy_hernangomez/index.html?locale=en_US">W.Hernangomez</a></td> 
         <td>9</td> 
         <td>4</td> 
         <td>1</td> 
         <td>0</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/sasha_vujacic/index.html?locale=en_US">S.Vujacic</a></td> 
         <td>3</td> 
         <td>1</td> 
         <td>0</td> 
         <td>1</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/mindaugas_kuzminskas/index.html?locale=en_US">M.Kuzminskas</a></td> 
         <td>9</td> 
         <td>7</td> 
         <td>1</td> 
         <td>0</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/ron_baker/index.html?locale=en_US">R.Baker</a></td> 
         <td>7</td> 
         <td>5</td> 
         <td>1</td> 
         <td>0</td> 
        </tr> 
       </tbody>
      </table> 
     </div> 
     <h3 class="standings-title">CLE</h3> 
     <div class="homeTeam-boxscore"> 
      <table> 
       <tbody>
        <tr class="table-header"> 
         <td>Name</td> 
         <td>MIN</td> 
         <td>PTS</td> 
         <td>REB</td> 
         <td>AST</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/lebron_james/index.html?locale=en_US">L.James</a></td> 
         <td>32</td> 
         <td>19</td> 
         <td>11</td> 
         <td>14</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/kevin_love/index.html?locale=en_US">K.Love</a></td> 
         <td>25</td> 
         <td>23</td> 
         <td>12</td> 
         <td>2</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/tristan_t_thompson/index.html?locale=en_US">T.Thompson</a></td> 
         <td>22</td> 
         <td>0</td> 
         <td>6</td> 
         <td>0</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/jr_smith/index.html?locale=en_US">J.Smith</a></td> 
         <td>25</td> 
         <td>8</td> 
         <td>3</td> 
         <td>2</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/kyrie_irving/index.html?locale=en_US">K.Irving</a></td> 
         <td>30</td> 
         <td>29</td> 
         <td>2</td> 
         <td>4</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/richard_jefferson/index.html?locale=en_US">R.Jefferson</a></td> 
         <td>26</td> 
         <td>13</td> 
         <td>4</td> 
         <td>1</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/iman_shumpert/index.html?locale=en_US">I.Shumpert</a></td> 
         <td>14</td> 
         <td>2</td> 
         <td>2</td> 
         <td>3</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/mike_dunleavy/index.html?locale=en_US">M.Dunleavy</a></td> 
         <td>23</td> 
         <td>4</td> 
         <td>4</td> 
         <td>2</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/channing_frye/index.html?locale=en_US">C.Frye</a></td> 
         <td>14</td> 
         <td>6</td> 
         <td>4</td> 
         <td>0</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/jordan_mcrae/index.html?locale=en_US">J.McRae</a></td> 
         <td>6</td> 
         <td>2</td> 
         <td>0</td> 
         <td>0</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/deandre_liggins/index.html?locale=en_US">D.Liggins</a></td> 
         <td>12</td> 
         <td>4</td> 
         <td>3</td> 
         <td>3</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/chris_andersen/index.html?locale=en_US">C.Andersen</a></td> 
         <td>6</td> 
         <td>2</td> 
         <td>0</td> 
         <td>0</td> 
        </tr> 
        <tr> 
         <td><a href="/feature/player/james_jones/index.html?locale=en_US">J.Jones</a></td> 
         <td>6</td> 
         <td>5</td> 
         <td>0</td> 
         <td>0</td> 
        </tr> 
       </tbody>
      </table> 
     </div> 
    </div> 
   </section> 
   <footer> 
    <nav> 
     <div class="footer-nav"> 
      <div class="access-key-navigation"> 
       <div>
        <span>0.</span>
        <a accesskey="0" href="/feature/index.html?locale=en_US">Home</a>
       </div> 
       <div>
        <span>1.</span>
        <a accesskey="1" href="/feature/about/index.html?locale=en_US">About</a>
       </div> 
       <div class="selected">
        <span>2.</span>
        <a accesskey="2" href="/feature/scores/index.html?locale=en_US">Scores</a>
       </div> 
       <div>
        <span>3.</span>
        <a accesskey="3" href="/feature/news/index.html?locale=en_US">News</a>
       </div> 
       <div>
        <span>4.</span>
        <a accesskey="4" href="/feature/players/index.html?locale=en_US">Players</a>
       </div> 
       <div>
        <span>5.</span>
        <a accesskey="5" href="/feature/season/leaders.html?locale=en_US">Leaders</a>
       </div> 
       <div>
        <span>6.</span>
        <a accesskey="6" href="/feature/standings/index.html?locale=en_US">Standings</a>
       </div> 
       <div>
        <span>7.</span>
        <a accesskey="7" href="/feature/teams/index.html?locale=en_US">Teams</a>
       </div> 
      </div> 
      <div class="copyright">
       © 2016 NBA Media Ventures, LLC. All rights reserved
      </div> 
     </div> 
    </nav> 
   </footer> 
  </div> 

Each row represents one player, so you have to iterate through the <tr> tags and extract the data inside. Here is how:

from bs4 import BeautifulSoup

# replace with the html
html_doc = """<div> ... </div>"""    

soup = BeautifulSoup(html_doc, "html.parser")

# this is where we store the extracted data
players = []

# iterates through the table rows
for row in soup.find_all('tr'):
    # take the stripped text of each <td> cell; this is more robust than
    # splitting row.get_text() on "\n", which depends on the page's whitespace
    player_data = [cell.get_text(strip=True) for cell in row.find_all('td')]
    players.append(player_data)

# remove the first entry, which is the first table's header row
# (the second table's header row stays in the list)
del players[0]

print(players) 

Output:

[['C.Anthony', '30', '19', '5', '3'], ['K.Porzingis', '33', '16', '7', '0'], ['J.Noah', '20', '0', '6', '3'], ['C.Lee', '20', '0', '3', '0'], ['D.Rose', '30', '17', '3', '1'], ['B.Jennings', '21', '7', '3', '5'], ["K.O'Quinn", '15', '2', '5', '1'], ['L.Thomas', '17', '2', '1', '1'], ['J.Holiday', '26', '8', '6', '2'], ['W.Hernangomez', '9', '4', '1', '0'], ['S.Vujacic', '3', '1', '0', '1'], ['M.Kuzminskas', '9', '7', '1', '0'], ['R.Baker', '7', '5', '1', '0'], ['Name', 'MIN', 'PTS', 'REB', 'AST'], ['L.James', '32', '19', '11', '14'], ['K.Love', '25', '23', '12', '2'], ['T.Thompson', '22', '0', '6', '0'], ['J.Smith', '25', '8', '3', '2'], ['K.Irving', '30', '29', '2', '4'], ['R.Jefferson', '26', '13', '4', '1'], ['I.Shumpert', '14', '2', '2', '3'], ['M.Dunleavy', '23', '4', '4', '2'], ['C.Frye', '14', '6', '4', '0'], ['J.McRae', '6', '2', '0', '0'], ['D.Liggins', '12', '4', '3', '3'], ['C.Andersen', '6', '2', '0', '0'], ['J.Jones', '6', '5', '0', '0']]
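
The output above mixes both teams into one list and still contains the second table's header row. Since the question asks for a separate awayPlayers list, here is a minimal sketch of one way to split the rows by team. It assumes the page keeps the awayTeam-boxscore and homeTeam-boxscore div classes and marks header rows with class="table-header", as in the HTML from the question. The helper name team_rows and the shortened html_doc are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page in the question (two players per team).
html_doc = """
<div class="standings">
  <div class="awayTeam-boxscore">
    <table><tbody>
      <tr class="table-header"><td>Name</td><td>MIN</td><td>PTS</td><td>REB</td><td>AST</td></tr>
      <tr><td><a href="/feature/player/carmelo_anthony/index.html?locale=en_US">C.Anthony</a></td><td>30</td><td>19</td><td>5</td><td>3</td></tr>
      <tr><td><a href="/feature/player/kristaps_porzingis/index.html?locale=en_US">K.Porzingis</a></td><td>33</td><td>16</td><td>7</td><td>0</td></tr>
    </tbody></table>
  </div>
  <div class="homeTeam-boxscore">
    <table><tbody>
      <tr class="table-header"><td>Name</td><td>MIN</td><td>PTS</td><td>REB</td><td>AST</td></tr>
      <tr><td><a href="/feature/player/lebron_james/index.html?locale=en_US">L.James</a></td><td>32</td><td>19</td><td>11</td><td>14</td></tr>
      <tr><td><a href="/feature/player/kevin_love/index.html?locale=en_US">K.Love</a></td><td>25</td><td>23</td><td>12</td><td>2</td></tr>
    </tbody></table>
  </div>
</div>
"""


def team_rows(soup, div_class):
    """Return one list of cell strings per player row inside the given div."""
    container = soup.find("div", class_=div_class)
    if container is None:
        return []
    rows = []
    for tr in container.find_all("tr"):
        # Skip header rows, which this page marks with class="table-header".
        if "table-header" in tr.get("class", []):
            continue
        rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
    return rows


soup = BeautifulSoup(html_doc, "html.parser")
awayPlayers = team_rows(soup, "awayTeam-boxscore")
homePlayers = team_rows(soup, "homeTeam-boxscore")

print(awayPlayers)
print(homePlayers)
```

Filtering on the header class instead of deleting index 0 means the result does not depend on how many tables the page contains.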

If you'd like the full name, you'll have to extract it from the href of the <a> tag that wraps each player's name.
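
A rough sketch of that idea follows. It assumes every player link has the form /feature/player/<first>_<last>/index.html, as in the question's HTML, and rebuilds the name by capitalizing each underscore-separated word of that path segment. The helper name full_name_from_href and the shortened html_doc are hypothetical.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for one of the tables in the question.
html_doc = """
<table><tbody>
  <tr class="table-header"><td>Name</td><td>MIN</td><td>PTS</td><td>REB</td><td>AST</td></tr>
  <tr><td><a href="/feature/player/carmelo_anthony/index.html?locale=en_US">C.Anthony</a></td><td>30</td><td>19</td><td>5</td><td>3</td></tr>
  <tr><td><a href="/feature/player/kristaps_porzingis/index.html?locale=en_US">K.Porzingis</a></td><td>33</td><td>16</td><td>7</td><td>0</td></tr>
</tbody></table>
"""


def full_name_from_href(href):
    """Turn '/feature/player/carmelo_anthony/index.html?...' into 'Carmelo Anthony'.

    Assumes the name slug is the path segment right after 'player'.
    """
    parts = href.split("?")[0].strip("/").split("/")
    slug = parts[parts.index("player") + 1]
    return " ".join(word.capitalize() for word in slug.split("_"))


soup = BeautifulSoup(html_doc, "html.parser")

players = []
for row in soup.find_all("tr"):
    link = row.find("a")
    if link is None:
        # Header rows contain no player link, so they are skipped here.
        continue
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    cells[0] = full_name_from_href(link["href"])
    players.append(cells)

print(players)
```

The URL slug only approximates the real name: kyle_oquinn would come out as "Kyle Oquinn" and tristan_t_thompson as "Tristan T Thompson", so names that need exact spelling may have to be corrected by hand.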
