Python script to extract data from an HTML page

I'm trying to do a massive data accumulation on college basketball teams. This link: https://www.teamrankings.com/ncb/stats/ has a TON of team stats.

I have tried to write a script that scans all the desired links (all Team Stats) from this page, finds the rank of the specified team (an input), then returns the sum of that team's ranks from all links.

Fortunately, I found this: https://gist.github.com/phillipsm/404780e419c49a5b62a8

...which is GREAT!

But I must have something wrong, because I'm getting 0.

Here's my code:

import requests
from bs4 import BeautifulSoup
import time

url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")

stat_links = []

for table_row in soup.select(".expand-section li"):

    table_cells = table_row.findAll('li')

    if len(table_cells) > 0:
        link = table_cells[0].find('a')['href']
        stat_links.append(link)

total_rank = 0

for link in stat_links:
    r = requests.get(link)
    soup = BeaultifulSoup(r.text)

    team_rows = soup.select(".tr-table datatable scrollable dataTable no-footer tr")

    for row in team_rows:
        if row.findAll('td')[1].text.strip() == 'Oklahoma':
            rank = row.findAll('td')[0].text.strip()
            total_rank = total_rank + rank

print total_rank

Check out that link to double-check that I have the correct class specified. I have a feeling the problem might be in the first for loop, where I select an li tag and then select all li tags within that first tag, but I'm not sure.

I don't use Python, so I'm unfamiliar with any debugging tools. If anyone wants to point me to one of those, that would be great!

First, the team stats and player stats sections are each contained in a <div class="column large-2">. The team stats are in the first occurrence. You can then find every tag inside it that has an href attribute. I've combined both steps in a one-liner.

teamstats = soup(class_='column large-2')[0].find_all(href=True)

The teamstats list contains all of the 'a' tags. Use a list comprehension to extract the links. A few of the hrefs were just "#" (part of the navigation links), so I excluded them.

links = [a['href'] for a in teamstats if a['href'] != '#']

Here is a sample of the output:

links
Out[84]: 
['/ncaa-basketball/stat/points-per-game',
 '/ncaa-basketball/stat/average-scoring-margin',
 '/ncaa-basketball/stat/offensive-efficiency',
 '/ncaa-basketball/stat/floor-percentage',
 '/ncaa-basketball/stat/1st-half-points-per-game',
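
The hrefs in this output are relative paths, so requests.get cannot fetch them as they are. A minimal sketch of one way to turn them into full URLs follows. It assumes the links are resolved against the page URL from the question, and the two sample paths are copied from the output above:

```python
from urllib.parse import urljoin

# Page the links were scraped from (taken from the question).
BASE_URL = 'https://www.teamrankings.com/ncb/stats/'

# Two of the relative hrefs shown in the sample output.
links = [
    '/ncaa-basketball/stat/points-per-game',
    '/ncaa-basketball/stat/average-scoring-margin',
]

# urljoin resolves each relative path against the base URL.
absolute_links = [urljoin(BASE_URL, href) for href in links]

for url in absolute_links:
    print(url)
```

Because each path starts with "/", urljoin keeps only the scheme and host of the base URL and replaces the rest of the path.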

I ran your code on my machine, and the line table_cells = table_row.findAll('li') always returns an empty list, so stat_links ends up empty. As a result, the loop over stat_links never runs and total_rank is never incremented. I suggest you experiment with the way you find the list elements.
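
To illustrate the empty-list behaviour, here is a small sketch that can be run offline. The HTML snippet is hypothetical and is not copied from teamrankings.com. It only assumes a flat list in which each li holds a link directly, which would explain why searching an li for more li tags finds nothing:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only (an assumption, not the real page).
html = """
<ul class="expand-section">
  <li><a href="/stat/points-per-game">Points per Game</a></li>
  <li><a href="/stat/floor-percentage">Floor %</a></li>
  <li><a href="#">More</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all only searches descendants, and these <li> tags contain no nested <li>.
nested_counts = [len(item.find_all("li"))
                 for item in soup.select(".expand-section li")]
print(nested_counts)

# Selecting the anchors directly avoids the problem; "#" placeholders are skipped.
stat_links = [a["href"]
              for a in soup.select(".expand-section li a[href]")
              if a["href"] != "#"]
print(stat_links)
```

Whether the real page uses the same class names and nesting is something to check in the browser's inspector. Once stat_links is populated, the second loop in the question would also need attention: BeaultifulSoup is misspelled, and rank is a string, so it has to go through int() before being added to total_rank.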
