Scrape Each Table from Drop Down Menu Python

I am looking to scrape Division 3 college basketball stats from the following NCAA stats page:

https://stats.ncaa.org/rankings/change_sport_year_div

To get to the page I am on, follow the link and select Sport = Men's Basketball, Year = 2019-2020, and Div = III.

Once you are there, a dropdown labeled "Additional Stats" sits above the table in the top left corner. Each stat has a table that you can download as an Excel file, but I want to be more efficient. I was thinking there could be a way to iterate through the dropdown using BeautifulSoup (or perhaps even pd.read_html) to get a DataFrame for every stat listed. Is there a way to do this? Going through each stat manually, downloading the Excel file, and reading it into pandas would be a pain. Thank you.


Here is my suggestion: use a combination of requests, BeautifulSoup, and a great HTML table parser from Scott Rome (I modified the parse_html_table function slightly to remove \n and strip whitespace).

First, when you inspect the source code of the page, you can see that each stat has a URL of the form "https://stats.ncaa.org/rankings/national_ranking?academic_year=2020.0&division=3.0&ranking_period=110.0&sport_code=MBB&stat_seq=145.0", here for stat 145, i.e. "Scoring Offense".

You can therefore run the code below on each of these URLs, replacing 145.0 with the value for each stat. The values are listed in the dropdown's option tags in the page source:

# <option value="625">3-pt Field Goal Attempts</option>
# <option value="474">Assist Turnover Ratio</option>
# <option value="216">Assists Per Game</option>
# ...
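As a minimal sketch of the URL substitution described above, the query string can be assembled from its parts rather than edited by hand. The helper name build_ranking_url is made up for illustration, and the default parameter values are simply the ones from the example URL; whether the site accepts every combination is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "https://stats.ncaa.org/rankings/national_ranking"


def build_ranking_url(stat_seq, academic_year=2020.0, division=3.0,
                      ranking_period=110.0, sport_code="MBB"):
    """Return the ranking-page URL for one stat (hypothetical helper)."""
    query = {
        "academic_year": academic_year,
        "division": division,
        "ranking_period": ranking_period,
        "sport_code": sport_code,
        # The example URL writes the stat as a float ("145.0"), so mirror that.
        "stat_seq": float(stat_seq),
    }
    return f"{BASE_URL}?{urlencode(query)}"


print(build_ranking_url(145))
```

Calling build_ranking_url("625") would give the corresponding URL for "3-pt Field Goal Attempts".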

For a specific stat, here scoring offense, you can use the following code to extract the table as a pandas DataFrame:

import pandas as pd
from bs4 import BeautifulSoup
import requests


el = "https://stats.ncaa.org/rankings/national_ranking?academic_year=2020.0&division=3.0&ranking_period=110.0&sport_code=MBB&stat_seq=145.0"
page = requests.get(el).content.decode('utf-8')
soup = BeautifulSoup(page, "html.parser")
ta = soup.find_all('table', {"id": "rankings_table"})

# Scott Rome function tweaked a bit
def parse_html_table(table):
    n_columns = 0
    n_rows = 0
    column_names = []

    # Find number of rows and columns
    # we also find the column titles if we can
    for row in table.find_all('tr'):

        # Determine the number of rows in the table
        td_tags = row.find_all('td')
        if len(td_tags) > 0:
            n_rows += 1
            if n_columns == 0:
                # Set the number of columns for our table
                n_columns = len(td_tags)

        # Handle column names if we find them
        th_tags = row.find_all('th')
        if len(th_tags) > 0 and len(column_names) == 0:
            for th in th_tags:
                column_names.append(th.get_text())

    # Safeguard on Column Titles
    if len(column_names) > 0 and len(column_names) != n_columns:
        raise Exception("Column titles do not match the number of columns")

    columns = column_names if len(column_names) > 0 else range(0, n_columns)
    df = pd.DataFrame(columns=columns,
                      index=range(0, n_rows))
    row_marker = 0
    for row in table.find_all('tr'):
        column_marker = 0
        columns = row.find_all('td')
        for column in columns:
            df.iat[row_marker, column_marker] = column.get_text()
            column_marker += 1
        if len(columns) > 0:
            row_marker += 1

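    # Remove newlines and strip surrounding whitespace
    for col in df:
        try:
            df[col] = df[col].str.replace("\n", "")
            df[col] = df[col].str.strip()
        except AttributeError:
            # .str raises AttributeError on columns that do not hold strings
            pass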
    # Convert to float if possible
    for col in df:
        try:
            df[col] = df[col].astype(float)
        except ValueError:
            pass

    return df


example = parse_html_table(ta[0])

The result is:

 Rank                           Team    GM    W-L    PTS    PPG
0    1             Greenville (SLIAC)  27.0  14-13  3,580  132.6
1    2  Grinnell (Midwest Conference)  25.0  13-12  2,717  108.7
2    3             Pacific (OR) (NWC)  25.0   7-18  2,384   95.4
3    4                  Whitman (NWC)  28.0   20-8  2,646   94.5
4    5            Valley Forge (ACAA)  22.0  12-11  2,047   93.0
...
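Note that in the output above PTS still contains a thousands separator ("3,580"), so the float conversion skips it and the column stays as text, and W-L is a combined string. If numeric values are needed, a small post-processing step along these lines might help. This is only a sketch on two rows copied from the output; the new column names W and L are my own choice.

```python
import pandas as pd

# Two rows copied from the result above, with the columns still stored as text.
df = pd.DataFrame({
    "Team": ["Greenville (SLIAC)", "Whitman (NWC)"],
    "W-L": ["14-13", "20-8"],
    "PTS": ["3,580", "2,646"],
})

# Drop the thousands separator, then convert the column to numbers.
df["PTS"] = pd.to_numeric(df["PTS"].str.replace(",", "", regex=False))

# Split the "wins-losses" string into two integer columns.
df[["W", "L"]] = df["W-L"].str.split("-", expand=True).astype(int)

print(df)
```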

Now, all you have to do is apply this to all the stat values mentioned above.

You can wrap the code above in a function and call it in a for loop on the URL "https://stats.ncaa.org/rankings/national_ranking?academic_year=2020.0&division=3.0&ranking_period=110.0&sport_code=MBB&stat_seq={}".format(stat), where stat runs over the list of all possible values.
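A rough sketch of that loop could look like the following. It assumes you have a function that takes a stat value and returns its table; here that is the hypothetical fetch_table argument, which would wrap the requests, BeautifulSoup and parse_html_table steps. The pause between requests and the skip-on-error behaviour are my own additions, not something the site documents.

```python
import time


def collect_tables(stats, fetch_table, delay_seconds=1.0):
    """Call fetch_table(stat_seq) for every stat and key the results by stat name.

    stats maps each dropdown value to its label, e.g. {"145": "Scoring Offense"}.
    fetch_table is a hypothetical callable that downloads and parses one table.
    """
    tables = {}
    for stat_seq, stat_name in stats.items():
        try:
            tables[stat_name] = fetch_table(stat_seq)
        except Exception as exc:
            # Keep going if a single stat page fails to download or parse.
            print(f"Skipping {stat_name} ({stat_seq}): {exc}")
        time.sleep(delay_seconds)  # be polite to the server between requests
    return tables


# Stand-in fetcher so the sketch runs without network access.
demo = collect_tables({"145": "Scoring Offense"}, lambda seq: f"table for {seq}",
                      delay_seconds=0)
print(demo)
```

With a real fetcher, the result would be a dict of DataFrames keyed by stat name.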

Hope it helps.

Here is perhaps a more concise way to do this:

import requests as rq
from bs4 import BeautifulSoup as bs
import pandas as pd

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0"}
params = {"sport_code": "MBB", "stat_seq": "518", "academic_year": "2020.0", "division": "3.0", "ranking_period": "110.0"}
url = "https://stats.ncaa.org/rankings/national_ranking"

resp = rq.post(url, headers=headers, params=params)
soup = bs(resp.content, "html.parser")  # name the parser explicitly to avoid the bs4 warning

# Header cells give the column names; each body row becomes a list of cell texts
colnames = [th.text.strip() for th in soup.find_all("thead")[0].find_all("th")]
data = [[td.text.strip() for td in tr.find_all('td')] for tr in soup.find_all('tbody')[0].find_all("tr")]

df = pd.DataFrame(data, columns=colnames)
df = df.astype({"GM": "int32"})  # convert columns to the types you want

You have to look at the XHR requests (in Firefox: F12 -> Network -> XHR).

When you select an item from the dropdown list, the page makes a POST request to the following URL: https://stats.ncaa.org/rankings/national_ranking

This POST request requires some parameters, one of which is "stat_seq". Its value corresponds to the "value" attribute of the dropdown options.

The Inspector gives you the list of "value" to stat name correspondences:

<option value="625" selected="selected">3-pt Field Goal Attempts</option>
<option value="474">Assist Turnover Ratio</option>
<option value="216">Assists Per Game</option>
<option value="214">Blocked Shots Per Game</option>
<option value="859">Defensive Rebounds per Game</option>
<option value="642">Fewest Fouls</option>
...
...
...
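Rather than copying these values out of the Inspector by hand, you could probably pull them out of the HTML with BeautifulSoup. The sketch below runs on a short sample built from the options listed above; the surrounding select tag is an assumption. On the real page you would need to narrow the search to the "Additional Stats" dropdown, since the sport, year and division dropdowns also contain option tags.

```python
from bs4 import BeautifulSoup

# Sample markup built from the options shown above (the <select> wrapper is assumed).
sample_html = """
<select>
  <option value="625" selected="selected">3-pt Field Goal Attempts</option>
  <option value="474">Assist Turnover Ratio</option>
  <option value="216">Assists Per Game</option>
</select>
"""


def option_map(html):
    """Return {value: stat name} for every <option> that has a non-empty value."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        opt["value"]: opt.get_text(strip=True)
        for opt in soup.find_all("option")
        if opt.get("value")
    }


stats = option_map(sample_html)
print(stats)
```

Each key of stats can then be used as the "stat_seq" parameter.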


 