
Web scraping data tables to Excel

I'm trying to scrape data from a site to Excel. Right now it works fine, but when it searches someone like Sergio Rodriguez, multiple names come up ( https://basketball.realgm.com/search?q=Sergio+Rodriguez ), so it skips the name and prints "No international table for Sergio Rodriguez." How do I select the one who played in the NBA from that list and continue scraping the per-game and advanced stats tables to Excel? In this case, Rodriguez is second when you search his name.

import requests
from bs4 import BeautifulSoup
import pandas as pd


playernames=['Carlos Delfino', 'Sergio Rodriguez']

result = pd.DataFrame()
for name in playernames:

    fname=name.split(" ")[0]
    lname=name.split(" ")[1]
    url="https://basketball.realgm.com/search?q={}+{}".format(fname,lname)
    response = requests.get(url)

    soup = BeautifulSoup(response.content, 'html.parser')


    try:
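        # When the search returns several players instead of a single profile page,
        # these headings are missing, .find() returns None, and the except branch fires.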
        table1 = soup.find('h2',text='International Regular Season Stats - Per Game').findNext('table')
        table2 = soup.find('h2',text='International Regular Season Stats - Advanced Stats').findNext('table')

        df1 = pd.read_html(str(table1))[0]
        df2 = pd.read_html(str(table2))[0]

        commonCols = list(set(df1.columns) & set(df2.columns))
        df = df1.merge(df2, how='left', on=commonCols)
        df['Player'] = name

    except:
        print ('No international table for %s.' %name)
        df = pd.DataFrame([name], columns=['Player'])

    result = result.append(df, sort=False).reset_index(drop=True)

cols = list(result.columns)
cols = [cols[-1]] + cols[:-1]
result = result[cols]
result.to_csv('international players.csv', index=False)

Check the URL of the page you receive: a search that results in a single match redirects you to

https://basketball.realgm.com/player/{player-name}/Summary/{player-id}

but when there is more than one result you get

https://basketball.realgm.com/search?q={player-name}

Write a parser function for both URLs, for example (pseudocode):

...
for name in playernames:
    fname = name.split(" ")[0]
    lname = name.split(" ")[1]
    url = "https://basketball.realgm.com/search?q={}+{}".format(fname, lname)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # check the response URL: a multi-result search stays on the /search page
    if response.url.startswith("https://basketball.realgm.com/search"):
        # parse the search results, finding the players you want
        ... get urls from the table ...
        soup.table...  # etc.
        for player_url in player_urls:
            response = requests.get(player_url)
            soup = BeautifulSoup(response.content, 'html.parser')
            # call the parse function for a player page
            ...
            parse_player(soup)
    else:  # a single match redirected straight to the player page
        # call the parse function for a player page, same as above
        ...
        parse_player(soup)
    ...

There's a small amount of code duplication, but don't worry about that while you get your head around it and make it work.
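To show one way of picking the NBA player out of a multi-result search, here is a minimal sketch: follow each candidate link from the results table and keep the first profile page that also carries an NBA stats heading. The BASE constant, the get_player_soup helper, the 'NBA Regular Season Stats - Per Game' heading text, and the assumption that the results-table links are relative /player/ paths are all mine and not confirmed from the site, so verify them against the live pages before relying on them.

import requests
from bs4 import BeautifulSoup

BASE = "https://basketball.realgm.com"

def get_player_soup(name):
    """Return the soup of a player's profile page, resolving multi-result searches."""
    fname, lname = name.split(" ", 1)
    response = requests.get("{}/search?q={}+{}".format(BASE, fname, lname))
    soup = BeautifulSoup(response.content, 'html.parser')

    # A single match redirects straight to the /player/... profile page.
    if "/search" not in response.url:
        return soup

    # Multiple matches: follow each candidate profile link in the results table
    # (assumes the links are relative paths containing "/player/").
    for link in soup.select('table a[href*="/player/"]'):
        candidate = requests.get(BASE + link['href'])
        candidate_soup = BeautifulSoup(candidate.content, 'html.parser')
        # Keep the first candidate whose page has an NBA stats heading
        # (heading text is an assumption, mirrored from the international headings above).
        if candidate_soup.find('h2', text='NBA Regular Season Stats - Per Game'):
            return candidate_soup
    return None  # none of the candidates played in the NBA

With a helper like that, the loop in the question only needs to call it once per name, skip the player when it returns None, and run the existing international-table code against the returned soup instead of the raw search response.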
