
Python Appending DataFrame, weird for loop error

I'm working on some NFL statistics web scraping, though honestly the activity doesn't matter much. I spent a ton of time debugging because I couldn't believe what the code was doing. Either I'm going crazy or there is some sort of bug in a package or in Python itself. Here's the code I'm working with:

import pandas as pd
from bs4 import BeautifulSoup as bs
import requests
import string
import numpy as np

#get player list
players = pd.DataFrame({"name":[],"url":[],"positions":[],"startYear":[],"endYear":[]})
letters = list(string.ascii_uppercase)
for letter in letters:
    print(letter)
    players_html = requests.get("https://www.pro-football-reference.com/players/"+letter+"/")
    soup = bs(players_html.content,"html.parser")
    for player in soup.find("div",{"id":"div_players"}).find_all("p"):
        temp_row = {}
        temp_row["url"] = "https://www.pro-football-reference.com"+player.find("a")["href"]
        temp_row["name"] = player.text.split("(")[0].strip()
        years = player.text.split(")")[1].strip()
        temp_row["startYear"] = int(years.split("-")[0])
        temp_row["endYear"] = int(years.split("-")[1])
        temp_row["positions"] = player.text.split("(")[1].split(")")[0]
        players = players.append(temp_row,ignore_index=True)
players = players[players.endYear > 2000]
players.reset_index(inplace=True,drop=True)

game_df = pd.DataFrame()
def apply_test(row):
    #print(row)
    url = row['url']
    #print(list(range(int(row['startYear']),int(row['endYear'])+1)))
    for yr in range(int(row['startYear']),int(row['endYear'])+1):
        print(yr)
        content = requests.get(url.split(".htm")[0]+"/gamelog/"+str(yr)).content
        soup = bs(content,'html.parser').find("div",{"id":"all_stats"})
        #overheader
        over_headers = []
        for over in soup.find("thead").find("tr").find_all("th"):
            if("colspan" in over.attrs.keys()):
                for i in range(0,int(over['colspan'])):
                    over_headers = over_headers + [over.text]
            else:
                over_headers = over_headers + [over.text]
        #headers
        headers = []
        for header in soup.find("thead").find_all("tr")[1].find_all("th"):
            headers = headers + [header.text]
        all_headers = [a+"___"+b for a,b in zip(over_headers,headers)]
        #remove first column, it's meaningless
        all_headers = all_headers[1:len(all_headers)]
        for row in soup.find("tbody").find_all("tr"):
            temp_row = {}
            for i,col in enumerate(row.find_all("td")):
                temp_row[all_headers[i]] = col.text
            game_df = game_df.append(temp_row,ignore_index=True)
players.apply(apply_test,axis=1)


Again, I could get into what I'm trying to do, but there seems to be a much higher-level issue here. startYear and endYear in the for loop are 2013 and 2014, so the loop should set the yr variable to 2013 and then 2014. But the output of print(yr) shows 2013 printed twice. If you simply comment out the game_df = game_df.append(temp_row,ignore_index=True) line, the printed values of yr are correct. There is an error shortly after the first two lines, but that is expected and one I am comfortable debugging. The fact that appending to a global dataframe makes a for loop behave differently is blowing my mind right now. Can someone help with this?

Thanks.

I don't really follow what the overall aim is, but I do note two things:

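  1. game_df is assigned inside apply_test, so Python treats it as a local name for the whole function, and reading it on the right-hand side fails because the local has no value yet. Either declare global game_df inside the function before game_df = game_df.append(temp_row,ignore_index=True), or, better still, pass it as an argument in the def signature, though you would then need to amend players.apply(apply_test,axis=1) accordingly.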

  2. You need to handle the cases where find returns None, e.g. in soup.find("thead").find_all("tr")[1].find_all("th") for the page https://www.pro-football-reference.com/players/A/AaitIs00/gamelog/2014 . Perhaps wrap these calls in try/except blocks that supply appropriate default values.
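
The first point can be reproduced without pandas or any network access. The sketch below is a hypothetical reduction, not the question's code: a plain list named rows stands in for game_df, and the function names broken_collect, fixed_with_global and fixed_with_arg are made up for illustration. It assumes the code runs at module level.

```python
rows = []  # module-level accumulator; stands in for the question's game_df


def broken_collect(item):
    # The assignment below makes `rows` local to this function, so the
    # right-hand side reads a local name that has no value yet.
    rows = rows + [item]


def fixed_with_global(item):
    global rows  # rebind the module-level name instead of creating a local
    rows = rows + [item]


def fixed_with_arg(acc, item):
    # Usually cleaner: take the accumulator as an argument and return the result.
    return acc + [item]


try:
    broken_collect(2013)
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__

fixed_with_global(2013)
collected = fixed_with_arg(rows, 2014)
print(error_name, rows, collected)
```

Applied to the question, this would mean either putting global game_df at the top of apply_test or having the function build and return its rows so the caller can combine them.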

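For the second point, here is a minimal sketch of guarding against find returning None by checking each step explicitly rather than with try/except. The helper name header_cells and the inline HTML snippets are invented for illustration and are not taken from the real site. It assumes the beautifulsoup4 package is installed, as in the question.

```python
from bs4 import BeautifulSoup


def header_cells(html):
    """Return the text of the second header row, or [] if any piece is missing."""
    soup = BeautifulSoup(html, "html.parser")
    stats = soup.find("div", {"id": "all_stats"})
    if stats is None:  # page has no stats container
        return []
    thead = stats.find("thead")
    if thead is None:  # container present but no table header
        return []
    header_rows = thead.find_all("tr")
    if len(header_rows) < 2:  # no second header row to index
        return []
    return [th.text for th in header_rows[1].find_all("th")]


# Made-up pages: one with a two-row header, one with no table at all.
with_table = (
    '<div id="all_stats"><table><thead>'
    "<tr><th>Over</th></tr><tr><th>Date</th><th>Tm</th></tr>"
    "</thead></table></div>"
)
without_table = "<p>No game log for this season.</p>"

print(header_cells(with_table))
print(header_cells(without_table))
```

Returning an empty list lets the caller skip a season with no game log instead of crashing on a None attribute access. Whether an empty default is the right choice depends on what the rest of the scraper expects.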