
Pandas averaging selected rows and columns

I am working with some EPL stats. I have a CSV with all matches from one season in the following format.

D           H                A           H_SC  A_SC H_ODDS  D_ODDS  A_ODDS...
11.05.2014  Norwich          Arsenal     0     2    5.00    4.00    1.73 
11.05.2014  Chelsea          Swansea     0     0    1.50    3.00    5.00     

What I would like to do is, for each match, calculate the average stats of both teams over their N previous matches. The result should look something like this.

D           H        A        H_SC         A_SC         H_ODDS  D_ODDS  A_ODDS...
11.05.2014  Norwich  Arsenal  avgNorwichSC avgArsenalSC 5.00    4.00    1.73
11.05.2014  Chelsea  Swansea  avgChelseaSC avgSwanseaSC 1.50    3.00    5.00 

So the date, teams and odds remain untouched, and the other stats are replaced with averages over the N previous matches. EDIT: The matches from the first N rounds should not be in the final table, because there is not enough data to calculate their averages.

The trickiest part for me is that the stats I am averaging have a different prefix (H_ or A_) depending on where the match was played.

All I have managed to do so far is create a dictionary where the key is the club name and the value is a DataFrame containing all matches played by that club.

D           H        A          H_SC  A_SC  H_ODDS  D_ODDS  A_ODDS...
11.05.2014  Norwich  Arsenal    0     2     5.00    4.00    1.73
04.05.2014  Arsenal  West Brom  1     0     1.40    5.25    8.00 
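
For reference, a minimal sketch of how such a per-club dictionary can be built. It uses a tiny hand-made frame in place of the real CSV; the sample rows and the names `matches` and `matchesByClub` are illustrative only.

```python
import pandas

# Tiny stand-in for the season CSV; only the columns needed here.
matches = pandas.DataFrame({
    "D": ["11.05.2014", "11.05.2014", "04.05.2014"],
    "H": ["Norwich", "Chelsea", "Arsenal"],
    "A": ["Arsenal", "Swansea", "West Brom"],
    "H_SC": [0, 0, 1],
    "A_SC": [2, 0, 0],
})

# Every club that appears either at home or away.
clubs = pandas.unique(matches[["H", "A"]].values.ravel())

# Map each club name to the rows where it played, home or away.
matchesByClub = {
    club: matches[(matches["H"] == club) | (matches["A"] == club)]
    for club in clubs
}

print(matchesByClub["Arsenal"])
```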

I have also previously coded this without pandas, but I was not satisfied with the code and I would like to learn pandas :).

You say you want to learn pandas, so I've given a few examples (tested with similar data) to get you going along the right track. It's a bit of an opinion, but I think finding the last N games is hard, so I'll initially assume you want averages over the whole table. If finding the "last N" is really important, I can add to the answer. This should get you going with pandas and groupby. I've left the prints in so you can see what's going on.

import pandas

# DataFrame.from_csv no longer exists in current pandas; read_csv keeps 'D' as a column
EPL_df = pandas.read_csv('D:\\EPLstats.csv')
# Parse the match dates (see the note on the date format in the Edit below)
EPL_df['D'] = pandas.to_datetime(EPL_df['D'])
homeGroup = EPL_df.groupby('H')
awayGroup = EPL_df.groupby('A')

# Most recent match date for each team: at home, away, and overall
homeLastGame = homeGroup['D'].max()
awayLastGame = awayGroup['D'].max()
teamLastGame = pandas.concat([homeLastGame, awayLastGame]).groupby(level=0).max()
print(teamLastGame)

# Average goals scored per team: home only, away only, and over all matches
homeAveScore = homeGroup['H_SC'].mean()
awayAveScore = awayGroup['A_SC'].mean()
teamAveScore = (homeGroup['H_SC'].sum() + awayGroup['A_SC'].sum()) / (homeGroup['H_SC'].count() + awayGroup['A_SC'].count())

print(teamAveScore)

You now have average scores for each team along with their most recent match dates. All you have to do now is select the relevant rows of the original dataframe using the most recent dates (i.e. everything apart from the score columns), and then look up the average scores using the team names from each row.

For example:

# Keep only the matches played after a cut-off date
recentRows = EPL_df.loc[EPL_df['D'] > pandas.to_datetime("2015/01/10")]

print(recentRows)

def insertAverages(s):
    # Look up the overall average score of the home and away team in this row
    a = teamAveScore[s['H']]
    b = teamAveScore[s['A']]
    print(a, b)
    return pandas.Series(dict(H_AVSC=a, A_AVSC=b))

finalTable = pandas.concat([recentRows, recentRows.apply(insertAverages, axis=1)], axis=1)

print(finalTable)

finalTable has your original odds etc. for the most recent games, with two extra columns (H_AVSC and A_AVSC) holding the average scores of the home and away teams involved in those matches.
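
As an aside, the row-wise apply can probably be replaced by a vectorised lookup with `Series.map`, which tends to be faster on larger tables. A minimal sketch, using made-up values in place of the real teamAveScore and recentRows:

```python
import pandas

# Made-up stand-ins for teamAveScore and recentRows (values are invented).
teamAveScore = pandas.Series(
    {"Norwich": 0.8, "Arsenal": 1.8, "Chelsea": 1.9, "Swansea": 1.4}
)
recentRows = pandas.DataFrame({
    "H": ["Norwich", "Chelsea"],
    "A": ["Arsenal", "Swansea"],
    "H_ODDS": [5.00, 1.50],
})

# map() looks each team name up in the index of teamAveScore.
finalTable = recentRows.assign(
    H_AVSC=recentRows["H"].map(teamAveScore),
    A_AVSC=recentRows["A"].map(teamAveScore),
)
print(finalTable)
```

A team name that is missing from teamAveScore gives NaN here, rather than raising a KeyError as the apply version would.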

Edit

A couple of gotchas:

  1. I just noticed I didn't put a format string in to_datetime(). Your dates look like UK format (day first) with dots, so you should do:

    EPL_df['D'] = pandas.to_datetime(EPL_df['D'], format='%d.%m.%Y')

  2. You could use the minimum of the dates in teamLastGame instead of the hard-coded 2015/01/10 in my example.

  3. If you really need to replace column H_SC with H_AVSC in your finalTable, rather than add the averages as extra columns:

    newCols = recentRows.apply(insertAverages, axis=1)
    recentRows = recentRows.copy()  # work on a copy, not a slice of EPL_df
    recentRows['H_SC'] = newCols['H_AVSC']
    recentRows['A_SC'] = newCols['A_AVSC']
    print(recentRows)
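
If the averages really do have to cover only the N previous matches of each team, as the question asks, one possible approach is to reshape the table so that every match appears once per team, and then take a shifted rolling mean per team. This is only a sketch under assumptions, not something run against the real file: the sample data is invented, N is set to 2, and the names `perTeam`, `TEAM`, `SC`, `SIDE`, `MATCH` and `AVG_SC` are made up for the illustration.

```python
import pandas

N = 2  # number of previous matches to average over (assumed value)

# Invented sample in the question's layout; real data would come from the CSV.
EPL_df = pandas.DataFrame({
    "D": ["01.05.2014", "04.05.2014", "07.05.2014", "11.05.2014"],
    "H": ["Arsenal", "Norwich", "Arsenal", "Norwich"],
    "A": ["Norwich", "Arsenal", "Norwich", "Arsenal"],
    "H_SC": [1, 0, 3, 0],
    "A_SC": [0, 2, 1, 2],
})
EPL_df["D"] = pandas.to_datetime(EPL_df["D"], format="%d.%m.%Y")
EPL_df = EPL_df.sort_values("D").reset_index(drop=True)

# One row per team per match, so H_SC and A_SC end up in a single SC column.
home = EPL_df[["H", "H_SC"]].rename(columns={"H": "TEAM", "H_SC": "SC"})
home = home.assign(SIDE="H", MATCH=EPL_df.index)
away = EPL_df[["A", "A_SC"]].rename(columns={"A": "TEAM", "A_SC": "SC"})
away = away.assign(SIDE="A", MATCH=EPL_df.index)
perTeam = pandas.concat([home, away], ignore_index=True)
perTeam = perTeam.sort_values("MATCH", kind="stable").reset_index(drop=True)

# Mean of the N matches before the current one; shift(1) leaves the
# current match out of its own average.
perTeam["AVG_SC"] = perTeam.groupby("TEAM")["SC"].transform(
    lambda s: s.shift(1).rolling(N).mean()
)

# Back to one row per match, with the averages replacing the raw scores.
avg = perTeam.pivot(index="MATCH", columns="SIDE", values="AVG_SC")
result = EPL_df.copy()
result["H_SC"] = avg["H"]
result["A_SC"] = avg["A"]

# Drop matches where either team has fewer than N earlier games.
result = result.dropna(subset=["H_SC", "A_SC"])
print(result)
```

Because each team's home and away scores sit in the same SC column after the reshape, the H_/A_ prefix no longer matters when averaging. Other stat columns could be handled the same way by carrying them through the reshape.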
