I am working with some EPL stats. I have a CSV with all matches from one season in the following format.
D H A H_SC A_SC H_ODDS D_ODDS A_ODDS...
11.05.2014 Norwich Arsenal 0 2 5.00 4.00 1.73
11.05.2014 Chelsea Swansea 0 0 1.50 3.00 5.00
What I would like to do is, for each match, calculate the average stats of both teams over their N previous matches. The result should look something like this.
D H A H_SC A_SC H_ODDS D_ODDS A_ODDS...
11.05.2014 Norwich Arsenal avgNorwichSC avgArsenalSC 5.00 4.00 1.73
11.05.2014 Chelsea Swansea avgChelseaSC avgSwanseaSC 1.50 3.00 5.00
So the date, teams and odds remain untouched, and the other stats are replaced with the average from the N previous matches. EDIT: The matches from the first N rounds should not be in the final table, because there is not enough data to calculate their averages.
The trickiest part for me is that the stats I am averaging have a different prefix (H_ or A_) depending on where the match was played.
All I have managed to do so far is to create a dictionary where the key is the club name and the value is a DataFrame containing all matches played by that club.
D H A H_SC A_SC H_ODDS D_ODDS A_ODDS...
11.05.2014 Norwich Arsenal 0 2 5.00 4.00 1.73
04.05.2014 Arsenal West Brom 1 0 1.40 5.25 8.00
I have also previously coded this without pandas, but I was not satisfied with the code and I would like to learn pandas :).
You say you want to learn pandas, so I've given a few examples (tested with similar data) to get you going along the right track. It's a bit of an opinion, but I think finding the last N games is hard, so I'll initially assume you want averages over the whole table. If finding the "last N" is really important, I can add to the answer. This should get you going with pandas and groupby - I've left the prints in so you can see what's going on.
import pandas
# DataFrame.from_csv has been removed from pandas; read_csv keeps 'D' as an ordinary column
EPL_df = pandas.read_csv('D:\\EPLstats.csv')
# Find the most recent date for each team
EPL_df['D'] = pandas.to_datetime(EPL_df['D'])
homeGroup = EPL_df.groupby('H')
awayGroup = EPL_df.groupby('A')
# The following give you Series mapping each team to the date of its last home / away game
homeLastGame = homeGroup['D'].max()
awayLastGame = awayGroup['D'].max()
# Combine both and keep the later date per team (the team name is the index)
teamLastGame = pandas.concat([homeLastGame, awayLastGame]).groupby(level=0).max()
print(teamLastGame)
# Average goals scored per team at home, away, and over all games
homeAveScore = homeGroup['H_SC'].mean()
awayAveScore = awayGroup['A_SC'].mean()
teamAveScore = (homeGroup['H_SC'].sum() + awayGroup['A_SC'].sum()) / (homeGroup['H_SC'].count() + awayGroup['A_SC'].count())
print(teamAveScore)
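The combined average above relies on every team appearing both at home and away; because Series addition aligns on the index, a team missing from one side would come out as NaN. A minimal sketch of one way around the H_/A_ prefix problem is to stack home and away rows into a single "long" table first. The toy scores and the column names team and SC are made up for illustration, not taken from the real CSV.

```python
import pandas as pd

# Toy data in the question's column layout (scores are invented for illustration).
matches = pd.DataFrame({
    "D": pd.to_datetime(["04.05.2014", "11.05.2014", "11.05.2014"], format="%d.%m.%Y"),
    "H": ["Arsenal", "Norwich", "Chelsea"],
    "A": ["West Brom", "Arsenal", "Swansea"],
    "H_SC": [1, 0, 0],
    "A_SC": [0, 2, 0],
})

# One row per team per match: home rows and away rows end up with the same column names.
# "team" and "SC" are hypothetical names chosen for this sketch.
home = matches[["D", "H", "H_SC"]].rename(columns={"H": "team", "H_SC": "SC"})
away = matches[["D", "A", "A_SC"]].rename(columns={"A": "team", "A_SC": "SC"})
long_df = pd.concat([home, away], ignore_index=True)

# Average score per team, regardless of where the match was played.
team_avg = long_df.groupby("team")["SC"].mean()
print(team_avg)
```

Here West Brom only appears as an away team and still gets an average, which the aligned-sum version would not give.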
You now have the average scores for each team along with their most recent match dates. All you have to do now is select the relevant rows of the original dataframe using the most recent dates (i.e. everything apart from the score columns), and then look up the average scores using the team names from each row.
For example:
recentRows = EPL_df.loc[EPL_df['D'] > pandas.to_datetime("2015/01/10")]
print(recentRows)

def insertAverages(s):
    # s is one row of recentRows; look up the overall average of each team in it
    a = teamAveScore[s['H']]
    b = teamAveScore[s['A']]
    print(a, b)
    return pandas.Series(dict(H_AVSC=a, A_AVSC=b))

finalTable = pandas.concat([recentRows, recentRows.apply(insertAverages, axis=1)], axis=1)
print(finalTable)
finalTable has your original odds etc. for the most recent games, with two extra columns (H_AVSC and A_AVSC) holding the average scores of the home and away teams involved in those matches.
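The row-by-row lookup can probably also be written without apply, since Series.map accepts another Series and looks each value up in its index. A small sketch, with made-up averages standing in for teamAveScore and a two-row stand-in for recentRows:

```python
import pandas as pd

# Stand-in for teamAveScore: team name -> average score (values are invented).
team_avg = pd.Series({"Norwich": 0.8, "Arsenal": 1.9, "Chelsea": 2.0, "Swansea": 1.2})

# Stand-in for recentRows, using the question's column names.
recent = pd.DataFrame({
    "H": ["Norwich", "Chelsea"],
    "A": ["Arsenal", "Swansea"],
    "H_ODDS": [5.00, 1.50],
    "D_ODDS": [4.00, 3.00],
    "A_ODDS": [1.73, 5.00],
})

# map() looks up each team name in team_avg's index; assign() returns a new frame
# and leaves `recent` itself unchanged.
final = recent.assign(
    H_AVSC=recent["H"].map(team_avg),
    A_AVSC=recent["A"].map(team_avg),
)
print(final)
```

A team name that is missing from the averages comes out as NaN rather than raising a KeyError, which may or may not be what you want.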
A couple of gotchas:
I just noticed I didn't put a format string in to_datetime(). Your dates look like UK format with dots, so you should do
EPL_df['D'] = pandas.to_datetime(EPL_df['D'], format='%d.%m.%Y')
You could use the minimum of the dates in teamLastGame instead of the hard coded 2015/01/10 in my example.
If you really need to replace column H_SC with H_AVSC in your finalTable, rather than add on the averages:
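newCols = recentRows.apply(insertAverages, axis=1)
# Work on an explicit copy so the assignments do not trigger SettingWithCopyWarning
recentRows = recentRows.copy()
recentRows['H_SC'] = newCols['H_AVSC']
recentRows['A_SC'] = newCols['A_AVSC']
print(recentRows)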
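For the "N previous matches" part that the examples above leave out, here is a rough sketch of one possible approach, not a tested solution for the real CSV. It stacks home and away rows into one long table, takes a per-team rolling mean that is shifted by one match so the current game is excluded, and then puts the averages back into H_SC and A_SC. The toy scores, N = 2, and the helper names one_side, team, SC, match, side and avg_prev are all assumptions made for illustration.

```python
import pandas as pd

N = 2  # how many previous matches to average over (assumed value)

# Toy season of three rounds in the question's layout (scores are invented).
matches = pd.DataFrame({
    "D": pd.to_datetime(
        ["01.05.2014", "01.05.2014", "04.05.2014",
         "04.05.2014", "11.05.2014", "11.05.2014"],
        format="%d.%m.%Y"),
    "H": ["Arsenal", "Norwich", "Chelsea", "Swansea", "Norwich", "Chelsea"],
    "A": ["Chelsea", "Swansea", "Norwich", "Arsenal", "Arsenal", "Swansea"],
    "H_SC": [2, 0, 3, 1, 0, 0],
    "A_SC": [1, 0, 0, 1, 2, 0],
})
matches = matches.sort_values(["D", "H"]).reset_index(drop=True)

def one_side(df, team_col, score_col):
    """Hypothetical helper: one row per match for the home ('H') or away ('A') side."""
    return (df[["D", team_col, score_col]]
            .rename(columns={team_col: "team", score_col: "SC"})
            .assign(match=df.index, side=team_col))

long_df = pd.concat(
    [one_side(matches, "H", "H_SC"), one_side(matches, "A", "A_SC")],
    ignore_index=True,
).sort_values(["team", "D"])

# shift(1) drops the current match from its own average; rolling(N) yields NaN
# until a team has N earlier matches.
long_df["avg_prev"] = long_df.groupby("team")["SC"].transform(
    lambda s: s.shift(1).rolling(N).mean()
)

# Back to one row per match, with one column per side.
wide = long_df.pivot(index="match", columns="side", values="avg_prev")

# Replace the scores with the averages and drop matches without enough history.
result = (matches.assign(H_SC=wide["H"], A_SC=wide["A"])
          .dropna(subset=["H_SC", "A_SC"]))
print(result)
```

With this toy data only the two matches on 11.05.2014 survive, because every team has exactly two earlier games by then. Other stat columns could be handled the same way by carrying them through the long table alongside SC.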