Use Groupby to Calculate Average if Date < X
I am trying to take a data frame of historical game statistics, like df1 below, and build a second data frame that shows what the various column averages were going into each game (as shown in df2). How can I use groupby, or something else, to find the averages for each team using only the games dated before the date in that specific row?
Example of historical games:
Df1 =
Date      Team    Opponent   Points  Points Against  1st Downs  Win?
4/16/20   Eagles  Ravens     10      20              10         0
2/10/20   Eagles  Falcons    30      40              8          0
12/15/19  Eagles  Cardinals  40      10              7          1
11/15/19  Eagles  Giants     20      15              5          1
10/12/19  Jets    Giants     10      18              2          1
Below is the dataframe that I'm trying to create. As you can see, it shows the averages for each column, but only over the games that happened prior to each game. Note: this is a simplified example of a much larger data set that I'm working with. In case the context helps, I'm trying to create this dataframe so I can analyze the correlation between the averages and whether the team won.
Df2 =
Date      Team    Opponent   Avg Pts  Avg Pts Against  Avg 1st Downs  Win %
4/16/20   Eagles  Ravens     25.0     21.3             7.5            75%
2/10/20   Eagles  Falcons    30.0     12.0             6.0            100%
12/15/19  Eagles  Cardinals  20.0     15.0             5.0            100%
11/15/19  Eagles  Giants     NaN      NaN              NaN            NaN
10/12/19  Jets    Giants     NaN      NaN              NaN            NaN
Let me know if anything above isn't clear. I appreciate the help.
The easiest way is to turn your dataframe into a time series. For a file, run this:
data=pd.read_csv(r'C:\Users\...csv',index_col='Date',parse_dates=True)
This is an example with a CSV file. After that, you can run:
data[:'#The Date you want to have all the dates before it']
If you want to build a Series that has a time index:
index=pd.DatetimeIndex(['2014-07-04',...,'2015-08-04'])
data=pd.Series([0, 1, 2, 3], index=index)
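As a rough illustration of the time-index idea, the sketch below rebuilds a few columns of Df1 in memory instead of reading a CSV, so the variable names (`games`, `cutoff`, and so on) and the cutoff date are assumptions rather than part of the answer. It also shows one detail worth knowing: label slicing on a sorted DatetimeIndex includes the endpoint, so a boolean mask is used when only strictly earlier games are wanted.

```python
import pandas as pd

# A few columns of Df1 from the question, typed in by hand (assumption:
# in practice these rows would come from your own file).
games = pd.DataFrame(
    {
        "Date": ["4/16/20", "2/10/20", "12/15/19", "11/15/19", "10/12/19"],
        "Team": ["Eagles", "Eagles", "Eagles", "Eagles", "Jets"],
        "Points": [10, 30, 40, 20, 10],
    }
)

# Parse the dates, use them as the index, and sort oldest first
games["Date"] = pd.to_datetime(games["Date"], format="%m/%d/%y")
games = games.set_index("Date").sort_index()

cutoff = pd.Timestamp("2020-02-10")  # hypothetical cutoff date

# Label slicing keeps every row up to AND including the cutoff
up_to_cutoff = games.loc[:cutoff]

# A boolean mask keeps only the games strictly before the cutoff
before_cutoff = games[games.index < cutoff]

# Average points for one team over those earlier games
eagles_avg = before_cutoff.loc[before_cutoff["Team"] == "Eagles", "Points"].mean()
print(eagles_avg)
```

This handles one cutoff date at a time; it would have to be repeated for each row to get the per-game averages the question asks for.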
Define your own function:
import numpy as np
import pandas as pd

def aggs_under_date(df, date):
    # df holds the rows of one group; date is the cutoff to compare against
    first_team = df['Team'].iloc[0]
    first_opponent = df['Opponent'].iloc[0]
    if df['Date'].iloc[0] <= date:
        avg_points = df['Points'].mean()
        avg_against = df['Points Against'].mean()
        avg_downs = df['1st Downs'].mean()
        wins = df['Win?']
        win_perc = f'{wins.sum() / wins.count() * 100} %'
        return [first_team, first_opponent, avg_points, avg_against, avg_downs, win_perc]
    else:
        # Group is after the cutoff: no averages to report
        return [first_team, first_opponent, np.nan, np.nan, np.nan, np.nan]
And do the groupby, applying the function you just defined:
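date_max = pd.to_datetime('11/15/19')
# Date must be a datetime column to compare with date_max;
# apply (not agg) passes each group to the function as a DataFrame
Df1['Date'] = pd.to_datetime(Df1['Date'])
Df1.groupby(['Date']).apply(aggs_under_date, date_max)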
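For the per-row version the question describes (averages over each team's earlier games only), one possible approach, which is not taken from either answer above, is an expanding mean within each team that is shifted down by one game. The sketch below rebuilds Df1 by hand; the names `df1`, `ordered`, `prior_avgs` and `df2` are illustrative, and it assumes a team has at most one game per date.

```python
import pandas as pd

# Df1 from the question, typed in by hand
df1 = pd.DataFrame(
    {
        "Date": ["4/16/20", "2/10/20", "12/15/19", "11/15/19", "10/12/19"],
        "Team": ["Eagles", "Eagles", "Eagles", "Eagles", "Jets"],
        "Opponent": ["Ravens", "Falcons", "Cardinals", "Giants", "Giants"],
        "Points": [10, 30, 40, 20, 10],
        "Points Against": [20, 40, 10, 15, 18],
        "1st Downs": [10, 8, 7, 5, 2],
        "Win?": [0, 0, 1, 1, 1],
    }
)
df1["Date"] = pd.to_datetime(df1["Date"], format="%m/%d/%y")

stat_cols = ["Points", "Points Against", "1st Downs", "Win?"]

# Sort oldest first so that "rows above" means "earlier games"
ordered = df1.sort_values(["Team", "Date"])

# Per team: running mean over all games so far, then shift by one row
# so each game only sees the games played before it.
prior_avgs = ordered.groupby("Team")[stat_cols].transform(
    lambda s: s.expanding().mean().shift(1)
)
prior_avgs.columns = ["Avg Pts", "Avg Pts Against", "Avg 1st Downs", "Win %"]
prior_avgs["Win %"] *= 100  # fraction of wins -> percentage

# Attach the averages to the identifying columns, newest game first
df2 = (
    ordered[["Date", "Team", "Opponent"]]
    .join(prior_avgs)
    .sort_values("Date", ascending=False)
    .reset_index(drop=True)
)
print(df2)
```

A team's first game has no earlier games, so its averages are NaN, as in Df2. On these sample rows the Eagles' average points going into the 4/16/20 game come out as 30.0 (the mean of 20, 40 and 30), so some values differ from the hand-written Df2 in the question.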