Get the difference between max and min for a groupby in pandas and calculate the average
I have a dataframe like below:
ticker fy fp value f_date rn
MSFT 2009 0 144 2010-01-01T12:12:34 0
AAPL 2010 0 144 2010-01-01T12:12:34 0
MSFT 2009 0 48 2014-05-01T12:12:34 1
AAPL 2011 0 80 2012-01-01T12:12:34 1
GOOG 2010 0 40 2010-01-01T12:12:34 0
I just want to group this data by ticker, fy, fp, like below:
df.groupby(by=['ticker', 'fy', 'fp'])
On the basis of this, I just want to calculate the difference between the max and min of f_date and divide it by the max of rn. For example, for the group MSFT, 2009, 0, the max date is 2014-05-01T12:12:34, the min date is 2010-01-01T12:12:34, and the max rn is 1, so I want to calculate it as (max(f_date) - min(f_date)) / (max(rn) + 1). That way I'll get the number of days between these two dates, so I can map this data with other data to do some analysis.
I'm unable to move forward after the groupby.
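For a reproducible test, the sample frame above can be rebuilt like this (the dtypes are my assumption; f_date is parsed to datetime up front, which the solutions below need anyway):

```python
import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame({
    'ticker': ['MSFT', 'AAPL', 'MSFT', 'AAPL', 'GOOG'],
    'fy':     [2009, 2010, 2009, 2011, 2010],
    'fp':     [0, 0, 0, 0, 0],
    'value':  [144, 144, 48, 80, 40],
    'f_date': ['2010-01-01T12:12:34', '2010-01-01T12:12:34',
               '2014-05-01T12:12:34', '2012-01-01T12:12:34',
               '2010-01-01T12:12:34'],
    'rn':     [0, 0, 1, 1, 0],
})
# Parse the ISO strings into real datetimes so min/max and subtraction work
df['f_date'] = pd.to_datetime(df['f_date'])
```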
For pandas 0.25+ it is possible to use named aggregations, then subtract and divide the columns:
df['f_date'] = pd.to_datetime(df['f_date'])
df = df.groupby(by=['ticker', 'fy', 'fp']).agg(min1=('f_date','min'),
max1=('f_date','max'),
rn=('rn', 'max'))
df['new'] = df['max1'].sub(df['min1']).div(df['rn'].add(1))
print (df)
min1 max1 rn new
ticker fy fp
AAPL 2010 0 2010-01-01 12:12:34 2010-01-01 12:12:34 0 0 days 00:00:00
2011 0 2012-01-01 12:12:34 2012-01-01 12:12:34 1 0 days 00:00:00
GOOG 2010 0 2010-01-01 12:12:34 2010-01-01 12:12:34 0 0 days 00:00:00
MSFT 2009 0 2010-01-01 12:12:34 2014-05-01 12:12:34 1 790 days 12:00:00
Or, if necessary, convert the difference of datetimes (timedeltas) to seconds with Series.dt.total_seconds:
df['new'] = df['max1'].sub(df['min1']).dt.total_seconds().div(df['rn'].add(1))
print (df)
min1 max1 rn new
ticker fy fp
AAPL 2010 0 2010-01-01 12:12:34 2010-01-01 12:12:34 0 0.0
2011 0 2012-01-01 12:12:34 2012-01-01 12:12:34 1 0.0
GOOG 2010 0 2010-01-01 12:12:34 2010-01-01 12:12:34 0 0.0
MSFT 2009 0 2010-01-01 12:12:34 2014-05-01 12:12:34 1 68299200.0
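Since the question asks for the number of days between the dates, the timedelta can also be divided by one day to get a float day count instead of seconds. A minimal sketch on the MSFT group's aggregated values (the min1/max1/rn column names follow the agg above):

```python
import numpy as np
import pandas as pd

# Aggregated values for the MSFT, 2009, 0 group from the example
agg = pd.DataFrame({
    'min1': pd.to_datetime(['2010-01-01 12:12:34']),
    'max1': pd.to_datetime(['2014-05-01 12:12:34']),
    'rn': [1],
})
# Divide the timedelta by one day to get fractional days, then by max(rn) + 1
agg['days'] = (agg['max1'].sub(agg['min1'])
                          .div(np.timedelta64(1, 'D'))
                          .div(agg['rn'].add(1)))
# 1581 days between the dates, halved -> 790.5
```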
Solution for older pandas versions:
df['f_date'] = pd.to_datetime(df['f_date'])
df = df.groupby(by=['ticker', 'fy', 'fp']).agg({'f_date':['min','max'],
'rn':'max'})
df.columns = df.columns.map('_'.join)
df['new'] = df['f_date_max'].sub(df['f_date_min']).div(df['rn_max'].add(1))
print (df)
f_date_min f_date_max rn_max \
ticker fy fp
AAPL 2010 0 2010-01-01 12:12:34 2010-01-01 12:12:34 0
2011 0 2012-01-01 12:12:34 2012-01-01 12:12:34 1
GOOG 2010 0 2010-01-01 12:12:34 2010-01-01 12:12:34 0
MSFT 2009 0 2010-01-01 12:12:34 2014-05-01 12:12:34 1
new
ticker fy fp
AAPL 2010 0 0 days 00:00:00
2011 0 0 days 00:00:00
GOOG 2010 0 0 days 00:00:00
MSFT 2009 0 790 days 12:00:00
Last, if necessary, convert the MultiIndex to columns:
df = df.reset_index()
print (df)
ticker fy fp f_date_min f_date_max rn_max \
0 AAPL 2010 0 2010-01-01 12:12:34 2010-01-01 12:12:34 0
1 AAPL 2011 0 2012-01-01 12:12:34 2012-01-01 12:12:34 1
2 GOOG 2010 0 2010-01-01 12:12:34 2010-01-01 12:12:34 0
3 MSFT 2009 0 2010-01-01 12:12:34 2014-05-01 12:12:34 1
new
0 0 days 00:00:00
1 0 days 00:00:00
2 0 days 00:00:00
3 790 days 12:00:00