Group by and subtract first occurrence and last occurrence in pandas
I have the following DataFrame in pandas:
code date time dip flag tank qty
123 2018-12-23 08:00:00 389 0 1 1300
123 2018-12-23 09:00:00 380 0 1 1250
123 2018-12-23 10:00:00 378 0 1 1200
123 2018-12-23 11:00:00 345 1 1 1150
123 2018-12-23 12:00:00 342 1 1 1100
123 2018-12-23 13:00:00 340 1 1 1050
123 2018-12-23 14:00:00 338 1 1 1000
123 2018-12-23 15:00:00 380 0 1 1500
123 2018-12-23 16:00:00 340 1 1 1000
123 2018-12-23 17:00:00 340 1 1 1000
123 2018-12-23 08:00:00 389 0 2 1300
123 2018-12-23 09:00:00 380 0 2 1250
123 2018-12-23 10:00:00 378 0 2 1200
123 2018-12-23 11:00:00 345 1 2 1150
123 2018-12-23 12:00:00 342 1 2 1100
123 2018-12-23 13:00:00 340 1 2 1050
123 2018-12-23 14:00:00 338 1 2 1000
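I want to find how many times dip went below 350, for how long (in hours) it stayed below 350, and the quantity sold while it was below 350. I have already set flag to 1 wherever dip is less than 350. This is my desired DataFrame: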
code date tank frequency qty_sold time
123 2018-12-23 1 4 150 3
123 2018-12-23 2 4 150 3
I am able to find the frequency with groupby. I need some help finding the other two.
df_agg= df.groupby(['code','date','tank']).agg({'flag':['sum']}).reset_index()
Use:
import pandas as pd
# create a datetime column
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])
# keep rows with dip < 350, then aggregate by first and last
df_agg = df[df['dip'] < 350].groupby(['code','date','tank']).agg({'flag':['sum'],
                                                                  'datetime':['first','last'],
                                                                  'qty':['first','last']})
# flatten MultiIndex columns
df_agg.columns = df_agg.columns.map('_'.join)
# subtract columns, convert timedeltas to hours
df_agg['qty_sold'] = df_agg.pop('qty_first') - df_agg.pop('qty_last')
df_agg['time'] = ((df_agg.pop('datetime_last') - df_agg.pop('datetime_first'))
                  .dt.total_seconds().div(3600).astype(int))
# rename column and create default index
df_agg = df_agg.rename(columns={'flag_sum':'frequency'}).reset_index()
print(df_agg)
   code        date  tank  frequency  qty_sold  time
0   123  2018-12-23     1          4       150     3
1   123  2018-12-23     2          4       150     3
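As a possible variation on the approach above, named aggregation (available in pandas 0.25 and later) produces flat, already-named columns, so the MultiIndex flattening and the rename step are not needed. This is only a sketch, not the answer's own code: it uses the tank 2 rows from the question, and the intermediate column names (qty_first, qty_last, t_first, t_last) are made up for illustration.

```python
import pandas as pd

# Tank 2 rows copied from the question's sample data.
df = pd.DataFrame({
    "code": [123] * 7,
    "date": ["2018-12-23"] * 7,
    "time": ["08:00:00", "09:00:00", "10:00:00", "11:00:00",
             "12:00:00", "13:00:00", "14:00:00"],
    "dip": [389, 380, 378, 345, 342, 340, 338],
    "flag": [0, 0, 0, 1, 1, 1, 1],
    "tank": [2] * 7,
    "qty": [1300, 1250, 1200, 1150, 1100, 1050, 1000],
})
df["datetime"] = pd.to_datetime(df["date"] + " " + df["time"])

# Named aggregation: each keyword becomes an output column.
# The column names below are illustrative, not from the original answer.
out = (
    df[df["dip"] < 350]
    .groupby(["code", "date", "tank"])
    .agg(
        frequency=("flag", "sum"),
        qty_first=("qty", "first"),
        qty_last=("qty", "last"),
        t_first=("datetime", "first"),
        t_last=("datetime", "last"),
    )
    .reset_index()
)

# Quantity sold and elapsed whole hours between first and last low reading.
out["qty_sold"] = out["qty_first"] - out["qty_last"]
elapsed = out["t_last"] - out["t_first"]
out["time"] = (elapsed.dt.total_seconds() // 3600).astype(int)

out = out[["code", "date", "tank", "frequency", "qty_sold", "time"]]
print(out)
```

For these seven rows the result is a single row with frequency 4, qty_sold 150 and time 3, matching the desired output for tank 2.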
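EDIT:
The solution above works only if there are no missing values in date or time and consecutive datetime values are exactly one hour apart.
If the difference between consecutive rows can be more than 1 hour, the idea is to create a new helper column g that labels each run of consecutive hourly rows, aggregate per run, and finally sum over the first 3 index levels: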
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])
df_agg = df[df['dip'] < 350].copy()
# start a new run whenever the gap to the previous row is not exactly one hour
df_agg['g'] = (df_agg.groupby(['code','date','tank'])['datetime'].diff()
                     .ne(pd.Timedelta(hours=1))
                     .cumsum())
df_agg = df_agg.groupby(['code','date','tank','g']).agg({'flag':['sum'],
                                                         'datetime':['first','last'],
                                                         'qty':['first','last']})
df_agg.columns = df_agg.columns.map('_'.join)
df_agg['qty_sold'] = df_agg.pop('qty_first') - df_agg.pop('qty_last')
df_agg['time'] = ((df_agg.pop('datetime_last') - df_agg.pop('datetime_first'))
                  .dt.total_seconds().div(3600).astype(int))
# sum the per-run results over code, date and tank
# (DataFrame.sum(level=...) was removed in pandas 2.0, so group by level instead)
df_agg = (df_agg.rename(columns={'flag_sum':'frequency'})
                .groupby(level=[0,1,2]).sum()
                .reset_index()
         )
print(df_agg)
   code        date  tank  frequency  qty_sold  time
0   123  2018-12-23     1          6       150     4
1   123  2018-12-23     2          4       150     3
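To see what the helper column g does, here is a small sketch that isolates the diff / ne / cumsum step on the six tank 1 timestamps where dip is below 350 in the question's data. The variable names (s, g, hours) are illustrative only.

```python
import pandas as pd

# Timestamps of the tank 1 rows with dip < 350 in the question's data.
s = pd.Series(pd.to_datetime([
    "2018-12-23 11:00:00", "2018-12-23 12:00:00", "2018-12-23 13:00:00",
    "2018-12-23 14:00:00", "2018-12-23 16:00:00", "2018-12-23 17:00:00",
]), name="datetime")

# diff() is NaT on the first row and two hours across the 14:00 -> 16:00 gap.
# Both compare unequal to one hour, so each of those rows starts a new run,
# and cumsum() turns the True flags into run labels.
g = s.diff().ne(pd.Timedelta(hours=1)).cumsum().rename("g")
print(g.tolist())

# Hours spanned by each run, and the total across runs.
hours = (s.groupby(g).max() - s.groupby(g).min()).dt.total_seconds() / 3600
print(hours.tolist(), hours.sum())
```

The first four rows share one label and the last two share another, so the elapsed time is measured inside each run (3 hours and 1 hour) instead of across the gap.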
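You can do:
# to get till what time (hour);
# time is parsed first, assuming it is stored as 'HH:MM:SS' strings
pd.to_datetime(df.loc[df['dip'].lt(350),'time'], format='%H:%M:%S').dt.hour.max()
# what is the quantity sold
df.loc[df['dip'].lt(350),'qty'].sum()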
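The two expressions above look at the whole frame at once. If the same figures are wanted per tank, one possible sketch is to filter first and then group. It assumes time is stored as 'HH:MM:SS' strings, uses a handful of rows taken from the question's data, and the output column names (last_hour, qty_total) are made up for illustration.

```python
import pandas as pd

# A few rows from the question's data: 10:00-12:00 for tank 1, 12:00-14:00 for tank 2.
df = pd.DataFrame({
    "tank": [1, 1, 1, 2, 2, 2],
    "time": ["10:00:00", "11:00:00", "12:00:00",
             "12:00:00", "13:00:00", "14:00:00"],
    "dip": [378, 345, 342, 342, 340, 338],
    "qty": [1200, 1150, 1100, 1100, 1050, 1000],
})

# Keep only the low readings and extract the hour from the time strings.
below = df[df["dip"].lt(350)].assign(
    hour=lambda d: pd.to_datetime(d["time"], format="%H:%M:%S").dt.hour
)

# Latest hour with a low reading and the sum of qty over those rows, per tank.
per_tank = below.groupby("tank").agg(
    last_hour=("hour", "max"),
    qty_total=("qty", "sum"),
)
print(per_tank)
```

Note that qty_total here is the sum of the qty readings, which is what the expression above computes; it is not the first-minus-last difference used for qty_sold in the other answer.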