group by and subtract first occurrence and last occurrence in pandas
I have the following dataframe in pandas:
code date time dip flag tank qty
123 2018-12-23 08:00:00 389 0 1 1300
123 2018-12-23 09:00:00 380 0 1 1250
123 2018-12-23 10:00:00 378 0 1 1200
123 2018-12-23 11:00:00 345 1 1 1150
123 2018-12-23 12:00:00 342 1 1 1100
123 2018-12-23 13:00:00 340 1 1 1050
123 2018-12-23 14:00:00 338 1 1 1000
123 2018-12-23 15:00:00 380 0 1 1500
123 2018-12-23 16:00:00 340 1 1 1000
123 2018-12-23 17:00:00 340 1 1 1000
123 2018-12-23 08:00:00 389 0 2 1300
123 2018-12-23 09:00:00 380 0 2 1250
123 2018-12-23 10:00:00 378 0 2 1200
123 2018-12-23 11:00:00 345 1 2 1150
123 2018-12-23 12:00:00 342 1 2 1100
123 2018-12-23 13:00:00 340 1 2 1050
123 2018-12-23 14:00:00 338 1 2 1000
I want to find how many times dip went below 350, for how long (in hours) it stayed below 350, and what quantity was sold while it was below 350. Below is my desired dataframe. I have already set flag to 1 wherever dip is less than 350.
code date tank frequency qty_sold time
123 2018-12-23 1 4 150 3
123 2018-12-23 2 4 150 3
I am able to find the frequency with groupby, but need some help finding the other two:
df_agg= df.groupby(['code','date','tank']).agg({'flag':['sum']}).reset_index()
Use:
#create datetimes column
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])
#add aggregation by first and last
df_agg= df[df['dip'] < 350].groupby(['code','date','tank']).agg({'flag':['sum'],
'datetime':['first','last'],
'qty':['first','last']})
#flatten MultiIndex
df_agg.columns = df_agg.columns.map('_'.join)
#subtract columns, convert timedeltas to hours
df_agg['qty_sold'] = df_agg.pop('qty_first') - df_agg.pop('qty_last')
df_agg['time'] = ((df_agg.pop('datetime_last') - df_agg.pop('datetime_first'))
                  .dt.total_seconds().div(3600).astype(int))
#rename column and create default index
df_agg = df_agg.rename(columns={'flag_sum':'frequency'}).reset_index()
print (df_agg)
code date tank frequency qty_sold time
0 123 2018-12-23 1 4 150 3
1 123 2018-12-23 2 4 150 3
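The column-flattening step works because, after `agg` with a dict of lists, the result columns are a MultiIndex of `(column, function)` tuples; `columns.map('_'.join)` joins each tuple into a flat name. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1, 2, 3]})

# aggregating with a dict of lists yields MultiIndex columns
agg = df.groupby('g').agg({'x': ['first', 'last']})
print(agg.columns.tolist())   # [('x', 'first'), ('x', 'last')]

# join each (column, function) tuple into one flat name
agg.columns = agg.columns.map('_'.join)
print(agg.columns.tolist())   # ['x_first', 'x_last']
```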
EDIT:
This solution works only if there are no missing date or time values and consecutive datetimes are exactly one hour apart.

The idea is to create a new helper column g that starts a new group whenever the difference between consecutive datetimes is more than 1 hour, aggregate per group, and finally sum over the first 3 index levels:
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])
df_agg= df[df['dip'] < 350].copy()
df_agg['g'] = (df_agg.groupby(['code','date','tank'])['datetime'].diff()
.ne(pd.Timedelta(1, 'h'))
.cumsum())
df_agg= df_agg.groupby(['code','date','tank','g']).agg({'flag':['sum'],
'datetime':['first','last'],
'qty':['first','last']})
df_agg.columns = df_agg.columns.map('_'.join)
df_agg['qty_sold'] = df_agg.pop('qty_first') - df_agg.pop('qty_last')
df_agg['time'] = ((df_agg.pop('datetime_last') - df_agg.pop('datetime_first'))
.dt.total_seconds().div(3600).astype(int))
df_agg = (df_agg.rename(columns={'flag_sum':'frequency'})
          .groupby(level=[0,1,2]).sum()
          .reset_index()
          )
print (df_agg)
code date tank frequency qty_sold time
0 123 2018-12-23 1 6 150 4
1 123 2018-12-23 2 4 150 3
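For reference, the grouping trick above can be reproduced end-to-end on the tank-1 rows from the question (a self-contained sketch of the same steps, same column names):

```python
import pandas as pd

# tank-1 rows from the question
df = pd.DataFrame({
    'code': 123,
    'date': '2018-12-23',
    'time': ['08:00:00', '09:00:00', '10:00:00', '11:00:00', '12:00:00',
             '13:00:00', '14:00:00', '15:00:00', '16:00:00', '17:00:00'],
    'dip':  [389, 380, 378, 345, 342, 340, 338, 380, 340, 340],
    'flag': [0, 0, 0, 1, 1, 1, 1, 0, 1, 1],
    'tank': 1,
    'qty':  [1300, 1250, 1200, 1150, 1100, 1050, 1000, 1500, 1000, 1000],
})

df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])
sub = df[df['dip'] < 350].copy()

# start a new group whenever the gap to the previous reading is not 1 hour
sub['g'] = (sub.groupby(['code', 'date', 'tank'])['datetime'].diff()
               .ne(pd.Timedelta(1, 'h'))
               .cumsum())

agg = sub.groupby(['code', 'date', 'tank', 'g']).agg(
    {'flag': ['sum'], 'datetime': ['first', 'last'], 'qty': ['first', 'last']})
agg.columns = agg.columns.map('_'.join)
agg['qty_sold'] = agg.pop('qty_first') - agg.pop('qty_last')
agg['time'] = ((agg.pop('datetime_last') - agg.pop('datetime_first'))
               .dt.total_seconds().div(3600).astype(int))
out = (agg.rename(columns={'flag_sum': 'frequency'})
          .groupby(level=[0, 1, 2]).sum()
          .reset_index())
print(out)   # frequency 6, qty_sold 150, time 4 for tank 1
```

The two sub-350 runs (11:00-14:00 and 16:00-17:00) get g values 1 and 2, are aggregated separately, and their per-run results are then summed per tank.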
You can do:
# to get till what time (hour); time must be parsed to datetime first
pd.to_datetime(df.loc[df['dip'].lt(350),'time']).dt.hour.max()
# what is the quantity sold
df.loc[df['dip'].lt(350),'qty'].sum()
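Note that `time` in the sample data is a plain string, so it has to go through `pd.to_datetime` before `.dt.hour` is available; a minimal sketch with made-up readings:

```python
import pandas as pd

df = pd.DataFrame({
    'dip':  [389, 345, 342, 340, 380],
    'time': ['08:00:00', '09:00:00', '10:00:00', '11:00:00', '12:00:00'],
    'qty':  [1300, 1150, 1100, 1050, 1500],
})

below = df['dip'].lt(350)

# latest hour at which dip was below 350
last_hour = pd.to_datetime(df.loc[below, 'time']).dt.hour.max()

# total qty recorded while dip was below 350
qty_below = df.loc[below, 'qty'].sum()

print(last_hour, qty_below)   # 11 3300
```

This gives the last hour and a simple sum of the readings; it does not handle separate sub-350 runs the way the grouped solution above does.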