group by and subtract first occurrence and last occurrence in pandas
I have the following dataframe in pandas:
code date time dip flag tank qty
123 2018-12-23 08:00:00 389 0 1 1300
123 2018-12-23 09:00:00 380 0 1 1250
123 2018-12-23 10:00:00 378 0 1 1200
123 2018-12-23 11:00:00 345 1 1 1150
123 2018-12-23 12:00:00 342 1 1 1100
123 2018-12-23 13:00:00 340 1 1 1050
123 2018-12-23 14:00:00 338 1 1 1000
123 2018-12-23 15:00:00 380 0 1 1500
123 2018-12-23 16:00:00 340 1 1 1000
123 2018-12-23 17:00:00 340 1 1 1000
123 2018-12-23 08:00:00 389 0 2 1300
123 2018-12-23 09:00:00 380 0 2 1250
123 2018-12-23 10:00:00 378 0 2 1200
123 2018-12-23 11:00:00 345 1 2 1150
123 2018-12-23 12:00:00 342 1 2 1100
123 2018-12-23 13:00:00 340 1 2 1050
123 2018-12-23 14:00:00 338 1 2 1000
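I want to find how many times dip is below 350, for how long (in hours) it stays below 350, and the quantity sold while it is below 350. I have already set flag to 1 wherever dip is less than 350. This is my desired dataframe: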
code date tank frequency qty_sold time
123 2018-12-23 1 4 150 3
123 2018-12-23 2 4 150 3
I can get the frequency with groupby. I need some help finding the other two:
df_agg= df.groupby(['code','date','tank']).agg({'flag':['sum']}).reset_index()
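For anyone who wants to reproduce the frequency step, here is a minimal sketch. It assumes date and time are stored as strings, rebuilds only the tank 2 rows from the table above, and derives flag from dip; the variable name freq is only illustrative.

```python
import pandas as pd

# Tank 2 rows from the table above (08:00 to 14:00).
df = pd.DataFrame({
    "code": 123,
    "date": "2018-12-23",
    "time": [f"{h:02d}:00:00" for h in range(8, 15)],
    "dip": [389, 380, 378, 345, 342, 340, 338],
    "tank": 2,
    "qty": [1300, 1250, 1200, 1150, 1100, 1050, 1000],
})

# flag is 1 wherever dip is below the threshold of 350.
df["flag"] = df["dip"].lt(350).astype(int)

# Summing the flag per group counts the below-threshold readings.
freq = (df.groupby(["code", "date", "tank"], as_index=False)["flag"]
          .sum()
          .rename(columns={"flag": "frequency"}))
print(freq)
```

For these rows the count is 4, which matches the frequency in the desired dataframe.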
Use:
import pandas as pd

# create a datetime column from the date and time strings
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])
# keep the rows below the threshold, then aggregate by first and last
df_agg = df[df['dip'] < 350].groupby(['code','date','tank']).agg({'flag':['sum'],
                                                                    'datetime':['first','last'],
                                                                    'qty':['first','last']})
# flatten the MultiIndex columns, e.g. ('qty', 'first') -> 'qty_first'
df_agg.columns = df_agg.columns.map('_'.join)
# subtract the columns; convert the timedeltas to whole hours
df_agg['qty_sold'] = df_agg.pop('qty_first') - df_agg.pop('qty_last')
df_agg['time'] = ((df_agg.pop('datetime_last') - df_agg.pop('datetime_first'))
                    .dt.total_seconds().div(3600).astype(int))
# rename the column and create a default index
df_agg = df_agg.rename(columns={'flag_sum':'frequency'}).reset_index()
print (df_agg)
   code        date  tank  frequency  qty_sold  time
0   123  2018-12-23     1          4       150     3
1   123  2018-12-23     2          4       150     3
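The same first/last idea can also be written with named aggregation, which avoids the MultiIndex flattening step. This is only a sketch under assumptions: it uses the tank 2 rows from the question, and the output column names (start, end, qty_first, qty_last) are chosen here for illustration.

```python
import pandas as pd

# Tank 2 rows from the question (08:00 to 14:00).
df = pd.DataFrame({
    "code": 123,
    "date": "2018-12-23",
    "time": [f"{h:02d}:00:00" for h in range(8, 15)],
    "dip": [389, 380, 378, 345, 342, 340, 338],
    "tank": 2,
    "qty": [1300, 1250, 1200, 1150, 1100, 1050, 1000],
})
df["datetime"] = pd.to_datetime(df["date"] + " " + df["time"])

# Keep only readings below the threshold, then aggregate per group.
below = df[df["dip"] < 350]
out = (below.groupby(["code", "date", "tank"])
            .agg(frequency=("dip", "size"),
                 qty_first=("qty", "first"),
                 qty_last=("qty", "last"),
                 start=("datetime", "first"),
                 end=("datetime", "last"))
            .reset_index())

# Quantity sold is first minus last; duration is last minus first, in hours.
out["qty_sold"] = out.pop("qty_first") - out.pop("qty_last")
hours = (out.pop("end") - out.pop("start")).dt.total_seconds().div(3600)
out["time"] = hours.astype(int)
print(out)
```

Counting rows with size on the filtered frame gives the same number as summing flag, as long as flag really is 1 exactly where dip is below 350.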
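Edit:
The solution above works only if date and time have no missing values and consecutive rows within a group are exactly one hour apart. If the rows below 350 are not all one hour apart, as with tank 1 where the 15:00 reading rises to 380, the idea is to create a helper column g that labels each run of consecutive hourly rows, aggregate per run, and finally sum over the first three index levels: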
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])
df_agg = df[df['dip'] < 350].copy()
# start a new run label whenever the gap to the previous row is not one hour
df_agg['g'] = (df_agg.groupby(['code','date','tank'])['datetime'].diff()
                     .ne(pd.Timedelta(hours=1))
                     .cumsum())
df_agg = df_agg.groupby(['code','date','tank','g']).agg({'flag':['sum'],
                                                         'datetime':['first','last'],
                                                         'qty':['first','last']})
df_agg.columns = df_agg.columns.map('_'.join)
df_agg['qty_sold'] = df_agg.pop('qty_first') - df_agg.pop('qty_last')
df_agg['time'] = ((df_agg.pop('datetime_last') - df_agg.pop('datetime_first'))
                    .dt.total_seconds().div(3600).astype(int))
# sum the runs per code, date and tank (index levels 0, 1, 2);
# sum(level=...) was removed in pandas 2.0, so group by level instead
df_agg = (df_agg.rename(columns={'flag_sum':'frequency'})
                .groupby(level=[0,1,2])
                .sum()
                .reset_index()
          )
print (df_agg)
   code        date  tank  frequency  qty_sold  time
0   123  2018-12-23     1          6       150     4
1   123  2018-12-23     2          4       150     3
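To see how the helper column g separates the runs, here is a small sketch on the tank 1 timestamps where dip is below 350. The variable names (times, new_run, span, hours) are illustrative and not part of the answer's code.

```python
import pandas as pd

# Tank 1 readings with dip below 350; the 15:00 reading is above 350,
# so there is a gap between 14:00 and 16:00.
times = pd.Series(pd.to_datetime([
    "2018-12-23 11:00", "2018-12-23 12:00", "2018-12-23 13:00",
    "2018-12-23 14:00", "2018-12-23 16:00", "2018-12-23 17:00",
]))

# True where the gap to the previous reading is not exactly one hour.
# The first row has a missing diff (NaT), which also counts as "not equal".
new_run = times.diff().ne(pd.Timedelta(hours=1))

# The cumulative sum of run starts gives one label per consecutive run.
g = new_run.cumsum()
print(g.tolist())  # [1, 1, 1, 1, 2, 2]

# Duration of each run in hours, then the total across runs.
span = times.groupby(g).agg(["first", "last"])
hours = (span["last"] - span["first"]).dt.total_seconds().div(3600)
print(hours.sum())  # 4.0
```

The two runs last 3 hours and 1 hour, which is where the value 4 in the time column for tank 1 comes from.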
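You can do:
# until what time (hour) dip stays below 350; .dt needs datetime values,
# so parse time first if it is stored as 'HH:MM:SS' strings
pd.to_datetime(df.loc[df['dip'].lt(350),'time'], format='%H:%M:%S').dt.hour.max()
# total of qty over the rows where dip is below 350
df.loc[df['dip'].lt(350),'qty'].sum()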
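These two expressions return one value for the whole frame. A per-tank variant could look like the sketch below. It assumes time is stored as 'HH:MM:SS' strings and uses a few rows from the question (15:00 to 17:00 for tank 1, 12:00 to 14:00 for tank 2); the column names hour, last_hour and qty_sum are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "tank": [1, 1, 1, 2, 2, 2],
    "time": ["15:00:00", "16:00:00", "17:00:00",
             "12:00:00", "13:00:00", "14:00:00"],
    "dip": [380, 340, 340, 342, 340, 338],
    "qty": [1500, 1000, 1000, 1100, 1050, 1000],
})

# Keep the readings below the threshold and extract the hour of day.
below = df[df["dip"].lt(350)].copy()
below["hour"] = pd.to_datetime(below["time"], format="%H:%M:%S").dt.hour

# Last hour below the threshold and the sum of qty, per tank.
per_tank = below.groupby("tank").agg(last_hour=("hour", "max"),
                                     qty_sum=("qty", "sum"))
print(per_tank)
```

Note that qty_sum adds up the qty readings; it is not the first-minus-last difference used for qty_sold in the other answer.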