简体   繁体   English

分组并减去熊猫中的第一次出现和最后一次出现

[英]group by and subtract first occurrence and last occurrence in pandas

I have following dataframe in pandas我在熊猫中有以下数据框

 code    date         time         dip     flag   tank   qty
 123     2018-12-23   08:00:00     389     0      1      1300
 123     2018-12-23   09:00:00     380     0      1      1250
 123     2018-12-23   10:00:00     378     0      1      1200
 123     2018-12-23   11:00:00     345     1      1      1150
 123     2018-12-23   12:00:00     342     1      1      1100
 123     2018-12-23   13:00:00     340     1      1      1050
 123     2018-12-23   14:00:00     338     1      1      1000
 123     2018-12-23   15:00:00     380     0      1      1500
 123     2018-12-23   16:00:00     340     1      1      1000
 123     2018-12-23   17:00:00     340     1      1      1000
 123     2018-12-23   08:00:00     389     0      2      1300
 123     2018-12-23   09:00:00     380     0      2      1250
 123     2018-12-23   10:00:00     378     0      2      1200
 123     2018-12-23   11:00:00     345     1      2      1150
 123     2018-12-23   12:00:00     342     1      2      1100
 123     2018-12-23   13:00:00     340     1      2      1050
 123     2018-12-23   14:00:00     338     1      2      1000

I want to find how many times dip is below 350, till what time(in hours) it remained below 350 and what is the quantity sold when below 350 Below is my desired dataframe.我想找出dip低于 350 的次数,直到什么时间(以小时为单位)保持低于 350,以及低于 350 时的销售数量是我想要的数据帧。 I have already set the flag as 1 when there is a dip less than 350当下降小于 350 时,我已经将标志设置为 1

 code    date        tank     frequency    qty_sold    time
 123     2018-12-23  1        4            150         3
 123     2018-12-23  2        4            150         3

I am able to find the frequency with groupby.我可以通过 groupby 找到频率。 need some help in finding other two需要一些帮助才能找到另外两个

  df_agg= df.groupby(['code','date','tank']).agg({'flag':['sum']}).reset_index()

Use:用:

#create datetimes column
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])

#add aggregation by first and last 
df_agg= df[df['dip'] < 350].groupby(['code','date','tank']).agg({'flag':['sum'], 
                                                                'datetime':['first','last'],
                                                                'qty':['first','last']})
#flatten MultiIndex
df_agg.columns = df_agg.columns.map('_'.join)

#substract columns, timedeltas convert to hours
df_agg['qty_sold'] = df_agg.pop('qty_first') - df_agg.pop('qty_last') 
df_agg['time'] = (df_agg.pop('datetime_last') - df_agg.pop('datetime_first'))
                       .dt.total_seconds().div(3600).astype(int)
#rename column and create default index
df_agg = df_agg.rename(columns={'flag_size':'frequency'}).reset_index()

print (df_agg)
   code        date  tank  flag_sum  qty_sold  time
0   123  2018-12-23     1         4       150     3
1   123  2018-12-23     2         4       150     3

EDIT:编辑:

Solution working if no missing values in date or time values and frequency of datetimes is one hour difference.如果datetime值中没有缺失值且date time频率相差一小时,则解决方案有效。

Idea is create new helper column g for groups if difference is more like 1 hour and last aggregate sum per first 3 levels:如果差异更像是1小时和前 3 个级别的最后总和,则想法是为组创建新的辅助列g

df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])

df_agg= df[df['dip'] < 350].copy()

df_agg['g'] = (df_agg.groupby(['code','date','tank'])['datetime'].diff()
                     .ne(pd.Timedelta(1, 'H'))
                     .cumsum())

df_agg= df_agg.groupby(['code','date','tank','g']).agg({'flag':['sum'], 
                                                        'datetime':['first','last'],
                                                        'qty':['first','last']})
df_agg.columns = df_agg.columns.map('_'.join)
df_agg['qty_sold'] = df_agg.pop('qty_first') - df_agg.pop('qty_last') 
df_agg['time'] = ((df_agg.pop('datetime_last') - df_agg.pop('datetime_first'))
                         .dt.total_seconds().div(3600).astype(int))

df_agg = (df_agg.rename(columns={'flag_size':'frequency'})
                .sum(level=[0,1,2])
                .reset_index()
          )

print (df_agg)
   code        date  tank  flag_sum  qty_sold  time
0   123  2018-12-23     1         6       150     4
1   123  2018-12-23     2         4       150     3

You can do:你可以做:

# to get till what time (hour)
df.loc[df['dip'].lt(350),'time'].dt.hour.max()

# what is the quantity sold
df.loc[df['dip'].lt(350),'qty'].sum()

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM