pandas groupby and replace rows values based on a condition
I have this example df:
import numpy as np
import pandas as pd

df = pd.DataFrame({'customer id': [1, 1, 1],
                   'Date': ['2022-09-05 08:38:37.000', '2022-09-06 08:38:37.000', '2022-09-07 08:38:37.000'],
                   'country': ['US', 'US', 'US'],
                   'step1_check': ['step1', np.nan, np.nan],
                   'step2_check': [np.nan, 'step2', np.nan],
                   'step3_check': [np.nan, np.nan, 'step3']})
It is similar to a log of each step with its date and time. I want to group by customer to get one row per customer and replace each step(n)_check with the timestamp.
I was able to achieve that with a classical (inefficient) solution.
In the example df, there are 3 step(n)_check columns, so I created one date column per step to track the timestamp:
df['step1_date'] = np.nan
df['step2_date'] = np.nan
df['step3_date'] = np.nan
Then I used an np.where condition to fill in the step date wherever the check is not null:
df['step1_date'] = np.where(df['step1_check'].notna(), df['Date'], np.nan )
df['step2_date'] = np.where(df['step2_check'].notna(), df['Date'], np.nan )
df['step3_date'] = np.where(df['step3_check'].notna(), df['Date'], np.nan )
Finally, I grouped by customer id to get one row for each customer with the steps and their dates:
df.groupby(['customer id','country']).agg({'step1_date':'first','step2_date':'first','step3_date':'first'}).reset_index()
Output:
   customer id country               step1_date               step2_date               step3_date
0            1      US  2022-09-05 08:38:37.000  2022-09-06 08:38:37.000  2022-09-07 08:38:37.000
What is the best approach to automate this for many more steps? Writing a separate np.where condition for each column would be inefficient.
Select the step columns, forward fill them along axis=1, and assign the result back to the dataframe. After the forward fill, the last step column holds the step label for every row. Then pivot the dataframe on that column and add a suffix to the column names.
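# Forward fill across the step columns so the last one holds each row's step label
steps = df.filter(like='step').ffill(axis=1)
df[steps.columns] = steps
# DataFrame.pivot only accepts keyword arguments in pandas 2.0 and later
df.pivot(index='customer id', columns=steps.columns[-1], values='Date').add_suffix('_date')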
Output:
step3_check               step1_date               step2_date               step3_date
customer id
1            2022-09-05 08:38:37.000  2022-09-06 08:38:37.000  2022-09-07 08:38:37.000
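As an alternative sketch (not part of the answer above), the same result can probably be reached by reshaping to long format with melt, dropping the empty checks, and pivoting back. This version does not rely on the last step column after a forward fill, and it keeps country in the result. It assumes the step columns all end in '_check' and that the first timestamp per customer and step is the one wanted; the names step_cols, long_df and result are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'customer id': [1, 1, 1],
    'Date': ['2022-09-05 08:38:37.000', '2022-09-06 08:38:37.000', '2022-09-07 08:38:37.000'],
    'country': ['US', 'US', 'US'],
    'step1_check': ['step1', np.nan, np.nan],
    'step2_check': [np.nan, 'step2', np.nan],
    'step3_check': [np.nan, np.nan, 'step3'],
})

# Assumption: every step column is named '<step>_check'
step_cols = [c for c in df.columns if c.endswith('_check')]

# One row per (log line, step column); keep only the steps that were actually logged
long_df = (
    df.melt(id_vars=['customer id', 'country', 'Date'],
            value_vars=step_cols, var_name='check_col', value_name='step')
      .dropna(subset=['step'])
)

# One row per customer, one column per step, holding the first timestamp seen
result = (
    long_df.pivot_table(index=['customer id', 'country'], columns='step',
                        values='Date', aggfunc='first')
           .add_suffix('_date')
           .rename_axis(columns=None)
           .reset_index()
)
print(result)
```

Because step_cols is built from the column names, adding more step(n)_check columns should not require any change to the code.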