pandas groupby and replace row values based on a condition

Question

I have this example df:

import numpy as np
import pandas as pd

df = pd.DataFrame({'customer id': [1, 1, 1],
                   'Date': ['2022-09-05 08:38:37.000', '2022-09-06 08:38:37.000', '2022-09-07 08:38:37.000'],
                   'country': ['US', 'US', 'US'],
                   'step1_check': ['step1', np.nan, np.nan],
                   'step2_check': [np.nan, 'step2', np.nan],
                   'step3_check': [np.nan, np.nan, 'step3']})

It is similar to a log of each step with its date and time. I want to group by customer to get one row per customer and replace each step(n)_check with the time stamp of that step.

I was able to achieve that with a classical (inefficient) solution:

In the example df there are 3 step(n)_check columns, so I first created one time stamp column per step:

df['step1_date'] = np.nan
df['step2_date'] = np.nan
df['step3_date'] = np.nan

Then I used an np.where condition to fill in the step date wherever the check is not null:

df['step1_date'] = np.where(df['step1_check'].notna(), df['Date'], np.nan)
df['step2_date'] = np.where(df['step2_check'].notna(), df['Date'], np.nan)
df['step3_date'] = np.where(df['step3_check'].notna(), df['Date'], np.nan)

Finally, I grouped by customer id to get one row per customer with the date of each step:

df.groupby(['customer id','country']).agg({'step1_date':'first','step2_date':'first','step3_date':'first'}).reset_index()

Output:

   customer id country               step1_date               step2_date               step3_date
0            1      US  2022-09-05 08:38:37.000  2022-09-06 08:38:37.000  2022-09-07 08:38:37.000

What is the best approach to automate this for many more steps? Writing a separate np.where condition for each column would be inefficient.

Answer 1

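Filter the step columns, forward fill them along axis=1, and assign them back to the dataframe. After the fill, the last step column holds the name of the step recorded on each row (assuming each row records exactly one step), so it can be used as the pivot columns. Then pivot the dataframe and finally add a suffix to the column names.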

steps = df.filter(like='step').ffill(axis=1)
df[steps.columns] = steps
# pivot takes keyword arguments only in pandas 2.0 and later
df.pivot(index='customer id', columns=steps.columns[-1], values='Date').add_suffix('_date')

OUTPUT

step3_check               step1_date               step2_date               step3_date
customer id
1            2022-09-05 08:38:37.000  2022-09-06 08:38:37.000  2022-09-07 08:38:37.000
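
If the country column should survive into the result, as in the question's expected output, one possible alternative is to generalise the np.where lines from the question with a dict comprehension. This is only a sketch: it assumes every step column ends in `_check`, and the names `check_cols`, `step_dates` and `result` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'customer id': [1, 1, 1],
    'Date': ['2022-09-05 08:38:37.000', '2022-09-06 08:38:37.000', '2022-09-07 08:38:37.000'],
    'country': ['US', 'US', 'US'],
    'step1_check': ['step1', np.nan, np.nan],
    'step2_check': [np.nan, 'step2', np.nan],
    'step3_check': [np.nan, np.nan, 'step3'],
})

# Assumption: every step column is named "<something>_check".
check_cols = [c for c in df.columns if c.endswith('_check')]

# For each check column, keep Date where the check is set and NaN elsewhere.
step_dates = pd.DataFrame({
    c.replace('_check', '_date'): df['Date'].where(df[c].notna())
    for c in check_cols
})

# groupby(...).first() takes the first non-null value per group and column.
result = (
    pd.concat([df[['customer id', 'country']], step_dates], axis=1)
      .groupby(['customer id', 'country'], as_index=False)
      .first()
)
print(result)
```

This works for any number of steps without listing them by hand. If a customer hits the same step more than once, `first()` keeps the earliest row in the original order, so sort by Date beforehand if that matters.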


Answer 1: 4, accepted, 2022-09-24 01:49:27
