
pandas groupby and replace rows values based on a condition

I have this example df:

import numpy as np
import pandas as pd

df = pd.DataFrame({'customer id': [1, 1, 1],
                   'Date': ['2022-09-05 08:38:37.000', '2022-09-06 08:38:37.000', '2022-09-07 08:38:37.000'],
                   'country': ['US', 'US', 'US'],
                   'step1_check': ['step1', np.nan, np.nan],
                   'step2_check': [np.nan, 'step2', np.nan],
                   'step3_check': [np.nan, np.nan, 'step3']})

It is similar to a log of each step with its date and time. I want to group by customer to get one row per customer and replace each step(n)_check with its timestamp.

I was able to achieve this with a classical (inefficient) solution.

The example df has 3 step(n)_check columns, so I first created a date column to track the timestamp of each one:

df['step1_date'] = np.nan
df['step2_date'] = np.nan
df['step3_date'] = np.nan

Then I used an np.where condition to fill in each step date wherever the step check is not null:

df['step1_date'] = np.where(df['step1_check'].notna(), df['Date'], np.nan )
df['step2_date'] = np.where(df['step2_check'].notna(), df['Date'], np.nan )
df['step3_date'] = np.where(df['step3_check'].notna(), df['Date'], np.nan )

Finally, I grouped by customer id to get one row per customer with the date of each step:

df.groupby(['customer id','country']).agg({'step1_date':'first','step2_date':'first','step3_date':'first'}).reset_index()

Output:

   customer id country               step1_date               step2_date               step3_date
0            1      US  2022-09-05 08:38:37.000  2022-09-06 08:38:37.000  2022-09-07 08:38:37.000

What is the best approach to automate this for many more steps? Writing a separate np.where condition for each column would be inefficient.
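The repeated np.where lines all follow one pattern, so a possible first step, before changing the approach altogether, is to loop over the step columns. This is only a minimal sketch: it assumes every step column name ends in `_check`, and the names `step_cols`, `date_cols` and `result` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'customer id': [1, 1, 1],
                   'Date': ['2022-09-05 08:38:37.000', '2022-09-06 08:38:37.000',
                            '2022-09-07 08:38:37.000'],
                   'country': ['US', 'US', 'US'],
                   'step1_check': ['step1', np.nan, np.nan],
                   'step2_check': [np.nan, 'step2', np.nan],
                   'step3_check': [np.nan, np.nan, 'step3']})

# Assumption: every step column is named "<step>_check".
step_cols = [c for c in df.columns if c.endswith('_check')]

# Series.where keeps Date where the step was logged and leaves NaN elsewhere,
# which has the same effect as the np.where calls in the question.
for col in step_cols:
    df[col.replace('_check', '_date')] = df['Date'].where(df[col].notna())

# 'first' returns the first non-null value per group, as in the original agg.
date_cols = [c.replace('_check', '_date') for c in step_cols]
result = df.groupby(['customer id', 'country'], as_index=False)[date_cols].first()
print(result)
```

Adding a step4_check column to the frame would then need no further code changes, as long as the naming assumption holds.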

Filter the step columns, forward fill them along axis=1, and assign the result back to the dataframe. After the fill, the last step column holds the step name for every row, so pivot the dataframe on that column and finally add a suffix to the column names.

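# Forward fill across the step columns so the last one holds each row's step name
steps = df.filter(like='step').ffill(axis=1)
df[steps.columns] = steps
# Keyword arguments are used because pandas 2.0+ no longer accepts them positionally
df.pivot(index='customer id', columns=steps.columns[-1], values='Date').add_suffix('_date')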

Output:

step3_check               step1_date               step2_date               step3_date
customer id
1            2022-09-05 08:38:37.000  2022-09-06 08:38:37.000  2022-09-07 08:38:37.000
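The pivot above drops the country column, and `pivot` raises an error if a customer logs the same step twice. A different route, not the one used in this answer, is to reshape with `melt` and then use `pivot_table` with `aggfunc='first'`. The following is a sketch under assumptions: step columns end in `_check`, and the second customer in the sample data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample log; customer 2 is hypothetical and has not reached step 3.
df = pd.DataFrame({
    'customer id': [1, 1, 1, 2, 2],
    'Date': ['2022-09-05 08:38:37.000', '2022-09-06 08:38:37.000',
             '2022-09-07 08:38:37.000', '2022-09-08 09:00:00.000',
             '2022-09-09 09:00:00.000'],
    'country': ['US', 'US', 'US', 'DE', 'DE'],
    'step1_check': ['step1', np.nan, np.nan, 'step1', np.nan],
    'step2_check': [np.nan, 'step2', np.nan, np.nan, 'step2'],
    'step3_check': [np.nan, np.nan, 'step3', np.nan, np.nan],
})

# Assumption: every step column is named "<step>_check".
step_cols = [c for c in df.columns if c.endswith('_check')]
date_cols = [c.replace('_check', '_date') for c in step_cols]

# Long format: one row per (log row, step column); keep only logged steps.
long = df.melt(id_vars=['customer id', 'country', 'Date'],
               value_vars=step_cols, var_name='step', value_name='flag')
long = long.dropna(subset=['flag'])
long['step'] = long['step'].str.replace('_check', '_date', regex=False)

# Spread the steps back into columns, taking the first date per customer and step.
# reindex restores the original step order and keeps steps nobody has reached.
result = (long.pivot_table(index=['customer id', 'country'], columns='step',
                           values='Date', aggfunc='first')
              .reindex(columns=date_cols)
              .rename_axis(columns=None)
              .reset_index())
print(result)
```

Steps a customer has not reached stay NaN in the result, and `aggfunc='first'` keeps the earliest logged row when a step appears more than once, provided the log is already sorted by date.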
