
How to fill last non-null value for each user in pandas?

I have a df with user journeys that show purchase amounts of products. Now, I want to fill the last non-null value for each user, since users do not buy every day. Currently, I have:

date       | user_id | purchase_value
2020-01-01 | 1       | null
2020-01-02 | 1       | 1
2020-01-03 | 1       | null
2020-01-04 | 1       | 4
2020-01-01 | 2       | 55
2020-01-02 | 2       | null

I want it to look like this:

date       | user_id | purchase_value
2020-01-01 | 1       | null
2020-01-02 | 1       | 1
2020-01-03 | 1       | 1
2020-01-04 | 1       | 4
2020-01-01 | 2       | 55
2020-01-02 | 2       | 55

Explanation: For user 1, we fill 1 on 2020-01-03, since 1 was the last non-null value (from 2020-01-02). For user 2, we fill 55 on 2020-01-02, since 55 was the last non-null value (from 2020-01-01).

How would I do this in pandas for each user_id and date? Also, the dates do not have to be sequential, i.e. there can be gaps in the dates; in that case, always fill in the last non-null value (whenever that was).

If you really want to ffill only the last NaN per group, you need to identify it, then replace it with its per-group ffill :

# is the value NaN?
m1 = df['purchase_value'].isna()

# is this the last NaN of the group?
# here: is this the first NaN of the group in reverse order?
m2 = m1[::-1].groupby(df['user_id']).cumsum().eq(1)

# then replace with the per-group ffill
df.loc[m1 & m2, 'purchase_value'] = df.groupby('user_id')['purchase_value'].ffill()

Output:

         date  user_id  purchase_value
0  2020-01-01        1             NaN
1  2020-01-02        1             1.0
2  2020-01-03        1             1.0
3  2020-01-04        1             4.0
4  2020-01-01        2            55.0
5  2020-01-02        2            55.0
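The mask trick above matters only if earlier NaNs in a group must be left untouched. If filling every NaN per user is acceptable (which the question's "always fill in the last non-null value" suggests), a plain per-group forward fill is a simpler sketch, shown here on the question's example data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
             '2020-01-01', '2020-01-02'],
    'user_id': [1, 1, 1, 1, 2, 2],
    'purchase_value': [np.nan, 1, np.nan, 4, 55, np.nan],
})

# Forward-fill every NaN within each user's rows; a leading NaN
# (no earlier value in that group) stays NaN
df['purchase_value'] = df.groupby('user_id')['purchase_value'].ffill()
```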

Another possible solution:

import numpy as np

# count NaNs cumulatively per user; the last NaN of a group is the row
# where this counter reaches its per-group maximum
df['aux'] = (
    df.assign(aux=pd.isna(df.purchase_value))
    .groupby('user_id')['aux'].cumsum())

df = (df.assign(
          purchase_value=np.where(
              pd.isna(df.purchase_value)
              & (df.aux == df.groupby('user_id')['aux'].transform('max')),
              # per-group ffill rather than shift(1), which could leak a
              # value from the previous user or propagate another NaN
              df.groupby('user_id')['purchase_value'].ffill(),
              df.purchase_value))
      .drop('aux', axis=1))

Output:

         date  user_id  purchase_value
0  2020-01-01        1             NaN
1  2020-01-02        1             1.0
2  2020-01-03        1             1.0
3  2020-01-04        1             4.0
4  2020-01-01        2            55.0
5  2020-01-02        2            55.0
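One caveat for the "dates do not have to be sequential" case: both answers rely on rows already being in chronological order within each user, since ffill propagates by row position. A small sketch with made-up, out-of-order data, sorting before the fill:

```python
import pandas as pd
import numpy as np

# Hypothetical unordered rows for one user
df = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-03', '2020-01-01', '2020-01-02']),
    'user_id': [1, 1, 1],
    'purchase_value': [np.nan, np.nan, 7.0],
})

# Sort so the forward fill propagates chronologically within each user
df = df.sort_values(['user_id', 'date']).reset_index(drop=True)
df['purchase_value'] = df.groupby('user_id')['purchase_value'].ffill()
```

Without the sort, the NaN dated 2020-01-03 would sit in the first row and never receive the 7.0 recorded on 2020-01-02.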
