How to fill last non-null value for each user in pandas?
I have a df with user journeys that show the purchase amounts of products. Now, I want to fill in the last non-null value for each user, since users do not buy every day. Currently, I have:
date | user_id | purchase_value
2020-01-01 | 1 | null
2020-01-02 | 1 | 1
2020-01-03 | 1 | null
2020-01-04 | 1 | 4
2020-01-01 | 2 | 55
2020-01-02 | 2 | null
I want it to look like this:
date | user_id | purchase_value
2020-01-01 | 1 | null
2020-01-02 | 1 | 1
2020-01-03 | 1 | 1
2020-01-04 | 1 | 4
2020-01-01 | 2 | 55
2020-01-02 | 2 | 55
Explanation: For user 1, we fill in 1 on 2020-01-03, since the last non-null value (from 2020-01-02) was 1. For user 2, we fill in 55 on 2020-01-02, since the last non-null value (from 2020-01-01) was 55.
How would I do this in pandas for each user_id and date? Also, the dates do not have to be sequential, i.e. there can be gaps in the dates. In that case, always fill in the last non-null value (whenever that was).
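For reference, the behaviour described in the question (carry each user's most recent non-null value forward) can be sketched with a grouped forward fill. This is only a minimal sketch: the DataFrame is rebuilt by hand from the sample table, and sorting by user_id and date is an assumption made so that gaps or unordered dates still pick up the most recent earlier value.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the table in the question
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04",
        "2020-01-01", "2020-01-02",
    ]),
    "user_id": [1, 1, 1, 1, 2, 2],
    "purchase_value": [np.nan, 1, np.nan, 4, 55, np.nan],
})

# Assumption: rows may arrive unordered, so sort first; "last" then means
# the most recent earlier date of the same user, even across date gaps
df = df.sort_values(["user_id", "date"]).reset_index(drop=True)

# Forward fill inside each user only, so values never leak between users
df["purchase_value"] = df.groupby("user_id")["purchase_value"].ffill()

print(df)
```

The first row of user 1 stays NaN because there is no earlier value to carry forward.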
If you really want to ffill only the last NaN per group, you need to identify it, then replace it with its ffill:
# is the value NaN?
m1 = df['purchase_value'].isna()
# is this the last NaN of the group?
# here: is this the first NaN of the group in reverse?
m2 = m1[::-1].groupby(df['user_id']).cumsum().eq(1)
# then replace with the ffill per group
df.loc[m1&m2, 'purchase_value'] = df.groupby(['user_id'])['purchase_value'].ffill()
Output:
date user_id purchase_value
0 2020-01-01 1 NaN
1 2020-01-02 1 1.0
2 2020-01-03 1 1.0
3 2020-01-04 1 4.0
4 2020-01-01 2 55.0
5 2020-01-02 2 55.0
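On the sample data, filling only the last NaN per user and a plain per-user forward fill give the same result. The difference only shows up when a user has several NaN values after a purchase. The sketch below uses hypothetical data (not from the question) where user 1 ends with two consecutive NaN values; the `sort_index()` call is an addition that keeps both masks in the same row order before they are combined.

```python
import numpy as np
import pandas as pd

# Hypothetical data: user 1 ends with two consecutive NaN values
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "purchase_value": [1, np.nan, np.nan, 55, np.nan],
})

# Plain forward fill per user: every NaN after a purchase is filled
filled = df.groupby("user_id")["purchase_value"].ffill()

# Mask approach: m1 marks NaN rows, m2 marks rows up to and including
# the first NaN when each group is walked in reverse
m1 = df["purchase_value"].isna()
m2 = m1[::-1].groupby(df["user_id"]).cumsum().eq(1).sort_index()

# Only the last NaN of each user receives the forward-filled value
last_only = df.copy()
last_only.loc[m1 & m2, "purchase_value"] = filled

print(pd.DataFrame({
    "user_id": df["user_id"],
    "last_nan_only": last_only["purchase_value"],
    "plain_ffill": filled,
}))
```

Here row 1 stays NaN with the mask approach but becomes 1.0 with the plain forward fill, so the choice depends on whether every gap should be filled or only the final one.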
Another possible solution:
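import numpy as np
import pandas as pd

# running count of NaN values per user
is_nan = df['purchase_value'].isna()
aux = is_nan.groupby(df['user_id']).cumsum()
# the last NaN of a user is a NaN row whose count equals the user's maximum
is_last_nan = is_nan & aux.eq(aux.groupby(df['user_id']).transform('max'))
# fill only that row; a per-user ffill is used instead of shift(1) so the
# value never comes from another user or from a preceding NaN row
(df.assign(
    purchase_value=np.where(is_last_nan,
                            df.groupby('user_id')['purchase_value'].ffill(),
                            df['purchase_value'])))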
Output:
date user_id purchase_value
0 2020-01-01 1 NaN
1 2020-01-02 1 1.0
2 2020-01-03 1 1.0
3 2020-01-04 1 4.0
4 2020-01-01 2 55.0
5 2020-01-02 2 55.0