Pandas: fill incremental values for NAs according to another column in the DataFrame
I have a DataFrame with sessions for each user. One of the columns is the number of sessions so far. Some of these sessions have null values. I believe I could use the fillna and transform methods to fill the DataFrame appropriately.
import numpy as np
import pandas as pd
df = pd.DataFrame({'user': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'], 'sessions': [28, np.nan, np.nan, np.nan, 32, np.nan, np.nan, np.nan, 12, np.nan, 15, np.nan, 17, np.nan]})
Expected output DataFrame:
df_out = pd.DataFrame({'user': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'], 'sessions': [28, 29, 30, 31, 32, 9, 10, 11, 12, 14, 15, 16, 17, 18]})
Code I tried:
df['sessions'] = df['sessions'].fillna(df.groupby('user')['sessions'].transform('mean'))
This works if I want to fill with the group mean, and it is as far as I could get. Please suggest a few approaches.
PS - The starting value of the session count is not 1. I am working from a snapshot taken at some point in time, so I do not have data going back to session number 1 for every user.
Assuming there is no mismatch between the non-NaN values, you could do the following:
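def fun(x):
    # position (0-based, within the group) of the first non-NaN value
    _, diff = (~x.reset_index().isna()).idxmax()
    # value the group would start at, counting back from the first known value
    start = int(x[(~x.isna()).idxmax()] - diff)
    # generate a range based on the start value and the length of the group
    result = pd.RangeIndex(start, start + len(x))
    # return a Series aligned with the group's original index
    return pd.Series(data=result.values, index=x.index)

# group_keys=False keeps the original row index so the result can be assigned directly
df['count'] = df.groupby('user', group_keys=False).sessions.apply(fun)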
print(df)
Output:
   user  sessions  count
0     A      28.0     28
1     A       NaN     29
2     A       NaN     30
3     A       NaN     31
4     A      32.0     32
5     B       NaN      9
6     B       NaN     10
7     B       NaN     11
8     B      12.0     12
9     C       NaN     14
10    C      15.0     15
11    C       NaN     16
12    C      17.0     17
13    C       NaN     18
The first line of the function fun is equivalent to:
diff = (~x.reset_index().isna()).idxmax().iloc[1]
Basically, it finds the position of the first non-NaN value in a normalized (starting from 0) index.
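To make the offset computation concrete, here is a minimal sketch that runs the same two steps on a single group outside of groupby. The series x is hand-built to mimic user B from the question (index labels 5 to 8); it is an illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for user B's group, keeping its original labels 5..8.
x = pd.Series([np.nan, np.nan, np.nan, 12.0], index=[5, 6, 7, 8], name='sessions')

# reset_index() moves the labels into an 'index' column and gives a fresh
# 0-based index, so idxmax() on the boolean mask yields a position.
mask = ~x.reset_index().isna()
diff = mask.idxmax().iloc[1]            # position of the first non-NaN in 'sessions'

# Label-based lookup of the first non-NaN value in the group.
first_value = x[(~x.isna()).idxmax()]
start = int(first_value - diff)         # value the group would start at

print(diff, first_value, start)
```

The first known value (12) sits at position 3, so the group is taken to start at 12 - 3 = 9.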
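Use cumsum with fillna(1) for each group:
df.groupby('user', sort=False)['sessions'].apply(lambda x: x.fillna(1).cumsum()).reset_index()
Note that this matches the expected output only when the first row of each group is the one known value; any later known value is added to the running sum as well.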
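A small check on a single group shows how the running sum behaves. The two series below are made up for illustration: one where only the first value is known, and one shaped like user A in the question, where a later value is known too.

```python
import numpy as np
import pandas as pd

# Hypothetical groups, not taken from a real dataset.
only_first_known = pd.Series([28, np.nan, np.nan, np.nan])
later_value_known = pd.Series([28, np.nan, np.nan, np.nan, 32])

# Each NaN becomes 1, so the cumulative sum steps up by 1 per missing row.
a = only_first_known.fillna(1).cumsum()
# A later known value (32) is summed as well instead of acting as a checkpoint.
b = later_value_known.fillna(1).cumsum()

print(a.tolist())  # [28.0, 29.0, 30.0, 31.0]
print(b.tolist())  # [28.0, 29.0, 30.0, 31.0, 63.0]
```

So this approach seems best suited to data where each group begins with its single known count.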
You may reconstruct sessions by using groupby cumcount and first:
s = df.groupby('user').sessions.cumcount()
s1 = (df.sessions - s).groupby(df.user).transform('first')
df['sessions'] = s1 + s
In [867]: df
Out[867]:
   user  sessions
0     A      28.0
1     A      29.0
2     A      30.0
3     A      31.0
4     A      32.0
5     B       9.0
6     B      10.0
7     B      11.0
8     B      12.0
9     C      14.0
10    C      15.0
11    C      16.0
12    C      17.0
13    C      18.0
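The cumcount approach relies on the known values within each user agreeing with one another. As a rough sketch, the same intermediate values can be used to check that assumption before filling. The names pos, implied_start, consistent and filled are illustrative, and the final integer cast assumes every user has at least one known value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user': list('AAAAABBBBCCCCC'),
    'sessions': [28, np.nan, np.nan, np.nan, 32,
                 np.nan, np.nan, np.nan, 12,
                 np.nan, 15, np.nan, 17, np.nan],
})

pos = df.groupby('user').cumcount()       # 0-based row position within each user
implied_start = df['sessions'] - pos      # start value implied by each known row, NaN elsewhere

# Consistent data gives at most one distinct implied start per user.
n_starts = implied_start.groupby(df['user']).nunique()
consistent = bool((n_starts <= 1).all())

# 'first' picks the first non-NaN implied start in each group.
start = implied_start.groupby(df['user']).transform('first')
filled = (start + pos).astype('int64')    # assumes no user is entirely NaN

print(consistent)
print(filled.tolist())
```

If consistent comes out False, at least one user has known values that do not differ by exactly their row distance, and the reconstructed counts would follow only the first known value.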