
Efficient solution for forward filling missing values in a pandas dataframe column?

I need to forward fill values in a column of a dataframe within groups. I should note that the first value in a group is never missing by construction. I have the following solutions at the moment.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1,1,2,2,2], 'b': [1, np.nan, 2, np.nan, np.nan]})

# desired output
a   b
1   1
1   1
2   2
2   2
2   2

Here are the three solutions that I've tried so far.

# really slow solutions
df['b'] = df.groupby('a')['b'].transform(lambda x: x.fillna(method='ffill'))
df['b'] = df.groupby('a')['b'].fillna(method='ffill')

# much faster solution, but more memory intensive and ugly all around
tmp = df.drop_duplicates('a', keep='first')
df.drop('b', inplace=True, axis=1)
df = df.merge(tmp, on='a')

All three of these produce my desired output, but the first two take a really long time on my data set, and the third solution is more memory intensive and feels rather clunky. Are there any other ways to forward fill a column?

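Calling ffill() directly on the column, without groupby, is the fastest option. It gives the right answer here because the first value in each group is never missing, so nothing can leak in from the previous group. Here is the comparison: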

%timeit df.b.ffill(inplace = True)
best of 3: 311 µs per loop

%timeit df['b'] = df.groupby('a')['b'].transform(lambda x: x.fillna(method='ffill'))
best of 3: 2.34 ms per loop

%timeit df['b'] = df.groupby('a')['b'].fillna(method='ffill')
best of 3: 4.41 ms per loop
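The timings above come from the five-row example frame, so they may not reflect behaviour on a large data set. Below is a minimal sketch for repeating the comparison on bigger synthetic data. The frame size, the group layout and the names `bench`, `candidates` and `timings` are all assumptions made for illustration, and the current method spelling `x.ffill()` is used in place of `fillna(method='ffill')`.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark data: 200 groups of 100 rows each, where only the
# first row of every group holds a value (matching the question's guarantee).
n_groups, group_size = 200, 100
a = np.repeat(np.arange(n_groups), group_size)
b = np.where(np.arange(a.size) % group_size == 0, a.astype(float), np.nan)
bench = pd.DataFrame({'a': a, 'b': b})

# The approaches under comparison; none of them modifies bench.
candidates = {
    'plain ffill': lambda: bench['b'].ffill(),
    'groupby ffill': lambda: bench.groupby('a')['b'].ffill(),
    'groupby transform(lambda)': lambda: bench.groupby('a')['b'].transform(lambda x: x.ffill()),
}

# Best average time per call, in seconds, over 3 repeats of 3 calls each.
timings = {
    name: min(timeit.repeat(fn, number=3, repeat=3)) / 3
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print(f'{name}: {seconds * 1e3:.3f} ms per call')
```

Absolute numbers will vary with the machine and the pandas version, so only the relative ordering is worth reading from the output.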

To make this robust, sort by both columns before filling: df.sort_values(['a', 'b']).ffill(). If an np.nan sits in the first position of a group, a plain ffill fills it with the last value of the previous group. Because sort_values places np.nan last by default, sorting by both a and b ensures that no group starts with np.nan, as long as the group has at least one non-missing value. You can then use .loc or .reindex with the original index to restore the original row order.

This will obviously be a tad slower than the other proposals... However, I contend it will be correct where the others are not.

demo

Consider the dataframe df

df = pd.DataFrame({'a': [1,1,2,2,2], 'b': [1, np.nan, np.nan, 2, np.nan]})

print(df)

   a    b
0  1  1.0
1  1  NaN
2  2  NaN
3  2  2.0
4  2  NaN

Try

df.sort_values('a').ffill()

   a    b
0  1  1.0
1  1  1.0
2  2  1.0  # <--- this is incorrect
3  2  2.0
4  2  2.0

Instead do

df.sort_values(['a', 'b']).ffill().loc[df.index]

   a    b
0  1  1.0
1  1  1.0
2  2  2.0
3  2  2.0
4  2  2.0
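The same steps can be wrapped in a small function that uses .reindex instead of .loc to restore the row order. This is only a sketch: `sorted_ffill` is a hypothetical helper name, not a pandas function, and it assumes the frame's index is unique.

```python
import numpy as np
import pandas as pd


def sorted_ffill(frame, group_col, value_col):
    """Hypothetical helper: forward fill value_col after sorting, then restore row order."""
    # NaN sorts last by default (na_position='last'), so every group that has
    # at least one value starts with a non-missing one.
    ordered = frame.sort_values([group_col, value_col])
    filled = ordered[value_col].ffill()
    # reindex puts the filled values back in the original row order
    # (assumes a unique index).
    return frame.assign(**{value_col: filled.reindex(frame.index)})


df = pd.DataFrame({'a': [1, 1, 2, 2, 2], 'b': [1, np.nan, np.nan, 2, np.nan]})
result = sorted_ffill(df, 'a', 'b')
print(result)
```

Because the rows are also sorted by b, each gap is filled with the largest value in its group rather than the nearest preceding one. The two coincide when every group holds a single distinct value, as in this example.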

special note
This is still incorrect if every value in a group is missing: the whole group is then filled from the previous group.
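A small sketch of that failure case, together with one possible way to undo it. The example frame and the names `leaked`, `all_missing` and `repaired` are illustrative assumptions. The mask uses `transform('count')`, which counts the non-missing values in each group.

```python
import numpy as np
import pandas as pd

# Group 2 has no observed value at all.
df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1, np.nan, np.nan, np.nan]})

# The sort-then-ffill approach lets group 1's value leak into group 2.
leaked = df.sort_values(['a', 'b']).ffill().loc[df.index]
print(leaked)

# Flag rows whose group contains no non-missing value, then blank them again.
all_missing = df.groupby('a')['b'].transform('count') == 0
repaired = leaked.assign(b=leaked['b'].mask(all_missing))
print(repaired)
```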

What about this?

df.groupby('a').b.transform('ffill')
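As a rough check of how this behaves on the awkward cases discussed in this thread, the sketch below runs it on a frame with a leading gap (group 2) and a group with no values at all (group 3). The example data and the names `filled` and `same` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1, 2, 2, 2, 3, 3],
    'b': [1, np.nan, np.nan, 2, np.nan, np.nan, np.nan],
})

# Forward fill within each group; values never cross a group boundary.
filled = df.groupby('a')['b'].transform('ffill')

# The GroupBy object also exposes ffill directly.
same = df.groupby('a')['b'].ffill()

print(df.assign(b=filled))
```

Nothing is carried across groups, so the leading gap in group 2 and all of group 3 stay missing. That differs from the sort-based approach, which would fill the leading gap in group 2 with 2.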


 