I need to fill missing values in a pandas DataFrame with the mean value of each group. According to this question, transform can achieve this. However, transform is too slow for my purposes.
For example, take the following setup with a large DataFrame, 100 different groups, and 70% NaN values:
import pandas as pd
import numpy as np
size = 10000000 # DataFrame length
ngroups = 100 # Number of Groups
randgroups = np.random.randint(ngroups, size=size) # Creation of groups
randvals = np.random.rand(size) * randgroups * 2 # Random values with mean like group number
nan_indices = np.random.permutation(range(size)) # NaN indices
nanfrac = 0.7 # Fraction of NaN values
nan_indices = nan_indices[:int(nanfrac*size)] # Take fraction of NaN indices
randvals[nan_indices] = np.nan # Set NaN values
df = pd.DataFrame({'value': randvals, 'group': randgroups}) # Create data frame
Using transform via
df.groupby("group").transform(lambda x: x.fillna(x.mean())) # Takes too long
already takes more than 3 seconds on my computer. I need something an order of magnitude faster (buying a bigger machine is not an option :-D).
So how can I fill the missing values any faster?
You're doing it wrong: it's slow because you're using a lambda. Compute the group means once with transform('mean') and pass the result to fillna():
df[['value']].fillna(df.groupby('group').transform('mean'))
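As a quick sanity check that the lambda-free version fills the same values, here is a minimal sketch on a tiny frame (df_small is my own example data, not from the question):

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({
    'group': [0, 0, 1, 1, 1],
    'value': [1.0, np.nan, 2.0, np.nan, 4.0],
})

# Slow version: lambda applied per group
slow = df_small.groupby('group')['value'].transform(lambda x: x.fillna(x.mean()))

# Fast version: group means computed once, then a single fillna
fast = df_small['value'].fillna(df_small.groupby('group')['value'].transform('mean'))

pd.testing.assert_series_equal(slow, fast)
print(fast.tolist())  # group 0 mean is 1.0, group 1 mean is 3.0
```

The speedup comes from transform('mean') dispatching to a Cythonized aggregation instead of calling a Python function once per group.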
You are right - your code takes 3.18 s to run, while the code provided by @piRSquared takes 2.78 s.
Example code:
%%timeit
df2 = df.groupby("group").transform(lambda x: x.fillna(x.mean()))
Output: 1 loop, best of 3: 3.18 s per loop
piRSquared's improvement:
%%timeit
df[['value']].fillna(df.groupby('group').transform('mean'))
Output: 1 loop, best of 3: 2.78 s per loop
A slightly more efficient way, using a sorted index and fillna():
You can set the group column as the index of the DataFrame and sort it.
df = df.set_index('group').sort_index()
Now that you have a sorted index, it's super cheap to access a subset of the DataFrame for any group number using df.loc[x,:]
Since you need to impute the mean for every group, you need all the unique group ids. For this example you could use range(100) (since the groups are 0 to 99), but more generally you can use:
groups = np.unique(df.index.values)
After this, you can iterate over the groups and use fillna() for imputation:
%%timeit
for x in groups:
    df.loc[x, 'value'] = df.loc[x, 'value'].fillna(np.mean(df.loc[x, 'value']))
Output: 1 loop, best of 3: 231 ms per loop
Note: the set_index, sort_index and np.unique operations are a one-time cost. To be fair to everyone, the total time (including these operations) was 2.26 s on my machine, but the imputation piece alone took only 231 ms.
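Putting the steps above together, a self-contained sketch of the sorted-index approach (the function name fillna_sorted_index and the .to_numpy() write-back are my own; the write-back assigns positionally and avoids index alignment against the duplicated group labels):

```python
import numpy as np
import pandas as pd

def fillna_sorted_index(df):
    # One-time cost: make the group column a sorted index for cheap .loc slicing
    df = df.set_index('group').sort_index()
    for g in np.unique(df.index.values):
        block = df.loc[g, 'value']                      # contiguous block for one group
        df.loc[g, 'value'] = block.fillna(block.mean()).to_numpy()
    return df.reset_index()

demo = pd.DataFrame({'group': [0, 0, 1, 1],
                     'value': [1.0, np.nan, np.nan, 5.0]})
out = fillna_sorted_index(demo)
print(out['value'].tolist())  # [1.0, 1.0, 5.0, 5.0]
```

Because the index is sorted, each df.loc[g, ...] lookup is a cheap contiguous slice rather than a full scan, which is where the 10x gain over transform comes from.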
Here's a NumPy approach using np.bincount that's pretty efficient for such bin-based summing/averaging operations:
ids = df.group.values # Group id per row
vals = df.value.values # Values as a NumPy array
m = np.isnan(vals) # Mask of NaNs
grp_sums = np.bincount(ids, np.where(m, 0, vals)) # Group sums with NaNs as 0s
avg_vals = grp_sums * (1.0 / np.bincount(ids, ~m)) # Group averages
vals[m] = avg_vals[ids[m]] # Set avg values into NaN positions
Note that this updates the value column of df in place.
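For reference, here is the snippet wrapped as a function matching the bincount_based name used in the timings below (the wrapper and the explicit copy of the value array are my additions; it assumes non-negative integer group ids and at least one non-NaN value per group):

```python
import numpy as np
import pandas as pd

def bincount_based(df):
    ids = df['group'].values                # group id per row
    vals = df['value'].values.copy()        # copy so the input column isn't mutated as a side effect
    m = np.isnan(vals)                      # mask of NaNs
    grp_sums = np.bincount(ids, np.where(m, 0, vals))  # per-group sums, NaNs counted as 0
    grp_counts = np.bincount(ids, ~m)                  # per-group non-NaN counts
    avg_vals = grp_sums / grp_counts                   # per-group means
    vals[m] = avg_vals[ids[m]]              # scatter each group's mean into its NaN slots
    df['value'] = vals
    return df

demo = pd.DataFrame({'group': [0, 1, 0, 1],
                     'value': [2.0, np.nan, 4.0, 7.0]})
print(bincount_based(demo)['value'].tolist())  # [2.0, 7.0, 4.0, 7.0]
```

np.bincount does the whole groupby-sum in one pass over a flat array, with no per-group Python overhead, which is why it beats even the Cythonized pandas path.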
Runtime test
Data sizes:
size = 1000000 # DataFrame length
ngroups = 10 # Number of groups
Timings:
In [17]: %timeit df.groupby("group").transform(lambda x: x.fillna(x.mean()))
1 loops, best of 3: 276 ms per loop
In [18]: %timeit bincount_based(df)
100 loops, best of 3: 13.6 ms per loop
In [19]: 276.0/13.6 # Speedup
Out[19]: 20.294117647058822
That's a 20x+ speedup!