Pandas: Fill missing values by mean in each group faster than transform

I need to fill missing values in a pandas DataFrame with the mean value of each group. According to this question, transform can achieve this.

However, transform is too slow for my purposes.

For example, take the following setup: a large DataFrame with 100 different groups and 70% NaN values:

import pandas as pd
import numpy as np

size = 10000000  # DataFrame length
ngroups = 100  # Number of Groups

randgroups = np.random.randint(ngroups, size=size)  # Creation of groups
randvals = np.random.rand(size) * randgroups * 2    # Random values whose mean equals the group number
nan_indices = np.random.permutation(range(size))    # NaN indices
nanfrac = 0.7                                       # Fraction of NaN values
nan_indices = nan_indices[:int(nanfrac*size)]       # Take fraction of NaN indices
randvals[nan_indices] = np.nan                      # Set NaN values

df = pd.DataFrame({'value': randvals, 'group': randgroups})  # Create data frame

Using transform via

df.groupby("group").transform(lambda x: x.fillna(x.mean())) # Takes too long

already takes more than 3 seconds on my computer. I need something about an order of magnitude faster (buying a bigger machine is not an option :-D).

So how can I fill the missing values any faster?

You're doing it wrong. It's slow because you're using a lambda:

df[['value']].fillna(df.groupby('group').transform('mean'))
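
If you want the filled values written back into df rather than just returned, a minimal sketch of the same idea (my addition, reusing the column names from the question's setup) is:

# Per-group means broadcast back to the original row positions
group_means = df.groupby('group')['value'].transform('mean')

# Fill only the NaNs and assign back to the column
df['value'] = df['value'].fillna(group_means)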

Using a Sorted Index + fillna()

You are right: your code takes 3.18 s to run. The code provided by @piRSquared takes 2.78 s to run.

  1. Original code:

     %timeit df.groupby("group").transform(lambda x: x.fillna(x.mean()))
     Output: 1 loop, best of 3: 3.18 s per loop

  2. piRSquared's improvement:

     %timeit df[['value']].fillna(df.groupby('group').transform('mean'))
     Output: 1 loop, best of 3: 2.78 s per loop

  3. Slightly more efficient way (using a sorted index and fillna):

You can set the group column as the index of the dataframe, and sort it.

df = df.set_index('group').sort_index()

Now that you have a sorted index, it's super cheap to access the subset of the DataFrame for a given group number, using df.loc[x, :].

Since you need to impute by the mean for every group, you need all the unique group ids. For this example you could use range (since the groups are from 0 to 99), but more generally you can use:

groups = np.unique(df.index.values)

After this, you can iterate over the groups and use fillna() for imputation:

%%timeit
for x in groups:
    df.loc[x, 'value'] = df.loc[x, 'value'].fillna(np.mean(df.loc[x, 'value']))

Output: 1 loop, best of 3: 231 ms per loop

Note: the set_index, sort_index and np.unique operations are a one-time cost. To be fair to everyone, the total time (including these operations) was 2.26 s on my machine, but the imputation piece took only 231 ms.
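
Putting the steps above together, here is a sketch of the whole approach as a single function (the function name and the final reset_index are my additions; the body just combines the snippets from this answer):

def fillna_sorted_index(df):
    # One-time cost: move 'group' into the index and sort it
    df = df.set_index('group').sort_index()
    groups = np.unique(df.index.values)
    # With a sorted index, df.loc[x, 'value'] is a cheap per-group slice
    for x in groups:
        df.loc[x, 'value'] = df.loc[x, 'value'].fillna(np.mean(df.loc[x, 'value']))
    # Put 'group' back as a regular column
    return df.reset_index()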

Here's a NumPy approach using np.bincount that's pretty efficient for such bin-based summing/averaging operations:

ids = df.group.values                    # Extract 2 columns as two arrays
vals = df.value.values

m = np.isnan(vals)                             # Mask of NaNs
grp_sums = np.bincount(ids,np.where(m,0,vals)) # Group sums with NaNs as 0s
avg_vals = grp_sums*(1.0/np.bincount(ids,~m))        # Group averages
vals[m] = avg_vals[ids[m]]              # Set avg values into NaN positions

Note that this updates the value column of df in place.
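
The runtime test below calls a bincount_based(df) helper whose definition isn't shown in the post; presumably it is just the snippet above wrapped in a function, roughly:

def bincount_based(df):
    # Assumed wrapper around the snippet above (exact definition not given in the post)
    ids = df.group.values
    vals = df.value.values
    m = np.isnan(vals)
    grp_sums = np.bincount(ids, np.where(m, 0, vals))   # Group sums with NaNs as 0s
    avg_vals = grp_sums * (1.0 / np.bincount(ids, ~m))  # Group averages
    vals[m] = avg_vals[ids[m]]                          # In-place fill of the NaN positions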

Runtime test

Data sizes:

size = 1000000  # DataFrame length
ngroups = 10  # Number of Groups

Timings:

In [17]: %timeit df.groupby("group").transform(lambda x: x.fillna(x.mean()))
1 loops, best of 3: 276 ms per loop

In [18]: %timeit bincount_based(df)
100 loops, best of 3: 13.6 ms per loop

In [19]: 276.0/13.6  # Speedup
Out[19]: 20.294117647058822

20x+ speedup there!
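
As a sanity check (my addition, not part of the original answer), you can verify on a copy of the data that the bincount-based fill matches the transform-based one:

# Expected result from the (slow) transform-based approach
expected = df.groupby('group')['value'].transform(lambda x: x.fillna(x.mean()))

# Run the bincount version on a copy; it relies on the in-place update described above
df_check = df.copy()
bincount_based(df_check)

assert np.allclose(df_check['value'].values, expected.values)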
