Improve performance through vectorising code and avoiding pandas apply
import pandas as pd
import numpy as np
def impute_row_median(
    s: pd.Series,
    threshold: float
) -> pd.Series:
    '''For a vector of values, impute NaNs with the median if %NaN is below threshold'''
    nan_mask = s.isna()
    if nan_mask.any() and ((nan_mask.sum() / s.size) * 100) < threshold:
        s_median = s.median(skipna=True)
        s[nan_mask] = s_median
    return s  # dtype: float
df = pd.DataFrame(np.random.uniform(0, 1, size=(1000, 5)))
df = df.mask(df < 0.5)
df.apply(impute_row_median, axis=1, threshold=80) # slow
The following apply is pretty slow (I didn't use timeit since I have nothing to compare it to). My usual approach would be to avoid apply and instead use vectorised functions like np.where, but I can't presently conceive of a way to do that here. Does anyone have any suggestions? Thank you!
To count the percentage of missing values per row, use mean on the boolean mask; then combine the 2-D mask with the 1-D row mask in numpy by broadcasting, and replace the missing values with DataFrame.mask:
threshold = 80
mask = df.isna()
m = mask.mean(axis=1) * 100 < threshold
df1 = df.mask(mask & m.to_numpy()[:, None], df.median(axis=1, skipna=True), axis=0)
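To see how the 1-D row mask lines up against the 2-D NaN mask, here is a minimal sketch on a hand-made 3x3 frame (the column names and values are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({"a": [1.0, np.nan, np.nan],
                         "b": [np.nan, 2.0, np.nan],
                         "c": [3.0, 4.0, np.nan]})
threshold = 80

mask = df_small.isna()                    # 2-D boolean: True where NaN
m = mask.mean(axis=1) * 100 < threshold   # 1-D boolean: rows below the NaN threshold
# m.to_numpy()[:, None] has shape (3, 1), so `&` broadcasts it across columns:
# only NaNs in eligible rows are selected for replacement.
combined = mask & m.to_numpy()[:, None]
# axis=0 aligns the row medians with the index when filling
filled = df_small.mask(combined, df_small.median(axis=1, skipna=True), axis=0)
```

Row 0 and row 1 each have one NaN (33% < 80), so they are filled with their row medians (2.0 and 3.0); row 2 is 100% NaN, which fails the threshold, so it stays untouched.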
A similar idea with numpy.where:
mask = df.isna()
m = mask.mean(axis=1) * 100 < threshold
arr = np.where(mask & m.to_numpy()[:, None],
df.median(axis=1, skipna=True).to_numpy()[:, None],
df)
df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
A pure NumPy solution, converting to an array first:
threshold = 80
a = df.to_numpy()
mask = np.isnan(a)
m = np.mean(mask, axis=1) * 100 < threshold
arr = np.where(mask & m[:, None], np.nanmedian(a, axis=1)[:, None], df)
df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
print (df1.equals(df.apply(impute_row_median, axis=1, threshold=80)))
True
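As a minor stylistic variant of the pure NumPy version (not from the original answer), np.nanmedian accepts keepdims=True, which returns the row medians as a column vector directly and avoids the explicit [:, None] reshape:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.uniform(0, 1, size=(1000, 5)))
df = df.mask(df < 0.5)
threshold = 80

a = df.to_numpy()
mask = np.isnan(a)
m = np.mean(mask, axis=1) * 100 < threshold
# keepdims=True keeps shape (n_rows, 1), ready for broadcasting against the 2-D mask
med = np.nanmedian(a, axis=1, keepdims=True)
arr = np.where(mask & m[:, None], med, a)
df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
```

The result is identical to the `[:, None]` form; note that nanmedian emits a RuntimeWarning for all-NaN rows in both variants, which is harmless here since those rows fail the threshold anyway.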
Performance comparison (10k rows, 50 columns):
np.random.seed(2023)
df = pd.DataFrame(np.random.uniform(0, 1, size=(10000, 50)))
df = df.mask(df < 0.5)
In [130]: %timeit df.apply(impute_row_median, axis=1, threshold=80)
2.12 s ± 370 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [131]: %%timeit
...: a = df.to_numpy()
...:
...: mask = np.isnan(a)
...: m = np.mean(mask, axis=1) * 100 < threshold
...: arr = np.where(mask & m[:, None], np.nanmedian(a, axis=1)[:, None], df)
...:
...: df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
...:
29.5 ms ± 330 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [132]: %%timeit
...: threshold = 80
...:
...: mask = df.isna()
...: m = mask.mean(axis=1) * 100 < threshold
...: df1 = df.mask(mask & m.to_numpy()[:, None],df.median(axis=1, skipna=True),axis=0)
...:
18.6 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [133]: %%timeit
...: mask = df.isna()
...: m = mask.mean(axis=1) * 100 < threshold
...: arr = np.where(mask & m.to_numpy()[:, None],
...: df.median(axis=1, skipna=True).to_numpy()[:, None],
...: df)
...:
...: df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
...:
...:
10.2 ms ± 230 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)