
Improve performance through vectorising code and avoiding pandas apply

import pandas as pd
import numpy as np


def impute_row_median(
    s: pd.Series,
    threshold: float
) -> pd.Series:
    '''For a vector of values, impute NaNs with the median if the %NaN is below threshold'''
    nan_mask = s.isna()
    # Impute only when NaNs are present and their share (in percent) is under the threshold
    if nan_mask.any() and (nan_mask.mean() * 100) < threshold:
        # fillna returns a new Series, so the row handed in by apply is not mutated
        return s.fillna(s.median(skipna=True))
    return s  # dtype: float


df = pd.DataFrame(np.random.uniform(0, 1, size=(1000, 5)))
df = df.mask(df < 0.5)
df.apply(impute_row_median, axis=1, threshold=80)  # slow

The apply call above is quite slow (I didn't use timeit since I have nothing to compare it against). My usual approach would be to avoid apply and use vectorised functions like np.where instead, but I can't currently think of a way to do that here. Does anyone have any suggestions? Thank you!

To get the percentage of missing values per row, take the mean of the boolean mask along axis=1. Then combine the 2D mask with the 1D row mask through NumPy broadcasting, and replace the missing values with DataFrame.mask:

threshold = 80

mask = df.isna()
m = mask.mean(axis=1) * 100 < threshold 
df1 = df.mask(mask & m.to_numpy()[:, None], df.median(axis=1, skipna=True), axis=0)
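
To make the broadcasting step concrete, here is a minimal sketch on a tiny frame. The frame `small`, the 50% threshold, and the names `row_ok` and `fill_here` are illustrative assumptions, not part of the original answer; the logic follows the df.mask approach above.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x4 frame, chosen so each row has a different share of NaNs.
small = pd.DataFrame(
    [
        [1.0, np.nan, 3.0, 5.0],        # 25% NaN: below the threshold, gets imputed
        [np.nan, np.nan, np.nan, 8.0],  # 75% NaN: above the threshold, left alone
        [2.0, 4.0, 6.0, 8.0],           # no NaN: nothing to impute
    ]
)
threshold = 50  # assumed value for this illustration

mask = small.isna()                                         # 2D boolean, shape (3, 4)
row_ok = (mask.mean(axis=1) * 100 < threshold).to_numpy()   # 1D boolean, shape (3,)

# row_ok[:, None] has shape (3, 1), so it broadcasts across the 4 columns:
# a cell is selected only if it is NaN AND its row passes the threshold.
fill_here = mask & row_ok[:, None]

# axis=0 aligns the per-row medians with the row index.
result = small.mask(fill_here, small.median(axis=1, skipna=True), axis=0)
print(result)
```

Only the NaN in the first row should be replaced (by that row's median of 1, 3 and 5); the second row keeps its NaNs because 75% of its values are missing.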

A similar idea with numpy.where:

mask = df.isna()
m = mask.mean(axis=1) * 100 < threshold
arr = np.where(mask & m.to_numpy()[:, None], 
               df.median(axis=1, skipna=True).to_numpy()[:, None], 
               df)

df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)

Or a pure NumPy variant that works on the underlying array:

threshold = 80

a = df.to_numpy()

mask = np.isnan(a)
m = np.mean(mask, axis=1) * 100 < threshold
arr = np.where(mask & m[:, None], np.nanmedian(a, axis=1)[:, None], a)

df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)

print(df1.equals(df.apply(impute_row_median, axis=1, threshold=80)))
True
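
If you would rather check equivalence without relying on apply, one option is to compare against a plain row-by-row loop. The sketch below assumes the threshold is at most 100; the function names `impute_rows_vectorised` and `impute_rows_loop` and the test data are made up for illustration.

```python
import warnings

import numpy as np
import pandas as pd


def impute_rows_vectorised(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Vectorised row-median imputation (hypothetical wrapper around the NumPy variant)."""
    a = df.to_numpy(dtype=float)
    mask = np.isnan(a)
    row_ok = mask.mean(axis=1) * 100 < threshold
    with warnings.catch_warnings():
        # np.nanmedian warns on all-NaN rows; those rows are never filled anyway.
        warnings.simplefilter("ignore", category=RuntimeWarning)
        medians = np.nanmedian(a, axis=1)
    out = np.where(mask & row_ok[:, None], medians[:, None], a)
    return pd.DataFrame(out, index=df.index, columns=df.columns)


def impute_rows_loop(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Slow reference: the same rule applied one row at a time."""
    out = df.to_numpy(dtype=float).copy()
    for row in out:  # each row is a view into `out`, so assignment updates it
        nan_mask = np.isnan(row)
        if nan_mask.any() and nan_mask.mean() * 100 < threshold:
            row[nan_mask] = np.nanmedian(row)
    return pd.DataFrame(out, index=df.index, columns=df.columns)


rng = np.random.default_rng(0)
values = rng.uniform(0, 1, size=(200, 6))
test_df = pd.DataFrame(values).mask(values < 0.5)

fast = impute_rows_vectorised(test_df, threshold=80)
slow = impute_rows_loop(test_df, threshold=80)
pd.testing.assert_frame_equal(fast, slow)  # raises AssertionError on any mismatch
```

assert_frame_equal treats NaNs in matching positions as equal, so rows that are left unimputed do not cause a false failure.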

Performance comparison (10k rows, 50 columns):

np.random.seed(2023)
df = pd.DataFrame(np.random.uniform(0, 1, size=(10000, 50)))
df = df.mask(df < 0.5)

In [130]: %timeit df.apply(impute_row_median, axis=1, threshold=80)
2.12 s ± 370 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [131]: %%timeit
     ...: a = df.to_numpy()
     ...: 
     ...: mask = np.isnan(a)
     ...: m = np.mean(mask, axis=1) * 100 < threshold
     ...: arr = np.where(mask & m[:, None], np.nanmedian(a, axis=1)[:, None], df)
     ...: 
     ...: df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
     ...: 
29.5 ms ± 330 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [132]: %%timeit
     ...: threshold = 80
     ...: 
     ...: mask = df.isna()
     ...: m = mask.mean(axis=1) * 100 < threshold 
     ...: df1 = df.mask(mask & m.to_numpy()[:, None],df.median(axis=1, skipna=True),axis=0)
     ...: 
18.6 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [133]: %%timeit
     ...: mask = df.isna()
     ...: m = mask.mean(axis=1) * 100 < threshold
     ...: arr = np.where(mask & m.to_numpy()[:, None], 
     ...:                df.median(axis=1, skipna=True).to_numpy()[:, None], 
     ...:                df)
     ...: 
     ...: df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
     ...: 
     ...: 
10.2 ms ± 230 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
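
The timings above come from IPython's %timeit magic. Outside IPython, a rough equivalent can be put together with the standard-library timeit module. This is only a sketch: the smaller frame size, the repeat counts, and the function names `with_df_mask` and `with_np_where` are assumptions, and the absolute numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(2023)
values = rng.uniform(0, 1, size=(2000, 50))  # smaller than the 10k rows above, to keep it quick
bench_df = pd.DataFrame(values).mask(values < 0.5)
threshold = 80


def with_df_mask(df: pd.DataFrame = bench_df) -> pd.DataFrame:
    mask = df.isna()
    m = mask.mean(axis=1) * 100 < threshold
    return df.mask(mask & m.to_numpy()[:, None], df.median(axis=1, skipna=True), axis=0)


def with_np_where(df: pd.DataFrame = bench_df) -> pd.DataFrame:
    mask = df.isna()
    m = mask.mean(axis=1) * 100 < threshold
    arr = np.where(mask & m.to_numpy()[:, None],
                   df.median(axis=1, skipna=True).to_numpy()[:, None],
                   df)
    return pd.DataFrame(arr, index=df.index, columns=df.columns)


calls_per_run = 5
timings = {}
for name, func in [("df.mask", with_df_mask), ("np.where", with_np_where)]:
    # Take the best of 3 runs to reduce noise from other processes.
    runs = timeit.repeat(func, number=calls_per_run, repeat=3)
    timings[name] = min(runs) / calls_per_run
    print(f"{name}: {timings[name] * 1e3:.2f} ms per call")
```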

