Improve performance through vectorising code and avoiding pandas apply
import pandas as pd
import numpy as np
def impute_row_median(
    s: pd.Series,
    threshold: float
) -> pd.Series:
    '''For a vector of values, impute NaNs with the median if %NaN is below threshold'''
    nan_mask = s.isna()
    if nan_mask.any() and ((nan_mask.sum() / s.size) * 100) < threshold:
        s_median = s.median(skipna=True)
        s[nan_mask] = s_median
    return s  # dtype: float
df = pd.DataFrame(np.random.uniform(0, 1, size=(1000, 5)))
df = df.mask(df < 0.5)
df.apply(impute_row_median, axis=1, threshold=80) # slow
The following apply is pretty slow (I didn't use timeit since I have nothing to compare it to). My usual approach would be to avoid apply and instead use vectorised functions like np.where, but I can't presently conceive of a way to do that here. Does anyone have any suggestions? Thank you!
To count the percentage of missing values per row, use mean on the boolean mask; then combine the 2-D mask with the 1-D row mask in numpy by broadcasting, and replace the missing values with DataFrame.mask:
threshold = 80
mask = df.isna()
m = mask.mean(axis=1) * 100 < threshold
df1 = df.mask(mask & m.to_numpy()[:, None], df.median(axis=1, skipna=True), axis=0)
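To see how the 1-D row mask lines up against the 2-D NaN mask, here is a minimal sketch on a hand-made 3x3 frame (the column names and values are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({"a": [1.0, np.nan, np.nan],
                         "b": [np.nan, 2.0, np.nan],
                         "c": [3.0, 4.0, np.nan]})
threshold = 80

mask = df_small.isna()                    # 2-D boolean: True where NaN
m = mask.mean(axis=1) * 100 < threshold   # 1-D boolean: rows below the NaN threshold
# m.to_numpy()[:, None] has shape (3, 1), so `&` broadcasts it across columns:
# only NaNs in eligible rows are selected for replacement.
combined = mask & m.to_numpy()[:, None]
# axis=0 aligns the row medians with the index when filling
filled = df_small.mask(combined, df_small.median(axis=1, skipna=True), axis=0)
```

Row 0 and row 1 each have one NaN (33% < 80), so they are filled with their row medians (2.0 and 3.0); row 2 is 100% NaN, which fails the threshold, so it stays untouched.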
A similar idea with numpy.where:
mask = df.isna()
m = mask.mean(axis=1) * 100 < threshold
arr = np.where(mask & m.to_numpy()[:, None],
df.median(axis=1, skipna=True).to_numpy()[:, None],
df)
df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
A pure NumPy solution, converting to an array first:
threshold = 80
a = df.to_numpy()
mask = np.isnan(a)
m = np.mean(mask, axis=1) * 100 < threshold
arr = np.where(mask & m[:, None], np.nanmedian(a, axis=1)[:, None], df)
df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
print (df1.equals(df.apply(impute_row_median, axis=1, threshold=80)))
True
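As a minor stylistic variant of the pure NumPy version (not from the original answer), np.nanmedian accepts keepdims=True, which returns the row medians as a column vector directly and avoids the explicit [:, None] reshape:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.uniform(0, 1, size=(1000, 5)))
df = df.mask(df < 0.5)
threshold = 80

a = df.to_numpy()
mask = np.isnan(a)
m = np.mean(mask, axis=1) * 100 < threshold
# keepdims=True keeps shape (n_rows, 1), ready for broadcasting against the 2-D mask
med = np.nanmedian(a, axis=1, keepdims=True)
arr = np.where(mask & m[:, None], med, a)
df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
```

The result is identical to the `[:, None]` form; note that nanmedian emits a RuntimeWarning for all-NaN rows in both variants, which is harmless here since those rows fail the threshold anyway.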
Performance comparison (10k rows, 50 columns):
np.random.seed(2023)
df = pd.DataFrame(np.random.uniform(0, 1, size=(10000, 50)))
df = df.mask(df < 0.5)
In [130]: %timeit df.apply(impute_row_median, axis=1, threshold=80)
2.12 s ± 370 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [131]: %%timeit
...: a = df.to_numpy()
...:
...: mask = np.isnan(a)
...: m = np.mean(mask, axis=1) * 100 < threshold
...: arr = np.where(mask & m[:, None], np.nanmedian(a, axis=1)[:, None], df)
...:
...: df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
...:
29.5 ms ± 330 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [132]: %%timeit
...: threshold = 80
...:
...: mask = df.isna()
...: m = mask.mean(axis=1) * 100 < threshold
...: df1 = df.mask(mask & m.to_numpy()[:, None],df.median(axis=1, skipna=True),axis=0)
...:
18.6 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [133]: %%timeit
...: mask = df.isna()
...: m = mask.mean(axis=1) * 100 < threshold
...: arr = np.where(mask & m.to_numpy()[:, None],
...: df.median(axis=1, skipna=True).to_numpy()[:, None],
...: df)
...:
...: df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
...:
...:
10.2 ms ± 230 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)