Alternative to apply for applying a function to each row of a Pandas DataFrame

Question

Applying a function to each row in Pandas takes a lot of time, and I would like to find a faster way to do some feature engineering.

Here is my data:

0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  ... 60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99
0   2.311906    2.312835    2.315155    2.316544    2.315387    2.386366    2.920247    3.545590    2.816790    2.253111    2.260460    2.282920    2.311673    2.344402    2.375047    2.395634    2.547970    2.990950    3.029762    3.052723    2.523181    2.362719    2.377015    2.390256    2.354047    2.358948    2.656481    3.007007    2.500126    2.335350    2.338300    2.338754    2.335804    2.427132    2.341017    2.352707    1.882517    1.188852    1.192237    1.192237    ... -0.615403   0.204050    0.031022    -0.596637   -0.890989   0.828287    2.068090    2.065448    2.075396    2.077142    2.079175    2.082072    2.062504    2.070433    2.065153    2.066916    2.068090    2.069848    2.067503    2.059551    2.080045    1.232675    -0.630660   0.078983    0.084827    0.078983    0.078983    0.080934    0.084827    1.950281    2.164140    0.952579    -0.094829   -0.311746   -0.222131   -0.216938   -0.149240   -0.209196   -0.211771   -0.196420
1   2.312835    2.315155    2.316544    2.315387    2.386366    2.920247    3.545590    2.816790    2.253111    2.260460    2.282920    2.311673    2.344402    2.375047    2.395634    2.547970    2.990950    3.029762    3.052723    2.523181    2.362719    2.377015    2.390256    2.354047    2.358948    2.656481    3.007007    2.500126    2.335350    2.338300    2.338754    2.335804    2.427132    2.341017    2.352707    1.882517    1.188852    1.192237    1.192237    1.061397    ... 0.204050    0.031022    -0.596637   -0.890989   0.828287    2.068090    2.065448    2.075396    2.077142    2.079175    2.082072    2.062504    2.070433    2.065153    2.066916    2.068090    2.069848    2.067503    2.059551    2.080045    1.232675    -0.630660   0.078983    0.084827    0.078983    0.078983    0.080934    0.084827    1.950281    2.164140    0.952579    -0.094829   -0.311746   -0.222131   -0.216938   -0.149240   -0.209196   -0.211771   -0.196420   -0.243163
2   2.315155    2.316544    2.315387    2.386366    2.920247    3.545590    2.816790    2.253111    2.260460    2.282920    2.311673    2.344402    2.375047    2.395634    2.547970    2.990950    3.029762    3.052723    2.523181    2.362719    2.377015    2.390256    2.354047    2.358948    2.656481    3.007007    2.500126    2.335350    2.338300    2.338754    2.335804    2.427132    2.341017    2.352707    1.882517    1.188852    1.192237    1.192237    1.061397    0.975265    ... 0.031022    -0.596637   -0.890989   0.828287    2.068090    2.065448    2.075396    2.077142    2.079175    2.082072    2.062504    2.070433    2.065153    2.066916    2.068090    2.069848    2.067503    2.059551    2.080045    1.232675    -0.630660   0.078983    0.084827    0.078983    0.078983    0.080934    0.084827    1.950281    2.164140    0.952579    -0.094829   -0.311746   -0.222131   -0.216938   -0.149240   -0.209196   -0.211771   -0.196420   -0.243163   -0.186309
3   2.316544    2.315387    2.386366    2.920247    3.545590    2.816790    2.253111    2.260460    2.282920    2.311673    2.344402    2.375047    2.395634    2.547970    2.990950    3.029762    3.052723    2.523181    2.362719    2.377015    2.390256    2.354047    2.358948    2.656481    3.007007    2.500126    2.335350    2.338300    2.338754    2.335804    2.427132    2.341017    2.352707    1.882517    1.188852    1.192237    1.192237    1.061397    0.975265    0.965243    ... -0.596637   -0.890989   0.828287    2.068090    2.065448    2.075396    2.077142    2.079175    2.082072    2.062504    2.070433    2.065153    2.066916    2.068090    2.069848    2.067503    2.059551    2.080045    1.232675    -0.630660   0.078983    0.084827    0.078983    0.078983    0.080934    0.084827    1.950281    2.164140    0.952579    -0.094829   -0.311746   -0.222131   -0.216938   -0.149240   -0.209196   -0.211771   -0.196420   -0.243163   -0.186309   -0.211771

It has 100 columns and 352953 rows, and running the code below in Colab takes 25 minutes. These are the features I am trying to create with apply:

import numpy as np
import pandas as pd
from scipy import stats
from scipy.signal import find_peaks

x_list_fft = pd.DataFrame(Xtrain_stats).apply(lambda x: np.abs(np.fft.fft(x)))

fft_features = pd.DataFrame()

# mean
fft_features['fft_mean'] = x_list_fft.apply(lambda x: x.mean(), axis = 1)
# std dev
fft_features['fft_std'] = x_list_fft.apply(lambda x: x.std(), axis = 1)
# avg absolute diff
fft_features['fft_aad'] = x_list_fft.apply(lambda x: np.mean(np.absolute(x - np.mean(x))), axis = 1)
# min
fft_features['fft_min'] = x_list_fft.apply(lambda x: x.min(), axis = 1)
# max
fft_features['fft_max'] = x_list_fft.apply(lambda x: x.max(), axis = 1)
# max-min diff
fft_features['fft_maxmin_diff'] = fft_features['fft_max'] - fft_features['fft_min']
# median
fft_features['fft_median'] = x_list_fft.apply(lambda x: np.median(x), axis = 1)
# median abs dev 
fft_features['fft_mad'] = x_list_fft.apply(lambda x: np.median(np.absolute(x - np.median(x))), axis = 1)
# interquartile range
fft_features['fft_IQR'] = x_list_fft.apply(lambda x: np.percentile(x, 75) - np.percentile(x, 25), axis = 1)
# negative count
fft_features['fft_neg_count'] = x_list_fft.apply(lambda x: np.sum(x < 0), axis = 1)
# positive count
fft_features['fft_pos_count'] = x_list_fft.apply(lambda x: np.sum(x > 0), axis = 1)
# values above mean
fft_features['fft_above_mean'] = x_list_fft.apply(lambda x: np.sum(x > x.mean()), axis = 1)
# number of peaks
fft_features['fft_peak_count'] = x_list_fft.apply(lambda x: len(find_peaks(x)[0]), axis = 1)
# skewness
fft_features['fft_skewness'] = x_list_fft.apply(lambda x: stats.skew(x), axis = 1)
# kurtosis
fft_features['fft_kurtosis'] = x_list_fft.apply(lambda x: stats.kurtosis(x), axis = 1)
# energy
fft_features['fft_energy'] = x_list_fft.apply(lambda x: np.sum(x**2)/100, axis = 1)
# signal magnitude area
fft_features['fft_sma'] = x_list_fft.apply(lambda x: np.sum(abs(x)/100), axis = 1)

So what can I do to make it faster?

Answer 1

Here is one solution, though I'm not sure about the find_peaks part, since I couldn't test it:

import numpy as np
import pandas as pd
from scipy import stats
from scipy.signal import find_peaks

# Row-wise FFT magnitude as a plain 2-D NumPy array (one row per sample)
x = np.abs(np.fft.fft(np.asarray(Xtrain_stats), axis=1))
# keepdims=True gives shape (n_rows, 1), so these broadcast against x
row_mean = x.mean(axis=1, keepdims=True)
row_median = np.median(x, axis=1, keepdims=True)
q25, q75 = np.percentile(x, [25, 75], axis=1)
fft_features = pd.DataFrame({
    'fft_mean': row_mean.ravel(),
    'fft_std': x.std(axis=1, ddof=1),  # ddof=1 matches pandas' Series.std()
    'fft_aad': np.mean(np.abs(x - row_mean), axis=1),
    'fft_min': x.min(axis=1),
    'fft_max': x.max(axis=1),
    'fft_maxmin_diff': x.max(axis=1) - x.min(axis=1),
    'fft_median': row_median.ravel(),
    'fft_mad': np.median(np.abs(x - row_median), axis=1),
    'fft_IQR': q75 - q25,
    'fft_neg_count': np.sum(x < 0, axis=1),
    'fft_pos_count': np.sum(x > 0, axis=1),
    'fft_above_mean': np.sum(x > row_mean, axis=1),
    # find_peaks only accepts 1-D input, so this one still loops over rows
    'fft_peak_count': [len(find_peaks(row)[0]) for row in x],
    'fft_skewness': stats.skew(x, axis=1),
    'fft_kurtosis': stats.kurtosis(x, axis=1),
    'fft_energy': np.sum(x**2, axis=1) / 100,
    'fft_sma': np.sum(np.abs(x), axis=1) / 100,
})
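
One way to sanity-check a vectorized rewrite of this kind is to compare it with the apply version on a small random array before running it on the full data. The sketch below does that for three of the features. The array `demo`, its shape, and the names `slow`, `fast` and `matches` are made up for illustration; they stand in for `Xtrain_stats` and are not part of the original code. It assumes the FFT is taken along each row.

```python
import numpy as np
import pandas as pd
from scipy.signal import find_peaks

# Hypothetical stand-in for Xtrain_stats: 50 rows of 100 samples each
rng = np.random.default_rng(0)
demo = rng.normal(size=(50, 100))

# Row-wise FFT magnitudes as a 2-D NumPy array
x = np.abs(np.fft.fft(demo, axis=1))
df = pd.DataFrame(x)

# Slow reference: one Python-level function call per row
slow = pd.DataFrame({
    'fft_std': df.apply(lambda r: r.std(), axis=1),
    'fft_mad': df.apply(lambda r: np.median(np.absolute(r - np.median(r))), axis=1),
    'fft_peak_count': df.apply(lambda r: len(find_peaks(r)[0]), axis=1),
})

# Vectorized: reduce along axis=1 on the whole array at once
row_median = np.median(x, axis=1, keepdims=True)
fast = pd.DataFrame({
    # pandas' Series.std() uses ddof=1, while NumPy defaults to ddof=0
    'fft_std': x.std(axis=1, ddof=1),
    'fft_mad': np.median(np.abs(x - row_median), axis=1),
    # find_peaks only accepts 1-D input, so this feature still loops over rows
    'fft_peak_count': [len(find_peaks(row)[0]) for row in x],
})

# True when every feature agrees between the two versions
matches = np.allclose(slow.to_numpy(dtype=float), fast.to_numpy(dtype=float))
print(matches)
```

Two details are easy to miss here: the standard deviation only agrees with the apply version when `ddof=1` is passed to NumPy, and the peak count cannot be reduced with an `axis` argument, so it stays a loop (a list comprehension over the NumPy rows rather than `DataFrame.apply`).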

Solution 1
0 2021-11-12 19:14:05
