
Alternative to apply for applying a function to each row of a Pandas DataFrame

Applying a function to every row in Pandas takes a long time, and I would like to find a faster way to do some feature engineering.

Here is my data:

0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  ... 60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99
0   2.311906    2.312835    2.315155    2.316544    2.315387    2.386366    2.920247    3.545590    2.816790    2.253111    2.260460    2.282920    2.311673    2.344402    2.375047    2.395634    2.547970    2.990950    3.029762    3.052723    2.523181    2.362719    2.377015    2.390256    2.354047    2.358948    2.656481    3.007007    2.500126    2.335350    2.338300    2.338754    2.335804    2.427132    2.341017    2.352707    1.882517    1.188852    1.192237    1.192237    ... -0.615403   0.204050    0.031022    -0.596637   -0.890989   0.828287    2.068090    2.065448    2.075396    2.077142    2.079175    2.082072    2.062504    2.070433    2.065153    2.066916    2.068090    2.069848    2.067503    2.059551    2.080045    1.232675    -0.630660   0.078983    0.084827    0.078983    0.078983    0.080934    0.084827    1.950281    2.164140    0.952579    -0.094829   -0.311746   -0.222131   -0.216938   -0.149240   -0.209196   -0.211771   -0.196420
1   2.312835    2.315155    2.316544    2.315387    2.386366    2.920247    3.545590    2.816790    2.253111    2.260460    2.282920    2.311673    2.344402    2.375047    2.395634    2.547970    2.990950    3.029762    3.052723    2.523181    2.362719    2.377015    2.390256    2.354047    2.358948    2.656481    3.007007    2.500126    2.335350    2.338300    2.338754    2.335804    2.427132    2.341017    2.352707    1.882517    1.188852    1.192237    1.192237    1.061397    ... 0.204050    0.031022    -0.596637   -0.890989   0.828287    2.068090    2.065448    2.075396    2.077142    2.079175    2.082072    2.062504    2.070433    2.065153    2.066916    2.068090    2.069848    2.067503    2.059551    2.080045    1.232675    -0.630660   0.078983    0.084827    0.078983    0.078983    0.080934    0.084827    1.950281    2.164140    0.952579    -0.094829   -0.311746   -0.222131   -0.216938   -0.149240   -0.209196   -0.211771   -0.196420   -0.243163
2   2.315155    2.316544    2.315387    2.386366    2.920247    3.545590    2.816790    2.253111    2.260460    2.282920    2.311673    2.344402    2.375047    2.395634    2.547970    2.990950    3.029762    3.052723    2.523181    2.362719    2.377015    2.390256    2.354047    2.358948    2.656481    3.007007    2.500126    2.335350    2.338300    2.338754    2.335804    2.427132    2.341017    2.352707    1.882517    1.188852    1.192237    1.192237    1.061397    0.975265    ... 0.031022    -0.596637   -0.890989   0.828287    2.068090    2.065448    2.075396    2.077142    2.079175    2.082072    2.062504    2.070433    2.065153    2.066916    2.068090    2.069848    2.067503    2.059551    2.080045    1.232675    -0.630660   0.078983    0.084827    0.078983    0.078983    0.080934    0.084827    1.950281    2.164140    0.952579    -0.094829   -0.311746   -0.222131   -0.216938   -0.149240   -0.209196   -0.211771   -0.196420   -0.243163   -0.186309
3   2.316544    2.315387    2.386366    2.920247    3.545590    2.816790    2.253111    2.260460    2.282920    2.311673    2.344402    2.375047    2.395634    2.547970    2.990950    3.029762    3.052723    2.523181    2.362719    2.377015    2.390256    2.354047    2.358948    2.656481    3.007007    2.500126    2.335350    2.338300    2.338754    2.335804    2.427132    2.341017    2.352707    1.882517    1.188852    1.192237    1.192237    1.061397    0.975265    0.965243    ... -0.596637   -0.890989   0.828287    2.068090    2.065448    2.075396    2.077142    2.079175    2.082072    2.062504    2.070433    2.065153    2.066916    2.068090    2.069848    2.067503    2.059551    2.080045    1.232675    -0.630660   0.078983    0.084827    0.078983    0.078983    0.080934    0.084827    1.950281    2.164140    0.952579    -0.094829   -0.311746   -0.222131   -0.216938   -0.149240   -0.209196   -0.211771   -0.196420   -0.243163   -0.186309   -0.211771

It has 100 columns and 352,953 rows, and running the code below takes 25 minutes in Colab. These are the features I am trying to create with apply:

import numpy as np
import pandas as pd
from scipy import stats
from scipy.signal import find_peaks

x_list_fft = pd.DataFrame(Xtrain_stats).apply(lambda x: np.abs(np.fft.fft(x)))

fft_features = pd.DataFrame()

# mean
fft_features['fft_mean'] = x_list_fft.apply(lambda x: x.mean(), axis = 1)
# std dev
fft_features['fft_std'] = x_list_fft.apply(lambda x: x.std(), axis = 1)
# avg absolute diff
fft_features['fft_aad'] = x_list_fft.apply(lambda x: np.mean(np.absolute(x - np.mean(x))), axis = 1)
# min
fft_features['fft_min'] = x_list_fft.apply(lambda x: x.min(), axis = 1)
# max
fft_features['fft_max'] = x_list_fft.apply(lambda x: x.max(), axis = 1)
# max-min diff
fft_features['fft_maxmin_diff'] = fft_features['fft_max'] - fft_features['fft_min']
# median
fft_features['fft_median'] = x_list_fft.apply(lambda x: np.median(x), axis = 1)
# median abs dev 
fft_features['fft_mad'] = x_list_fft.apply(lambda x: np.median(np.absolute(x - np.median(x))), axis = 1)
# interquartile range
fft_features['fft_IQR'] = x_list_fft.apply(lambda x: np.percentile(x, 75) - np.percentile(x, 25), axis = 1)
# negative count
fft_features['fft_neg_count'] = x_list_fft.apply(lambda x: np.sum(x < 0), axis = 1)
# positive count
fft_features['fft_pos_count'] = x_list_fft.apply(lambda x: np.sum(x > 0), axis = 1)
# values above mean
fft_features['fft_above_mean'] = x_list_fft.apply(lambda x: np.sum(x > x.mean()), axis = 1)
# number of peaks
fft_features['fft_peak_count'] = x_list_fft.apply(lambda x: len(find_peaks(x)[0]), axis = 1)
# skewness
fft_features['fft_skewness'] = x_list_fft.apply(lambda x: stats.skew(x), axis = 1)
# kurtosis
fft_features['fft_kurtosis'] = x_list_fft.apply(lambda x: stats.kurtosis(x), axis = 1)
# energy
fft_features['fft_energy'] = x_list_fft.apply(lambda x: np.sum(x**2)/100, axis = 1)
# signal magnitude area
fft_features['fft_sma'] = x_list_fft.apply(lambda x: np.sum(abs(x)/100), axis = 1)

So what can I do to make this faster?

One solution, though I am not sure about the find_peaks one, since I could not test it:

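import numpy as np
import pandas as pd
from scipy import stats
from scipy.signal import find_peaks

# FFT magnitude of every row at once; the result has shape (n_rows, 100).
# Note: the question's first apply() has no axis argument, so it transforms each
# column; use axis=0 here instead if that column-wise behaviour is what you want.
x_list_fft = np.abs(np.fft.fft(np.asarray(Xtrain_stats), axis=1))
x = x_list_fft  # just to shorten the variable name since it's repeated many times below

# Per-row statistics that are reused by several features
row_mean = x.mean(axis=1, keepdims=True)
row_median = np.median(x, axis=1, keepdims=True)
q25, q75 = np.percentile(x, [25, 75], axis=1)

fft_features = pd.DataFrame({
    'fft_mean': row_mean.ravel(),
    'fft_std': x.std(axis=1, ddof=1),  # ddof=1 matches pandas' Series.std()
    'fft_aad': np.mean(np.absolute(x - row_mean), axis=1),
    'fft_min': x.min(axis=1),
    'fft_max': x.max(axis=1),
    'fft_maxmin_diff': x.max(axis=1) - x.min(axis=1),
    'fft_median': row_median.ravel(),
    'fft_mad': np.median(np.absolute(x - row_median), axis=1),
    'fft_IQR': q75 - q25,
    'fft_neg_count': np.sum(x < 0, axis=1),
    'fft_pos_count': np.sum(x > 0, axis=1),
    'fft_above_mean': np.sum(x > row_mean, axis=1),
    # find_peaks only accepts 1-D input, so this feature still loops over the rows
    'fft_peak_count': [len(find_peaks(row)[0]) for row in x],
    'fft_skewness': stats.skew(x, axis=1),
    'fft_kurtosis': stats.kurtosis(x, axis=1),
    'fft_energy': np.sum(x**2, axis=1) / 100,
    'fft_sma': np.sum(np.absolute(x), axis=1) / 100,
})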
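
find_peaks works on one 1-D signal at a time, so the peak count is the one feature above that does not vectorize directly. If the per-row call turns out to be the bottleneck, a rough NumPy-only approximation might look like the sketch below. The helper name count_strict_peaks and the demo array are made up for illustration. It counts interior samples that are strictly greater than both neighbours, which should agree with find_peaks called without arguments except on flat-topped peaks, which this sketch does not count.

```python
import numpy as np

def count_strict_peaks(x):
    """Per row, count interior samples strictly greater than both neighbours."""
    x = np.asarray(x)
    mid = x[:, 1:-1]                                 # every sample except the two ends
    is_peak = (mid > x[:, :-2]) & (mid > x[:, 2:])   # compare with left and right neighbour
    return is_peak.sum(axis=1)

# Small hypothetical input: three rows of five samples each
demo = np.array([
    [0.0, 2.0, 1.0, 3.0, 0.5],   # strict peaks at index 1 and 3
    [1.0, 2.0, 3.0, 4.0, 5.0],   # monotonic, no interior peak
    [0.0, 1.0, 1.0, 0.0, 2.0],   # flat top: find_peaks would report one peak, this sketch none
])
peak_counts = count_strict_peaks(demo)
print(peak_counts)
```

Before relying on it, it is probably worth comparing the result against the find_peaks loop on a few thousand rows of the real data to see whether plateaus matter there.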

