Alternative to apply function for applying a function to each row in Pandas DataFrame

Applying a function to each row of a pandas DataFrame with apply takes a very long time, and I would like to find a faster way to do some feature engineering.

Here is my data:

0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  ... 60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99
0   2.311906    2.312835    2.315155    2.316544    2.315387    2.386366    2.920247    3.545590    2.816790    2.253111    2.260460    2.282920    2.311673    2.344402    2.375047    2.395634    2.547970    2.990950    3.029762    3.052723    2.523181    2.362719    2.377015    2.390256    2.354047    2.358948    2.656481    3.007007    2.500126    2.335350    2.338300    2.338754    2.335804    2.427132    2.341017    2.352707    1.882517    1.188852    1.192237    1.192237    ... -0.615403   0.204050    0.031022    -0.596637   -0.890989   0.828287    2.068090    2.065448    2.075396    2.077142    2.079175    2.082072    2.062504    2.070433    2.065153    2.066916    2.068090    2.069848    2.067503    2.059551    2.080045    1.232675    -0.630660   0.078983    0.084827    0.078983    0.078983    0.080934    0.084827    1.950281    2.164140    0.952579    -0.094829   -0.311746   -0.222131   -0.216938   -0.149240   -0.209196   -0.211771   -0.196420
1   2.312835    2.315155    2.316544    2.315387    2.386366    2.920247    3.545590    2.816790    2.253111    2.260460    2.282920    2.311673    2.344402    2.375047    2.395634    2.547970    2.990950    3.029762    3.052723    2.523181    2.362719    2.377015    2.390256    2.354047    2.358948    2.656481    3.007007    2.500126    2.335350    2.338300    2.338754    2.335804    2.427132    2.341017    2.352707    1.882517    1.188852    1.192237    1.192237    1.061397    ... 0.204050    0.031022    -0.596637   -0.890989   0.828287    2.068090    2.065448    2.075396    2.077142    2.079175    2.082072    2.062504    2.070433    2.065153    2.066916    2.068090    2.069848    2.067503    2.059551    2.080045    1.232675    -0.630660   0.078983    0.084827    0.078983    0.078983    0.080934    0.084827    1.950281    2.164140    0.952579    -0.094829   -0.311746   -0.222131   -0.216938   -0.149240   -0.209196   -0.211771   -0.196420   -0.243163
2   2.315155    2.316544    2.315387    2.386366    2.920247    3.545590    2.816790    2.253111    2.260460    2.282920    2.311673    2.344402    2.375047    2.395634    2.547970    2.990950    3.029762    3.052723    2.523181    2.362719    2.377015    2.390256    2.354047    2.358948    2.656481    3.007007    2.500126    2.335350    2.338300    2.338754    2.335804    2.427132    2.341017    2.352707    1.882517    1.188852    1.192237    1.192237    1.061397    0.975265    ... 0.031022    -0.596637   -0.890989   0.828287    2.068090    2.065448    2.075396    2.077142    2.079175    2.082072    2.062504    2.070433    2.065153    2.066916    2.068090    2.069848    2.067503    2.059551    2.080045    1.232675    -0.630660   0.078983    0.084827    0.078983    0.078983    0.080934    0.084827    1.950281    2.164140    0.952579    -0.094829   -0.311746   -0.222131   -0.216938   -0.149240   -0.209196   -0.211771   -0.196420   -0.243163   -0.186309
3   2.316544    2.315387    2.386366    2.920247    3.545590    2.816790    2.253111    2.260460    2.282920    2.311673    2.344402    2.375047    2.395634    2.547970    2.990950    3.029762    3.052723    2.523181    2.362719    2.377015    2.390256    2.354047    2.358948    2.656481    3.007007    2.500126    2.335350    2.338300    2.338754    2.335804    2.427132    2.341017    2.352707    1.882517    1.188852    1.192237    1.192237    1.061397    0.975265    0.965243    ... -0.596637   -0.890989   0.828287    2.068090    2.065448    2.075396    2.077142    2.079175    2.082072    2.062504    2.070433    2.065153    2.066916    2.068090    2.069848    2.067503    2.059551    2.080045    1.232675    -0.630660   0.078983    0.084827    0.078983    0.078983    0.080934    0.084827    1.950281    2.164140    0.952579    -0.094829   -0.311746   -0.222131   -0.216938   -0.149240   -0.209196   -0.211771   -0.196420   -0.243163   -0.186309   -0.211771

It has 100 columns and 352,953 rows, and running the code below takes 25 minutes in Colab. Here are the features I'm trying to create with apply:

x_list_fft = pd.DataFrame(Xtrain_stats).apply(lambda x: np.abs(np.fft.fft(x)))

fft_features = pd.DataFrame()

# mean
fft_features['fft_mean'] = x_list_fft.apply(lambda x: x.mean(), axis = 1)
# std dev
fft_features['fft_std'] = x_list_fft.apply(lambda x: x.std(), axis = 1)
# avg absolute diff
fft_features['fft_aad'] = x_list_fft.apply(lambda x: np.mean(np.absolute(x - np.mean(x))), axis = 1)
# min
fft_features['fft_min'] = x_list_fft.apply(lambda x: x.min(), axis = 1)
# max
fft_features['fft_max'] = x_list_fft.apply(lambda x: x.max(), axis = 1)
# max-min diff
fft_features['fft_maxmin_diff'] = fft_features['fft_max'] - fft_features['fft_min']
# median
fft_features['fft_median'] = x_list_fft.apply(lambda x: np.median(x), axis = 1)
# median abs dev 
fft_features['fft_mad'] = x_list_fft.apply(lambda x: np.median(np.absolute(x - np.median(x))), axis = 1)
# interquartile range
fft_features['fft_IQR'] = x_list_fft.apply(lambda x: np.percentile(x, 75) - np.percentile(x, 25), axis = 1)
# negative count
fft_features['fft_neg_count'] = x_list_fft.apply(lambda x: np.sum(x < 0), axis = 1)
# positive count
fft_features['fft_pos_count'] = x_list_fft.apply(lambda x: np.sum(x > 0), axis = 1)
# values above mean
fft_features['fft_above_mean'] = x_list_fft.apply(lambda x: np.sum(x > x.mean()), axis = 1)
# number of peaks
fft_features['fft_peak_count'] = x_list_fft.apply(lambda x: len(find_peaks(x)[0]), axis = 1)
# skewness
fft_features['fft_skewness'] = x_list_fft.apply(lambda x: stats.skew(x), axis = 1)
# kurtosis
fft_features['fft_kurtosis'] = x_list_fft.apply(lambda x: stats.kurtosis(x), axis = 1)
# energy
fft_features['fft_energy'] = x_list_fft.apply(lambda x: np.sum(x**2)/100, axis = 1)
# signal magnitude area
fft_features['fft_sma'] = x_list_fft.apply(lambda x: np.sum(abs(x)/100), axis = 1)

So what can I do to make it faster?

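One solution is to compute the FFT once as a NumPy array and pass axis=1 to every reduction, so each statistic is calculated for all rows in a single vectorized call. I have not tested this on your data. find_peaks only accepts 1-D input, so that feature still needs a loop over the rows:

import numpy as np
import pandas as pd
from scipy import stats
from scipy.signal import find_peaks

# FFT magnitude of each row. The question's apply() has no axis argument, so it
# transforms each column; use axis=0 here if that is what you actually want.
x = np.abs(np.fft.fft(np.asarray(Xtrain_stats), axis=1))
mean = x.mean(axis=1, keepdims=True)           # kept 2-D so it broadcasts against x
median = np.median(x, axis=1, keepdims=True)
q75, q25 = np.percentile(x, [75, 25], axis=1)  # both percentiles in one call
fft_features = pd.DataFrame({
    'fft_mean': mean[:, 0],
    'fft_std': x.std(axis=1, ddof=1),          # ddof=1 matches pandas' Series.std()
    'fft_aad': np.abs(x - mean).mean(axis=1),
    'fft_min': x.min(axis=1),
    'fft_max': x.max(axis=1),
    'fft_median': median[:, 0],
    'fft_mad': np.median(np.abs(x - median), axis=1),
    'fft_IQR': q75 - q25,
    'fft_neg_count': (x < 0).sum(axis=1),      # always 0, since x holds magnitudes
    'fft_pos_count': (x > 0).sum(axis=1),
    'fft_above_mean': (x > mean).sum(axis=1),
    'fft_peak_count': [len(find_peaks(row)[0]) for row in x],  # 1-D only, so loop
    'fft_skewness': stats.skew(x, axis=1),
    'fft_kurtosis': stats.kurtosis(x, axis=1),
    'fft_energy': (x ** 2).sum(axis=1) / 100,
    'fft_sma': x.sum(axis=1) / 100,
})
fft_features['fft_maxmin_diff'] = fft_features['fft_max'] - fft_features['fft_min']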
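
To check that axis=1 reductions give the same numbers as a row-wise apply, here is a minimal self-contained sketch on synthetic data. The function name fft_row_features and the random input are assumptions made for illustration, it divides by the row length instead of the hard-coded 100, and it takes the FFT along each row, so adjust the axis if your features are meant to be column-wise.

```python
import numpy as np
import pandas as pd
from scipy import stats
from scipy.signal import find_peaks


def fft_row_features(data):
    """Row-wise statistics of the FFT magnitude, using axis=1 reductions.

    Hypothetical helper for illustration; not part of pandas or SciPy.
    """
    x = np.abs(np.fft.fft(np.asarray(data, dtype=float), axis=1))
    n = x.shape[1]                                  # row length (100 in the question)
    mean = x.mean(axis=1, keepdims=True)            # 2-D so it broadcasts against x
    median = np.median(x, axis=1, keepdims=True)
    q75, q25 = np.percentile(x, [75, 25], axis=1)
    return pd.DataFrame({
        "fft_mean": mean[:, 0],
        "fft_std": x.std(axis=1, ddof=1),           # ddof=1 matches Series.std()
        "fft_aad": np.abs(x - mean).mean(axis=1),
        "fft_min": x.min(axis=1),
        "fft_max": x.max(axis=1),
        "fft_maxmin_diff": x.max(axis=1) - x.min(axis=1),
        "fft_median": median[:, 0],
        "fft_mad": np.median(np.abs(x - median), axis=1),
        "fft_IQR": q75 - q25,
        "fft_above_mean": (x > mean).sum(axis=1),
        # find_peaks accepts only 1-D input, so this one still loops over rows
        "fft_peak_count": [len(find_peaks(row)[0]) for row in x],
        "fft_skewness": stats.skew(x, axis=1),
        "fft_kurtosis": stats.kurtosis(x, axis=1),
        "fft_energy": (x ** 2).sum(axis=1) / n,
        "fft_sma": x.sum(axis=1) / n,
    })


# Small synthetic input standing in for Xtrain_stats (assumed shape: rows x 100)
rng = np.random.default_rng(0)
demo = rng.normal(size=(50, 100))
features = fft_row_features(demo)

# Cross-check two columns against the slow row-wise apply
mag = pd.DataFrame(np.abs(np.fft.fft(demo, axis=1)))
slow_std = mag.apply(lambda r: r.std(), axis=1)
slow_iqr = mag.apply(lambda r: np.percentile(r, 75) - np.percentile(r, 25), axis=1)
print(np.allclose(features["fft_std"], slow_std))
print(np.allclose(features["fft_IQR"], slow_iqr))
```

The only remaining Python-level loop is the peak count, which is why it is likely to dominate the runtime on a frame with several hundred thousand rows.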
