Alternative to apply function for applying a function to each row in Pandas DataFrame

Question

Applying a function to each row of a Pandas DataFrame takes a long time, and I would like to find a faster way to do some feature engineering.

Here is my data:

0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  ... 60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99
0   2.311906    2.312835    2.315155    2.316544    2.315387    2.386366    2.920247    3.545590    2.816790    2.253111    2.260460    2.282920    2.311673    2.344402    2.375047    2.395634    2.547970    2.990950    3.029762    3.052723    2.523181    2.362719    2.377015    2.390256    2.354047    2.358948    2.656481    3.007007    2.500126    2.335350    2.338300    2.338754    2.335804    2.427132    2.341017    2.352707    1.882517    1.188852    1.192237    1.192237    ... -0.615403   0.204050    0.031022    -0.596637   -0.890989   0.828287    2.068090    2.065448    2.075396    2.077142    2.079175    2.082072    2.062504    2.070433    2.065153    2.066916    2.068090    2.069848    2.067503    2.059551    2.080045    1.232675    -0.630660   0.078983    0.084827    0.078983    0.078983    0.080934    0.084827    1.950281    2.164140    0.952579    -0.094829   -0.311746   -0.222131   -0.216938   -0.149240   -0.209196   -0.211771   -0.196420
1   2.312835    2.315155    2.316544    2.315387    2.386366    2.920247    3.545590    2.816790    2.253111    2.260460    2.282920    2.311673    2.344402    2.375047    2.395634    2.547970    2.990950    3.029762    3.052723    2.523181    2.362719    2.377015    2.390256    2.354047    2.358948    2.656481    3.007007    2.500126    2.335350    2.338300    2.338754    2.335804    2.427132    2.341017    2.352707    1.882517    1.188852    1.192237    1.192237    1.061397    ... 0.204050    0.031022    -0.596637   -0.890989   0.828287    2.068090    2.065448    2.075396    2.077142    2.079175    2.082072    2.062504    2.070433    2.065153    2.066916    2.068090    2.069848    2.067503    2.059551    2.080045    1.232675    -0.630660   0.078983    0.084827    0.078983    0.078983    0.080934    0.084827    1.950281    2.164140    0.952579    -0.094829   -0.311746   -0.222131   -0.216938   -0.149240   -0.209196   -0.211771   -0.196420   -0.243163
2   2.315155    2.316544    2.315387    2.386366    2.920247    3.545590    2.816790    2.253111    2.260460    2.282920    2.311673    2.344402    2.375047    2.395634    2.547970    2.990950    3.029762    3.052723    2.523181    2.362719    2.377015    2.390256    2.354047    2.358948    2.656481    3.007007    2.500126    2.335350    2.338300    2.338754    2.335804    2.427132    2.341017    2.352707    1.882517    1.188852    1.192237    1.192237    1.061397    0.975265    ... 0.031022    -0.596637   -0.890989   0.828287    2.068090    2.065448    2.075396    2.077142    2.079175    2.082072    2.062504    2.070433    2.065153    2.066916    2.068090    2.069848    2.067503    2.059551    2.080045    1.232675    -0.630660   0.078983    0.084827    0.078983    0.078983    0.080934    0.084827    1.950281    2.164140    0.952579    -0.094829   -0.311746   -0.222131   -0.216938   -0.149240   -0.209196   -0.211771   -0.196420   -0.243163   -0.186309
3   2.316544    2.315387    2.386366    2.920247    3.545590    2.816790    2.253111    2.260460    2.282920    2.311673    2.344402    2.375047    2.395634    2.547970    2.990950    3.029762    3.052723    2.523181    2.362719    2.377015    2.390256    2.354047    2.358948    2.656481    3.007007    2.500126    2.335350    2.338300    2.338754    2.335804    2.427132    2.341017    2.352707    1.882517    1.188852    1.192237    1.192237    1.061397    0.975265    0.965243    ... -0.596637   -0.890989   0.828287    2.068090    2.065448    2.075396    2.077142    2.079175    2.082072    2.062504    2.070433    2.065153    2.066916    2.068090    2.069848    2.067503    2.059551    2.080045    1.232675    -0.630660   0.078983    0.084827    0.078983    0.078983    0.080934    0.084827    1.950281    2.164140    0.952579    -0.094829   -0.311746   -0.222131   -0.216938   -0.149240   -0.209196   -0.211771   -0.196420   -0.243163   -0.186309   -0.211771

It has 100 columns and 352953 rows, and running the code below takes 25 minutes in Colab. Here are the features I'm trying to create with the apply function:

x_list_fft = pd.DataFrame(Xtrain_stats).apply(lambda x: np.abs(np.fft.fft(x)))

fft_features = pd.DataFrame()

# mean
fft_features['fft_mean'] = x_list_fft.apply(lambda x: x.mean(), axis = 1)
# std dev
fft_features['fft_std'] = x_list_fft.apply(lambda x: x.std(), axis = 1)
# avg absolute diff
fft_features['fft_aad'] = x_list_fft.apply(lambda x: np.mean(np.absolute(x - np.mean(x))), axis = 1)
# min
fft_features['fft_min'] = x_list_fft.apply(lambda x: x.min(), axis = 1)
# max
fft_features['fft_max'] = x_list_fft.apply(lambda x: x.max(), axis = 1)
# max-min diff
fft_features['fft_maxmin_diff'] = fft_features['fft_max'] - fft_features['fft_min']
# median
fft_features['fft_median'] = x_list_fft.apply(lambda x: np.median(x), axis = 1)
# median abs dev 
fft_features['fft_mad'] = x_list_fft.apply(lambda x: np.median(np.absolute(x - np.median(x))), axis = 1)
# interquartile range
fft_features['fft_IQR'] = x_list_fft.apply(lambda x: np.percentile(x, 75) - np.percentile(x, 25), axis = 1)
# negative count
fft_features['fft_neg_count'] = x_list_fft.apply(lambda x: np.sum(x < 0), axis = 1)
# positive count
fft_features['fft_pos_count'] = x_list_fft.apply(lambda x: np.sum(x > 0), axis = 1)
# values above mean
fft_features['fft_above_mean'] = x_list_fft.apply(lambda x: np.sum(x > x.mean()), axis = 1)
# number of peaks
fft_features['fft_peak_count'] = x_list_fft.apply(lambda x: len(find_peaks(x)[0]), axis = 1)
# skewness
fft_features['fft_skewness'] = x_list_fft.apply(lambda x: stats.skew(x), axis = 1)
# kurtosis
fft_features['fft_kurtosis'] = x_list_fft.apply(lambda x: stats.kurtosis(x), axis = 1)
# energy
fft_features['fft_energy'] = x_list_fft.apply(lambda x: np.sum(x**2)/100, axis = 1)
# signal magnitude area
fft_features['fft_sma'] = x_list_fft.apply(lambda x: np.sum(abs(x)/100), axis = 1)

So what can I do to make it faster?

Answer 1

One solution is to call the NumPy and SciPy functions on the whole array at once instead of row by row, although I'm not sure about the find_peaks one because I can't test it:

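import numpy as np
import pandas as pd
from scipy import stats
from scipy.signal import find_peaks

# Magnitude spectrum of each row (axis=1). The question's apply() call uses the
# default axis=0 (one FFT per column); change the axis here if that is intended.
x = np.abs(np.fft.fft(np.asarray(Xtrain_stats), axis=1))
# keepdims=True gives shape (n_rows, 1), so these broadcast against x row by row
row_mean = x.mean(axis=1, keepdims=True)
row_median = np.median(x, axis=1, keepdims=True)
q75, q25 = np.percentile(x, [75, 25], axis=1)

fft_features = pd.DataFrame({
    'fft_mean': row_mean.ravel(),
    'fft_std': x.std(axis=1, ddof=1),  # ddof=1 matches the default of pandas' std()
    'fft_aad': np.mean(np.abs(x - row_mean), axis=1),
    'fft_min': x.min(axis=1),
    'fft_max': x.max(axis=1),
    'fft_maxmin_diff': x.max(axis=1) - x.min(axis=1),
    'fft_median': row_median.ravel(),
    'fft_mad': np.median(np.abs(x - row_median), axis=1),
    'fft_IQR': q75 - q25,
    'fft_neg_count': np.sum(x < 0, axis=1),
    'fft_pos_count': np.sum(x > 0, axis=1),
    'fft_above_mean': np.sum(x > row_mean, axis=1),
    # find_peaks only accepts 1-D input, so this one still loops over the rows
    'fft_peak_count': [len(find_peaks(row)[0]) for row in x],
    'fft_skewness': stats.skew(x, axis=1),
    'fft_kurtosis': stats.kurtosis(x, axis=1),
    'fft_energy': np.sum(x**2, axis=1) / 100,
    'fft_sma': np.sum(np.abs(x), axis=1) / 100,
})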
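
Since I couldn't run this against the real data, here is a small sanity check you could adapt. It is only a sketch: the random array X is a made-up stand-in for Xtrain_stats (50 rows instead of 352953), the FFT is taken per row, and only a handful of the features are compared. It computes each feature once with the row-wise apply from the question and once on the whole array with axis=1, then checks that the two agree.

```python
import numpy as np
import pandas as pd
from scipy import stats
from scipy.signal import find_peaks

# Hypothetical stand-in for Xtrain_stats: 50 rows of 100 samples each.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 100))

# Row-wise magnitude spectrum, as a plain array and as a DataFrame.
spec = np.abs(np.fft.fft(X, axis=1))
spec_df = pd.DataFrame(spec)

# Slow path: apply() calls a Python function once per row.
slow = pd.DataFrame({
    'std': spec_df.apply(lambda r: r.std(), axis=1),
    'mad': spec_df.apply(lambda r: np.median(np.absolute(r - np.median(r))), axis=1),
    'iqr': spec_df.apply(lambda r: np.percentile(r, 75) - np.percentile(r, 25), axis=1),
    'skew': spec_df.apply(lambda r: stats.skew(r), axis=1),
    'peaks': spec_df.apply(lambda r: len(find_peaks(r)[0]), axis=1),
})

# Fast path: each reduction runs over the whole array along axis=1.
med = np.median(spec, axis=1, keepdims=True)
q75, q25 = np.percentile(spec, [75, 25], axis=1)
fast = pd.DataFrame({
    'std': spec.std(axis=1, ddof=1),  # pandas' std() defaults to ddof=1
    'mad': np.median(np.abs(spec - med), axis=1),
    'iqr': q75 - q25,
    'skew': stats.skew(spec, axis=1),
    # find_peaks needs 1-D input, so this stays a loop over rows
    'peaks': [len(find_peaks(row)[0]) for row in spec],
})

features_match = np.allclose(slow.to_numpy(), fast.to_numpy())
print("row-wise apply and axis=1 versions agree:", features_match)
```

The one detail that is easy to miss is the standard deviation: NumPy's std defaults to ddof=0 while the pandas version used inside apply defaults to ddof=1, so without ddof=1 the fft_std column would differ slightly from the original.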

solution1
0 2021-11-12 19:14:05
