
How do you filter rows in a dataframe based on row numbers from a Python list?

I have a Pandas dataframe with two columns, x and y, that correspond to a large signal. It is about 3 million rows in size.

[Figure: Wavelength from dataframe]

I am trying to isolate the peaks from the signal. After using scipy, I got a 1D Python list corresponding to the indexes of the peaks. However, these are not the actual x-values of the signal, just the indexes of the corresponding rows:

from scipy.signal import find_peaks
peaks, _ = find_peaks(y, height=(None, peakline))
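
For reference, a minimal self-contained setup along these lines might look as follows (a sketch: the example data, the column names, and the peakline threshold value are assumptions based on the description above):

import numpy as np
import pandas as pd
from scipy.signal import find_peaks

# Hypothetical stand-in for the real data: a two-column DataFrame holding
# the signal, plus an assumed height threshold called peakline.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "x": np.linspace(400.0, 700.0, 10_000),   # e.g. wavelength values
    "y": rng.normal(size=10_000),             # noisy signal values
})
peakline = 2.0  # assumed threshold

# find_peaks works on a 1D array and returns positional row indices,
# not x-values.
peaks, _ = find_peaks(data["y"].to_numpy(), height=(None, peakline))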

So, I decided I would just filter the original dataframe by setting all values in its y column to NaN unless they were at an index found in the peak list. I did this iteratively; however, since there are 3,000,000 rows, it is extremely slow:

peak_index = 0
for data_index in list(data.index):
    # Blank out y (column position 1) for every row that is not the next peak
    if peak_index >= len(peaks) or data_index != peaks[peak_index]:
        data.iloc[data_index, 1] = float('NaN')
    else:
        peak_index += 1

Does anyone know what a faster method of filtering a Pandas dataframe might be?

Looping is in most cases extremely inefficient when it comes to pandas. Assuming you just need a filtered DataFrame that contains the values of the x and y columns only at the rows where y is a peak, you may use the following piece of code:

df.iloc[peaks]
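
For example, on a small hypothetical DataFrame (column names assumed to match the question, with peaks at positional rows 1 and 3):

import pandas as pd

df = pd.DataFrame({"x": [10.0, 10.1, 10.2, 10.3, 10.4],
                   "y": [0.1, 2.5, 0.2, 3.1, 0.0]})
peaks = [1, 3]

df.iloc[peaks]
#       x    y
# 1  10.1  2.5
# 3  10.3  3.1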

Alternatively, if you are hoping to keep the original DataFrame, with the y column retaining its peak values and holding NaN everywhere else, then use:

df.y = df.y.where(df.y.iloc[peaks] == df.y.iloc[peaks])
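
This works because rows that are not in peaks end up with a missing condition after alignment and are therefore replaced with NaN. If you prefer something more explicit, a positional boolean mask does the same job (a sketch, assuming peaks holds positional indices as returned by find_peaks):

import numpy as np

# True only at the peak rows; y at every other row becomes NaN.
mask = np.zeros(len(df), dtype=bool)
mask[peaks] = True
df.y = df.y.where(mask)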

Finally, since you seem to care about just the x values of the peaks, you can rework the first snippet as follows:

df.iloc[peaks].x
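
If you need those x values as a plain array or list rather than a pandas Series (an assumption about what you do with them next), convert the result:

peak_x = df.iloc[peaks].x.to_numpy()   # NumPy array of the peaks' x values
# or: peak_x_list = df.iloc[peaks].x.tolist()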
