I have two dataframes: 'data', which contains all the data, and 'peak_data', which contains the same columns but only a small selection of the rows.
I have created a column which shows the 'time_difference' between adjacent rows in 'peak_data'.
I want to label the rows in 'data' (in a column called 'cycle') with a number that increments each time the next peak is reached (peaks are marked by a binary flag in the 'peak' column of 'data'), as long as the 'time_difference' in 'peak_data' for that interval is less than 2.
A small example of the 'data' dataframe:
       time  pressure_1  pressure_2  ...    accel_z  peak  cycle
0  0.000000    0.245956    0.048084  ...   0.155026     0    NaN
1  0.002000    0.245957    0.047805  ...   0.073971     0    NaN
2  0.002333    0.245984    0.047586  ...  -0.056461     0    NaN
3  0.002667    0.246048    0.047464  ...   0.013302     0    NaN
4  0.003000    0.246161    0.047462  ...   0.047970     0    NaN
A small example of the 'peak_data' dataframe:
       time  pressure_1  pressure_2  ...    accel_z  peak  time_difference
269   1.314    0.134094    0.036958  ...  -0.160587   1.0              NaN
555   2.754    0.091645    0.032614  ...  -0.514713   1.0            1.440
811   4.064    0.096233    0.049880  ...  -0.433658   1.0            1.310
1057  5.300    0.094882    0.032966  ...  -0.867374   1.0            1.236
1304  6.522    0.107792    0.040102  ...  -0.503299   1.0            1.222
This is what I would like to see for the rows around the first peak (rows before the first peak should stay unlabelled). The next interval should then be labelled '2', the one after that '3', and so on.
         time  pressure_1  pressure_2  ...    accel_z  peak  cycle
265  1.294000    0.141472    0.033975  ...  -0.027896     0    NaN
266  1.299000    0.140781    0.034691  ...  -0.110416     0    NaN
267  1.304000    0.139336    0.035434  ...  -0.103580     0    NaN
268  1.309000    0.137103    0.036195  ...   0.159482     0    NaN
269  1.314000    0.134094    0.036958  ...  -0.160587     1      1
270  1.322000    0.130359    0.037705  ...  -0.489627     0      1
271  1.329000    0.125974    0.038417  ...  -0.832096     0      1
272  1.332000    0.121045    0.039078  ...  -0.639713     0      1
273  1.334000    0.115730    0.039676  ...  -0.565494     0      1
274  1.339000    0.110218    0.040197  ...  -0.475040     0      1
This is the code that deals with the problem outlined:
data['cycle'] = np.nan
cycle_num = 1
for index, row in peak_data.iterrows():
    if peak_data.loc[index,'time_difference'] == np.nan:
        pass
    elif peak_data.loc[index,'time_difference'] < 2:
        start = peak_data.loc[index,'index']
        end = peak_data.loc[index,'index']
        data.loc[start : end,'cycle'] = cycle_num
        cycle_num += 1
The code above gives me a KeyError: 'index'. Previously I had 'time' in place of 'index', and I'm just not sure why it fails.
Is this the way I should be approaching the problem, or is there a better way? Any pointers will be much appreciated!
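On the KeyError itself: a likely cause is that `.loc[index, 'index']` asks for a column named 'index', which peak_data does not have. The row label is already available as the loop variable. Separately, `== np.nan` is never true, so the first branch never runs. The sketch below illustrates both points on made-up toy data and then shows one possible repaired loop. It assumes a default integer index on data and that 'time_difference' on a peak row is the gap back to the previous peak; the toy values and the name `prev_label` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the question's frames (values are made up).
data = pd.DataFrame({
    "time": [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 6.0, 6.5],
    "peak": [0, 1, 0, 1, 0, 1, 0, 1, 0],
})
peak_data = data[data["peak"] == 1].copy()
peak_data["time_difference"] = peak_data["time"].diff()

# 1) NaN never compares equal to anything, including itself,
#    so test for it with pd.isna instead of ==.
nan_equals_nan = (np.nan == np.nan)
first_is_nan = pd.isna(peak_data["time_difference"].iloc[0])

# 2) There is no column called 'index', so this lookup raises KeyError.
try:
    peak_data.loc[peak_data.index[0], "index"]
    raised = False
except KeyError:
    raised = True

# 3) Repaired loop: label the rows from the previous peak up to the row
#    just before the current peak, if the gap between them is < 2.
#    Assumes consecutive integer index labels, so `label - 1` exists.
data["cycle"] = np.nan
cycle_num = 1
prev_label = None
for label, diff in peak_data["time_difference"].items():
    # NaN < 2 is False, so the first peak is skipped without a special case.
    if prev_label is not None and diff < 2:
        data.loc[prev_label:label - 1, "cycle"] = cycle_num
        cycle_num += 1
    prev_label = label
```

Note that `.loc` slices include both endpoints, which is why the slice stops at `label - 1` rather than at `label`.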
It looks like the integer index labels of the peak_data DataFrame exactly correspond to the index labels of the target rows in data. If that's always true for your full dataset, then something like this should work:
import numpy as np

# Get index labels for short (< 2 s) and long (>= 2 s) peaks.
# The first peak has a NaN time_difference, so it is in neither
# group and keeps its original value of 1 in data['peak'].
short_peaks = peak_data[peak_data['time_difference'] < 2].index
long_peaks = peak_data[peak_data['time_difference'] >= 2].index
# Label short-peak rows with 1, long with -1
data.loc[short_peaks, 'peak'] = 1
data.loc[long_peaks, 'peak'] = -1
# Hack: build a flag column that labels ALL rows
# belonging to a short peak with 1, and ALL rows
# belonging to a long peak with -1. Non-peak rows hold 0,
# so blank them out first or ffill has nothing to fill.
data['flag'] = data['peak'].where(data['peak'] != 0).ffill()
# Count only the short peaks, so that a long peak does not
# pull the running cycle number back down
cycle = (data['peak'] == 1).cumsum()
# Finally, keep the cycle number only on rows belonging to a
# short peak; long-peak rows and rows before the first peak get nan
data['cycle'] = cycle.where(data['flag'] == 1)
# Restore the binary peak column and drop the flag column
data['peak'] = data['peak'].abs()
data = data.drop(columns='flag')
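To try the flag-and-forward-fill idea end to end, here is a minimal self-contained sketch on made-up toy data. It is a variant, not the exact code above: it leaves the 'peak' column untouched, and it assumes the gap that matters for a run of rows is the one to the next peak, so the differences are shifted up by one row. The function name `label_cycles`, the parameter `max_gap`, and the toy values are all hypothetical.

```python
import numpy as np
import pandas as pd

def label_cycles(data, max_gap=2.0):
    """Number the runs of rows between consecutive peaks (hypothetical helper).

    A run starts at a peak row and ends just before the next peak. It is
    numbered only if the time to the next peak is below max_gap.
    """
    out = data.copy()
    peaks = out.loc[out["peak"] == 1, "time"]
    # Duration of the interval that STARTS at each peak (NaN for the last one).
    gap_after = peaks.diff().shift(-1)
    is_short = gap_after < max_gap  # NaN compares as False
    # Running count of short intervals, kept only where the interval is short.
    number = is_short.cumsum().where(is_short)
    # Put the numbers on the peak rows; -1 marks a run to leave unlabelled.
    marker = pd.Series(np.nan, index=out.index)
    marker.loc[peaks.index] = number.fillna(-1)
    # Carry each marker forward to the next peak, then blank the -1 runs.
    filled = marker.ffill()
    out["cycle"] = filled.where(filled > 0)
    return out

# Toy data: peaks at t = 0.5, 1.5, 2.5 and 6.0.
toy = pd.DataFrame({
    "time": [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 6.0, 6.5],
    "peak": [0, 1, 0, 1, 0, 1, 0, 1, 0],
})
result = label_cycles(toy)
print(result)
```

With this toy input the first two intervals are labelled 1 and 2, while the rows before the first peak, the long 3.5 s interval, and the rows after the last peak stay NaN.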