Assign a subset of rows in a dataframe with a value (in one column) based on information in another dataframe
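I have two dataframes: "data", which contains all of my data, and "peak_data", which has the same columns but only a small selection of the rows.
I have created a column that shows the "time_difference" between adjacent rows in "peak_data".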
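For context, here is a minimal sketch of how such a column could be produced. It assumes that "peak_data" is simply the set of rows of "data" where "peak" equals 1, which the question does not state explicitly, and it uses a small made-up frame with only the two columns needed.

```python
import pandas as pd

# Made-up stand-in for "data": only the columns needed here
data = pd.DataFrame({
    "time": [1.309, 1.314, 1.322, 2.754, 2.760, 4.064],
    "peak": [0, 1, 0, 1, 0, 1],
})

# Keep only the peak rows (assumption); the original index labels are preserved
peak_data = data[data["peak"] == 1].copy()

# Time elapsed since the previous peak; the first peak has no predecessor, so NaN
peak_data["time_difference"] = peak_data["time"].diff()

print(peak_data)
```

Because the boolean selection keeps the original index labels, each row of "peak_data" can later be matched back to its row in "data" by label.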
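I would like to label the rows in "data" (in the "cycle" column) with a number that changes each time the next "peak" is reached (peaks are identified by a binary flag in the "peak" column of the "data" dataframe) and the "time_difference" in "peak_data" for that interval is less than 2.
A small example of the "data" dataframe: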
           time  pressure_1  pressure_2  ...    accel_z  peak  cycle
    0  0.000000    0.245956    0.048084  ...   0.155026     0    NaN
    1  0.002000    0.245957    0.047805  ...   0.073971     0    NaN
    2  0.002333    0.245984    0.047586  ...  -0.056461     0    NaN
    3  0.002667    0.246048    0.047464  ...   0.013302     0    NaN
    4  0.003000    0.246161    0.047462  ...   0.047970     0    NaN
A small example of the "peak_data" dataframe:
           time  pressure_1  pressure_2  ...    accel_z  peak  time_difference
     269  1.314    0.134094    0.036958  ...  -0.160587   1.0              NaN
     555  2.754    0.091645    0.032614  ...  -0.514713   1.0            1.440
     811  4.064    0.096233    0.049880  ...  -0.433658   1.0            1.310
    1057  5.300    0.094882    0.032966  ...  -0.867374   1.0            1.236
    1304  6.522    0.107792    0.040102  ...  -0.503299   1.0            1.222
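What I would like to see is the rows after the first identified peak labelled "1" (I don't want to label any data before that point). For the next interval I would like the rows labelled "2", then "3", and so on.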
             time  pressure_1  pressure_2  ...    accel_z  peak  cycle
    265  1.294000    0.141472    0.033975  ...  -0.027896     0    NaN
    266  1.299000    0.140781    0.034691  ...  -0.110416     0    NaN
    267  1.304000    0.139336    0.035434  ...  -0.103580     0    NaN
    268  1.309000    0.137103    0.036195  ...   0.159482     0    NaN
    269  1.314000    0.134094    0.036958  ...  -0.160587     1      1
    270  1.322000    0.130359    0.037705  ...  -0.489627     0      1
    271  1.329000    0.125974    0.038417  ...  -0.832096     0      1
    272  1.332000    0.121045    0.039078  ...  -0.639713     0      1
    273  1.334000    0.115730    0.039676  ...  -0.565494     0      1
    274  1.339000    0.110218    0.040197  ...  -0.475040     0      1
This is the code I have so far:
    data['cycle'] = np.nan
    cycle_num = 1
    for index, row in peak_data.iterrows():
        if peak_data.loc[index,'time_difference'] == np.nan:
            pass
        elif peak_data.loc[index,'time_difference'] < 2:
            start = peak_data.loc[index,'index']
            end = peak_data.loc[index,'index']
            data.loc[start : end,'cycle'] = cycle_num
            cycle_num += 1
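The code above gives me a KeyError: 'index'. Previously I had 'time' in that position, but I'm not sure why it fails.
Is this the way I should be approaching the problem, or is there a better way? Any pointers would be much appreciated!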
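Two details of the loop in the question can be isolated on a tiny example. This is only a sketch on a made-up two-row frame, not the asker's real data. It illustrates that the row labels of "peak_data" live in its index rather than in a column named 'index' (so `.loc[label, 'index']` raises the KeyError), and that comparing a value to `np.nan` with `==` never succeeds, which is why `pd.isna` is the usual check.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for peak_data; the index labels mimic row labels in "data"
peak_data = pd.DataFrame(
    {"time": [1.314, 2.754], "time_difference": [np.nan, 1.440]},
    index=[269, 555],
)

# The row labels are the index, not a column named 'index'
has_index_column = "index" in peak_data.columns

try:
    peak_data.loc[269, "index"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# In iterrows(), the first element of each pair is already the row label
labels = [index for index, row in peak_data.iterrows()]

# NaN never compares equal to anything, including itself
nan_equals_nan = peak_data.loc[269, "time_difference"] == np.nan
nan_detected = pd.isna(peak_data.loc[269, "time_difference"])

print(has_index_column, raised_key_error, labels, nan_equals_nan, nan_detected)
```

In other words, the `index` variable yielded by `iterrows()` can be used directly as the label for `data.loc[...]`, provided both frames share index labels.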
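It looks like the integer index labels of the peak_data dataframe correspond exactly to the index labels of the target rows in data. If that is always the case for your complete dataset, then something like this should work: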
    import numpy as np

    # Get index labels for short (< 2 s) and long (>= 2 s) peaks.
    # The first peak has a NaN time_difference, so it is in neither group.
    short_peaks = peak_data[peak_data['time_difference'] < 2].index
    long_peaks = peak_data[peak_data['time_difference'] >= 2].index

    # Flag column: 1 on short-peak rows, -1 on long-peak rows, NaN elsewhere
    data['flag'] = np.nan
    data.loc[short_peaks, 'flag'] = 1
    data.loc[long_peaks, 'flag'] = -1

    # Running cycle counter: count the first peak and every short peak;
    # long peaks do not start a new numbered cycle
    data['cycle'] = (data['peak'].eq(1) & data['flag'].ne(-1)).cumsum()

    # Hack: forward-fill the flag so that ALL rows belonging to a short peak
    # carry 1, and ALL rows belonging to a long peak carry -1
    data['flag'] = data['flag'].ffill()

    # Finally, set "cycle" to nan for rows before the first peak (cycle == 0)
    # and for all rows belonging to a long peak
    data['cycle'] = data['cycle'].where((data['cycle'] > 0) & (data['flag'] != -1))

    # Drop the flag column
    data = data.drop(columns='flag')
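For comparison, the loop from the question can also be made to run. The following is a sketch on made-up data rather than a definitive implementation. It assumes that "peak_data" shares index labels with "data", that the index of "data" is sorted, that a cycle runs from one peak up to the row before the next peak, and that a cycle is numbered only when its starting peak is the first peak or has a "time_difference" below 2.

```python
import numpy as np
import pandas as pd

# Made-up data: peaks at labels 1, 3, 6 and 8
data = pd.DataFrame({
    "time": [0.0, 1.0, 1.5, 2.2, 3.0, 4.0, 5.0, 5.5, 6.1, 6.5],
    "peak": [0, 1, 0, 1, 0, 0, 1, 0, 1, 0],
})
peak_data = data[data["peak"] == 1].copy()
peak_data["time_difference"] = peak_data["time"].diff()

data["cycle"] = np.nan
cycle_num = 1
peak_labels = list(peak_data.index)

for i, start in enumerate(peak_labels):
    diff = peak_data.loc[start, "time_difference"]
    # A cycle runs from this peak up to (not including) the next peak,
    # or to the last row of data for the final peak
    if i + 1 < len(peak_labels):
        rows = (data.index >= start) & (data.index < peak_labels[i + 1])
    else:
        rows = data.index >= start
    # Number the cycle if this is the first peak (NaN) or a short interval
    if pd.isna(diff) or diff < 2:
        data.loc[rows, "cycle"] = cycle_num
        cycle_num += 1

print(data)
```

The loop is easier to follow than the cumulative-sum version, but it touches "data" once per peak, so the vectorized form is likely the better fit for long recordings with many peaks.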