Assign a subset of rows in a dataframe with a value (in one column) based on information in another dataframe

I have two dataframes: 'data', which contains all the data, and 'peak_data', which contains the same columns but only a small selection of the rows.

I have created a column that shows the 'time_difference' between adjacent rows in 'peak_data'.

I want to label the rows in 'data' (in a column called 'cycle') with a number that changes when it reaches the next peak (identified by a binary flag in the 'peak' column of 'data'), as long as the 'time_difference' in 'peak_data' for that interval is less than 2.

A small example of the 'data' dataframe:

       time  pressure_1  pressure_2  ...   accel_z  peak  cycle
0  0.000000    0.245956    0.048084  ...  0.155026     0    NaN
1  0.002000    0.245957    0.047805  ...  0.073971     0    NaN
2  0.002333    0.245984    0.047586  ... -0.056461     0    NaN
3  0.002667    0.246048    0.047464  ...  0.013302     0    NaN
4  0.003000    0.246161    0.047462  ...  0.047970     0    NaN

A small example of the 'peak_data' dataframe:

       time  pressure_1  pressure_2  ...   accel_z  peak  time_difference
269   1.314    0.134094    0.036958  ... -0.160587   1.0              NaN
555   2.754    0.091645    0.032614  ... -0.514713   1.0            1.440
811   4.064    0.096233    0.049880  ... -0.433658   1.0            1.310
1057  5.300    0.094882    0.032966  ... -0.867374   1.0            1.236
1304  6.522    0.107792    0.040102  ... -0.503299   1.0            1.222

What I would like to see, for the rows after the first peak has been identified, is this (before that I don't want to label the data). For the next interval I'd want the rows labelled '2', then '3', and so on.

       time    pressure_1  pressure_2  ...   accel_z  peak  cycle
265  1.294000    0.141472    0.033975  ... -0.027896     0    NaN
266  1.299000    0.140781    0.034691  ... -0.110416     0    NaN
267  1.304000    0.139336    0.035434  ... -0.103580     0    NaN
268  1.309000    0.137103    0.036195  ...  0.159482     0    NaN
269  1.314000    0.134094    0.036958  ... -0.160587     1    1
270  1.322000    0.130359    0.037705  ... -0.489627     0    1
271  1.329000    0.125974    0.038417  ... -0.832096     0    1
272  1.332000    0.121045    0.039078  ... -0.639713     0    1
273  1.334000    0.115730    0.039676  ... -0.565494     0    1
274  1.339000    0.110218    0.040197  ... -0.475040     0    1

This is the code that deals with the problem outlined:

data['cycle'] = np.nan

cycle_num = 1

for index, row in peak_data.iterrows():        
    if peak_data.loc[index,'time_difference'] == np.nan:
        pass
    elif peak_data.loc[index,'time_difference'] < 2:
        start = peak_data.loc[index,'index'] 
        end = peak_data.loc[index,'index']
        data.loc[start : end,'cycle'] = cycle_num
        cycle_num += 1

The code above gives me a KeyError: 'index' (previously I had it as 'time'), and I'm just not sure why.

Is this the way I should be approaching the problem, or is there a better way? Any pointers will be much appreciated!
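
On the KeyError itself: 'peak_data' has no column named 'index'. The row labels are stored in the DataFrame's index, which is exactly what `iterrows()` already yields as its first value. Separately, a comparison such as `x == np.nan` is always False, so the first branch of the loop never runs; `pd.isna(x)` is the usual check. A minimal sketch of both points, using made-up toy frames rather than the real data:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the asker's frames (values are made up for illustration)
data = pd.DataFrame({
    "time": [0.0, 0.5, 1.0, 1.5, 2.0, 2.5],
    "peak": [0, 1, 0, 0, 1, 0],
})
peak_data = data[data["peak"] == 1].copy()
peak_data["time_difference"] = peak_data["time"].diff()

# Filtering keeps the original row labels, so they live in the index,
# not in a column called 'index'
has_index_column = "index" in peak_data.columns
peak_labels = list(peak_data.index)

# NaN never compares equal to anything, including itself
first_gap = peak_data["time_difference"].iloc[0]
equals_nan = first_gap == np.nan
is_missing = pd.isna(first_gap)
```

Inside the loop, the `index` variable from `iterrows()` can therefore be used directly as the row label in 'data', assuming 'peak_data' was built by filtering 'data' as above.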

It looks like the integer index labels of the peak_data DataFrame exactly correspond to the index labels of the target rows in data. If that's always true for your full dataset, then something like this should work:

import numpy as np

# Get index labels for short (< 2 s) and long (>= 2 s) peaks.
# The first peak has a NaN time_difference, so it falls in neither
# group and keeps its existing peak value of 1.
short_peaks = peak_data[peak_data['time_difference'] < 2].index
long_peaks = peak_data[peak_data['time_difference'] >= 2].index

# Label short-peak rows with 1, long with -1
data.loc[short_peaks, 'peak'] = 1
data.loc[long_peaks, 'peak'] = -1

# Number the cycles: the counter increases by one at every peak
# marked 1 (cast to float so the column can hold nan later)
data['cycle'] = (data['peak'] == 1).cumsum().astype(float)

# Hack: build a flag column that labels ALL rows
# belonging to a short peak with 1, and ALL rows
# belonging to a long peak with -1. Non-peak rows must be
# nan (not 0) for ffill to carry the peak's value forward.
data['flag'] = data['peak'].replace(0, np.nan).ffill()

# Finally, overwrite the "cycle" value with nan for all rows
# belonging to a long peak, and for the rows before the first
# peak (where the flag is still nan)
data.loc[data['flag'] != 1, 'cycle'] = np.nan

# Drop the flag column
data = data.drop(columns='flag')
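
To check the approach on something small, here is a self-contained sketch on made-up data that leaves the 'peak' column untouched. The helper name `label_cycles` and the `max_gap` parameter are hypothetical, not part of pandas. It assumes that the index labels of 'peak_data' also exist in 'data', that 'data' is sorted by time, and that a peak opens a labelled cycle when its 'time_difference' is missing (the first peak) or below the threshold.

```python
import numpy as np
import pandas as pd


def label_cycles(data, peak_data, max_gap=2.0):
    """Return a 'cycle' Series aligned with data (hypothetical helper).

    Assumes peak_data.index holds labels that also exist in data.index
    and that data is sorted by time.
    """
    gap = peak_data["time_difference"]
    # A peak opens a labelled cycle if its gap is short or missing (first peak)
    valid = gap.isna() | (gap < max_gap)

    # 1.0 at valid peaks, 0.0 at rejected peaks, NaN on every other row
    starts = pd.Series(np.nan, index=data.index)
    starts.loc[peak_data.index] = valid.astype(float).to_numpy()

    # Carry each peak's flag forward until the next peak
    in_valid_cycle = starts.ffill() == 1.0
    # A running count of valid peaks gives consecutive cycle numbers
    counter = (starts == 1.0).cumsum()
    # Keep the number only inside valid cycles; everything else becomes NaN
    return counter.where(in_valid_cycle)


# Toy data (made up): peaks at rows 1, 3, 4 and 6; the gap before row 4 is 2.5 s
data = pd.DataFrame({
    "time": [0.0, 0.5, 1.0, 1.5, 4.0, 4.5, 5.0, 5.5],
    "peak": [0, 1, 0, 1, 1, 0, 1, 0],
})
peak_data = data[data["peak"] == 1].copy()
peak_data["time_difference"] = peak_data["time"].diff()

cycle = label_cycles(data, peak_data)
```

On this toy frame the result is NaN, 1, 1, 2, NaN, NaN, 3, 3. Note that 'time_difference' at a peak measures the time since the previous peak, so this rule blanks the cycle that starts at a late peak. If the cycle that ends at the late peak should be blanked instead, shifting the column first (for example `peak_data['time_difference'].shift(-1)`) would be one way to express that.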

