简体   繁体   中英

Find a subset of columns based on another dataframe?

I'm collecting heart rate data across time for multiple subjects. Different events occur during the course of the data collection, so the start of each event is recorded elsewhere. Each event would have started at at a slightly different time for each subject. I would like to bridge the information between the two data frames so that I can know the mean heart rate of the different subjects during each chunk of time marked as an event. How can I get the mean heart rates between certain time points that are marked as events in another data frame? For instance, how can I find the mean heart rate between event 2 and event 3?

import pandas as pd 
import numpy as np

#example 
example_g = [["4/20/21 4:20", 302, 0, 1, 2, 3, 4, 5],
       ["2/17/21 9:20",135, 1, 1.4, 1.8, 2, 8, 10],
       ["2/17/21 9:20", 111, 4, 5, 5.1, 5.2, 5.3, 5.4]]
example_g_table = pd.DataFrame(example_g,columns=['Date_Time','CID', 0, 1, 2, 3, 4, 5])

#Example Timestamps
example_s = [["4/20/21 4:20",302,0, 2, 3],
       ["2/17/21 9:20",135,0, 1, 4 ],
       ["2/17/21 9:20",111,3, 4, 5 ]]
example_s_table = pd.DataFrame(example_s,columns=['Date_Time','CID', "event_1", "event_2", "event_3"])

desired_outcome = [["4/20/21 4:20",302,2.5],
       ["2/17/21 9:20",135, 3.3 ],
       ["2/17/21 9:20",111, 5.35 ]]

desired_outcome_table = pd.DataFrame(desired_outcome,columns=['Date_Time','CID', "Average of data between Event 2 and Event 3"])

I was able to put together a function that I think works for this, but assumes that columns don't change orders or more get added. If there would be changes to the df shape, this would need to be updated for that.

First, I merged together your example_g_table and example_s_table to get them all together.

df = pd.merge(left=example_g_table,right=example_s_table,on=['Date_Time','CID'],how='left')
       Date_Time    CID 0   1   2   3   4   5   event_1 event_2 event_3
0   4/20/21 4:20    302 0   1.0 2.0 3.0 4.0 5.0     0   2   3
1   2/17/21 9:20    135 1   1.4 1.8 2.0 8.0 10.0    0   1   4
2   2/17/21 9:20    111 4   5.0 5.1 5.2 5.3 5.4     3   4   5

Now we use a new function that will pull out the values of event_2 and event_3 , and return the average of the values of those previous column-values. We will later run df.apply on this, so it will take in just a row at a time, as a series (I think, anyway).

def func(df):
    event_2 = df['event_2']
    event_3 = df['event_3']
    start = int(event_2 + 2) # this assumes that the column called 0 will be the third (and starting at 0, it'll be the called 2), column 1 will be the third column, etc
    end = int(event_3 + 2) # same as above
    total = sum(df.iloc[start:end+1]) # this line is the key. It takes the sum of the values of columns in the range of start to finish
    avg = total/(end-start+1) #(end-start+1) gets the count of things in our range
    return avg

Last, we run df.apply on this to get our new column.

df['avg'] = df.apply(func,axis=1)
df
       Date_Time    CID 0   1   2   3   4   5   event_1 event_2 event_3 avg
0   4/20/21 4:20    302 0   1.0 2.0 3.0 4.0 5.0     0   2          3    2.50
1   2/17/21 9:20    135 1   1.4 1.8 2.0 8.0 10.0    0   1          4    3.30
2   2/17/21 9:20    111 4   5.0 5.1 5.2 5.3 5.4     3   4          5    5.35

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM