
Aggregate Data based on Two Different Assessment Methods in R

I'm looking to aggregate some pedometer data, gathered in steps per minute, so I get a summed number of steps up until each EMA assessment. The EMA assessments happened four times per day. Examples of the two data sets are:

Pedometer Data

ID Steps      Time
1   15   2/4/2020 8:32
1   23   2/4/2020 8:33
1   76   2/4/2020 8:34
1   32   2/4/2020 8:35
1   45   2/4/2020 8:36
...
2   16   2/4/2020 8:32
2   17   2/4/2020 8:33
2   0    2/4/2020 8:34
2   5    2/4/2020 8:35
2   8    2/4/2020 8:36

EMA Data

ID      Time      X Y
1  2/4/2020 8:36  3 4
1  2/4/2020 12:01 3 5
1  2/4/2020 3:30  4 5
1  2/4/2020 6:45  7 8
...
2  2/4/2020 8:35  4 6
2  2/4/2020 12:05 5 7
2  2/4/2020 3:39  1 3
2  2/4/2020 6:55  8 3

I'm looking to add the pedometer data to the EMA data as a new variable, where the number of steps taken is summed until the next EMA assessment. Ideally it would look something like this:

Combined Data

ID      Time      X Y Steps
1  2/4/2020 8:36  3 4 191
1  2/4/2020 12:01 3 5 [Sum of steps taken from 8:37 until 12:01 on 2/4/2020]
1  2/4/2020 3:30  4 5 [Sum of steps taken from 12:02 until 3:30 on 2/4/2020]
1  2/4/2020 6:45  7 8 [Sum of steps taken from 3:31 until 6:45 on 2/4/2020]
...
2  2/4/2020 8:35  4 6 38
2  2/4/2020 12:05 5 7 [Sum of steps taken from 8:36 until 12:05 on 2/4/2020]
2  2/4/2020 3:39  1 3 [Sum of steps taken from 12:06 until 3:39 on 2/4/2020]
2  2/4/2020 6:55  8 3 [Sum of steps taken from 3:40 until 6:55 on 2/4/2020]

I then need the process to continue over the entire 21 day EMA period, so the same process for the 4 EMA assessment time points on 2/5/2020, 2/6/2020, etc.

This has pushed me to the limit of my R skills, so any pointers would be extremely helpful! I'm most familiar with the tidyverse but am comfortable using base R as well. Thanks in advance for all advice.

I would left_join ema_df onto pedometer_df by ID and Time. This way you get every row of pedometer_df, with missing values for X and Y (which I assume identify each assessment) whenever the minute is not an EMA assessment time.

I then fill those missing values with the next available ones (i.e. the X and Y of the next EMA assessment), and finally group_by ID, X and Y and summarise, keeping the datetime of the assessment (the max) and the sum of Steps.
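
For reference, here is one way the example data from the question could be set up (just a sketch: I parse Time with lubridate::mdy_hm and assume the afternoon assessments are PM, so they are written in 24-hour form):

library(dplyr)
library(lubridate)

# Pedometer data from the question
pedometer_df <- tibble(
  ID    = rep(c(1, 2), each = 5),
  Steps = c(15, 23, 76, 32, 45, 16, 17, 0, 5, 8),
  Time  = mdy_hm(c("2/4/2020 8:32", "2/4/2020 8:33", "2/4/2020 8:34",
                   "2/4/2020 8:35", "2/4/2020 8:36",
                   "2/4/2020 8:32", "2/4/2020 8:33", "2/4/2020 8:34",
                   "2/4/2020 8:35", "2/4/2020 8:36"))
)

# EMA data from the question (afternoon times assumed to be PM)
ema_df <- tibble(
  ID   = rep(c(1, 2), each = 4),
  Time = mdy_hm(c("2/4/2020 8:36", "2/4/2020 12:01", "2/4/2020 15:30", "2/4/2020 18:45",
                  "2/4/2020 8:35", "2/4/2020 12:05", "2/4/2020 15:39", "2/4/2020 18:55")),
  X = c(3, 3, 4, 7, 4, 5, 1, 8),
  Y = c(4, 5, 5, 8, 6, 7, 3, 3)
)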

library(dplyr)
library(tidyr)

pedometer_df %>%
  left_join(ema_df, by = c("ID", "Time")) %>%
  group_by(ID) %>%                      # keep the fill within each participant
  fill(X, Y, .direction = "up") %>%     # carry each assessment's X/Y back over the preceding minutes
  group_by(ID, X, Y) %>%
  summarise(
    Time  = max(Time),
    Steps = sum(Steps),
    .groups = "drop"
  )
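
With the sketch data above, this returns 191 steps for ID 1 at 8:36 and 38 steps for ID 2 at 8:35, matching the totals in the question; steps recorded after an ID's last assessment end up in a trailing row with NA for X and Y. Note that this relies on each combination of ID, X and Y identifying a single assessment, as assumed above.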

Here's a solution using rolling joins from data.table. The basic idea is to roll each time from the pedometer data up to the next time in the EMA data (while still matching on ID). Once the next EMA time is found, all that's left is to isolate the X and Y values and sum up Steps.

Data creation and prep:

library(data.table)

# Simulated pedometer data: 500 evenly spaced step counts per ID
pedometer <- data.table(ID = sort(rep(1:2, 500)),
                        Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 09:35:00 EST"),
                                              as.POSIXct("2020-02-08 17:00:00 EST"), length.out = 500), 2),
                        Steps = rpois(1000, 25))

# Simulated EMA data: 4 assessments per day for 5 days, per ID
EMA <- data.table(ID = sort(rep(1:2, 4*5)),
                  Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 05:00:00 EST"),
                                        as.POSIXct("2020-02-08 23:59:59 EST"), by = '6 hours'), 2),
                  X = sample(1:8, 2*4*5, replace = TRUE),
                  Y = sample(1:8, 2*4*5, replace = TRUE))

setkey(pedometer, Time)
setkey(EMA, Time)

# Keep a copy of the assessment time, since the rolling join below
# overwrites the join column Time with the pedometer times
EMA[, next_ema_time := Time]

And now the actual join and summation:

# Roll each pedometer row up to the next EMA assessment time within ID
# (roll = -Inf rolls backwards, i.e. "next observation carried backward")
joined <- EMA[pedometer, 
              on = .(ID, Time), 
              roll = -Inf, 
              j = .(ID, Time, Steps, next_ema_time, X, Y)]

# Sum steps per ID and assessment; X and Y are constant within each group,
# so min() simply keeps their value
result <- joined[,.('X' = min(X),
                    'Y' = min(Y),
                    'Steps' = sum(Steps)),
                 .(ID, next_ema_time)]
result
#>     ID       next_ema_time X Y Steps
#>  1:  1 2020-02-04 11:00:00 1 2   167
#>  2:  2 2020-02-04 11:00:00 8 5   169
#>  3:  1 2020-02-04 17:00:00 3 6   740
#>  4:  2 2020-02-04 17:00:00 4 6   747
#>  5:  1 2020-02-04 23:00:00 2 2   679
#>  6:  2 2020-02-04 23:00:00 3 2   732
#>  7:  1 2020-02-05 05:00:00 7 5   720
#>  8:  2 2020-02-05 05:00:00 6 8   692
#>  9:  1 2020-02-05 11:00:00 2 4   731
#> 10:  2 2020-02-05 11:00:00 4 5   773
#> 11:  1 2020-02-05 17:00:00 1 5   757
#> 12:  2 2020-02-05 17:00:00 3 5   743
#> 13:  1 2020-02-05 23:00:00 3 8   693
#> 14:  2 2020-02-05 23:00:00 1 8   740
#> 15:  1 2020-02-06 05:00:00 8 8   710
#> 16:  2 2020-02-06 05:00:00 3 2   760
#> 17:  1 2020-02-06 11:00:00 8 4   716
#> 18:  2 2020-02-06 11:00:00 1 2   688
#> 19:  1 2020-02-06 17:00:00 5 2   738
#> 20:  2 2020-02-06 17:00:00 4 6   724
#> 21:  1 2020-02-06 23:00:00 7 8   737
#> 22:  2 2020-02-06 23:00:00 6 3   672
#> 23:  1 2020-02-07 05:00:00 2 6   726
#> 24:  2 2020-02-07 05:00:00 7 7   759
#> 25:  1 2020-02-07 11:00:00 1 4   737
#> 26:  2 2020-02-07 11:00:00 5 2   737
#> 27:  1 2020-02-07 17:00:00 3 5   766
#> 28:  2 2020-02-07 17:00:00 4 4   745
#> 29:  1 2020-02-07 23:00:00 3 3   714
#> 30:  2 2020-02-07 23:00:00 2 1   741
#> 31:  1 2020-02-08 05:00:00 4 6   751
#> 32:  2 2020-02-08 05:00:00 8 2   723
#> 33:  1 2020-02-08 11:00:00 3 3   716
#> 34:  2 2020-02-08 11:00:00 3 6   735
#> 35:  1 2020-02-08 17:00:00 1 5   696
#> 36:  2 2020-02-08 17:00:00 7 7   741
#>     ID       next_ema_time X Y Steps

Created on 2020-02-04 by the reprex package (v0.3.0)
