Aggregating data based on two different assessment methods in R

Question

I'm looking to aggregate some pedometer data, gathered in steps per minute, so I get a summed number of steps up until an EMA assessment. The EMA assessments happened four times per day. Examples of the two data sets are:

Pedometer Data

ID Steps      Time
1   15   2/4/2020 8:32
1   23   2/4/2020 8:33
1   76   2/4/2020 8:34
1   32   2/4/2020 8:35
1   45   2/4/2020 8:36
...
2   16   2/4/2020 8:32
2   17   2/4/2020 8:33
2   0    2/4/2020 8:34
2   5    2/4/2020 8:35
2   8    2/4/2020 8:36

EMA Data

ID      Time      X Y
1  2/4/2020 8:36  3 4
1  2/4/2020 12:01 3 5
1  2/4/2020 3:30  4 5
1  2/4/2020 6:45  7 8
...
2  2/4/2020 8:35  4 6
2  2/4/2020 12:05 5 7
2  2/4/2020 3:39  1 3
2  2/4/2020 6:55  8 3

I'm looking to add the pedometer data to the EMA data as a new variable, where the number of steps taken is summed until the next EMA assessment. Ideally it would look something like:

Combined Data

ID      Time      X Y Steps
1  2/4/2020 8:36  3 4 191
1  2/4/2020 12:01 3 5 [Sum of steps taken from 8:37 until 12:01 on 2/4/2020]
1  2/4/2020 3:30  4 5 [Sum of steps taken from 12:02 until 3:30 on 2/4/2020]
1  2/4/2020 6:45  7 8 [Sum of steps taken from 3:31 until 6:45 on 2/4/2020]
...
2  2/4/2020 8:35  4 6 38
2  2/4/2020 12:05 5 7 [Sum of steps taken from 8:36 until 12:05 on 2/4/2020]
2  2/4/2020 3:39  1 3 [Sum of steps taken from 12:06 until 3:39 on 2/4/2020]
2  2/4/2020 6:55  8 3 [Sum of steps taken from 3:40 until 6:55 on 2/4/2020]

I then need the process to continue over the entire 21-day EMA period, so the same process applies to the 4 EMA assessment time points on 2/5/2020, 2/6/2020, etc.

This has pushed me to the limit of my R skills, so any pointers would be extremely helpful! I'm most familiar with the tidyverse but am comfortable using base R as well. Thanks in advance for all advice.

Answer 1

I would left_join ema_df onto pedometer_df by ID and Time. This way you get all rows of pedometer_df, with missing values for X and Y (which I assume are identifiers) wherever the row is not an EMA assessment time. Both Time columns need to be the same type and match exactly for the join to work.

I then fill the missing values with the next available ones (so the X and Y of the next EMA assessment) and finally group_by ID, X and Y and summarise to keep the datetime of the assessment (the max) and the sum of steps.

library(dplyr)
library(tidyr)

pedometer_df %>%
  # keep every pedometer row; X and Y are NA except at assessment times
  left_join(ema_df, by = c("ID", "Time")) %>%
  # fill within each ID so values never leak across participants
  group_by(ID) %>%
  fill(X, Y, .direction = "up") %>%
  # assumes each (ID, X, Y) combination identifies a single assessment
  group_by(ID, X, Y) %>%
  summarise(
    Time = max(Time),
    Steps = sum(Steps)
  )
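The approach above can be tried on the rows shown in the question. This is only a sketch under a few assumptions: the time strings are parsed with the format "%m/%d/%Y %H:%M" in UTC, only the first assessment per ID is included, and a helper column named ema_time (not in the original answer) is used for grouping instead of X and Y, so that two assessments with identical X and Y values are not merged.

```r
library(dplyr)
library(tidyr)

# Assumed format of the time strings shown in the question
fmt <- "%m/%d/%Y %H:%M"

# Pedometer rows copied from the question (8:32 to 8:36 for both IDs)
pedometer_df <- data.frame(
  ID    = rep(1:2, each = 5),
  Steps = c(15, 23, 76, 32, 45, 16, 17, 0, 5, 8),
  Time  = as.POSIXct(rep(paste0("2/4/2020 8:", 32:36), 2), format = fmt, tz = "UTC")
)

# First EMA assessment of each ID, copied from the question
ema_df <- data.frame(
  ID   = 1:2,
  Time = as.POSIXct(c("2/4/2020 8:36", "2/4/2020 8:35"), format = fmt, tz = "UTC"),
  X    = c(3, 4),
  Y    = c(4, 6)
)

combined <- pedometer_df %>%
  # ema_time is a hypothetical helper column holding the assessment time
  left_join(mutate(ema_df, ema_time = Time), by = c("ID", "Time")) %>%
  group_by(ID) %>%
  # carry the next assessment's values back over the preceding minutes
  fill(ema_time, X, Y, .direction = "up") %>%
  # minutes after the last assessment have nothing to fill from
  filter(!is.na(ema_time)) %>%
  group_by(ID, ema_time, X, Y) %>%
  summarise(Steps = sum(Steps)) %>%
  ungroup()

combined$Steps
# 191 for ID 1 and 38 for ID 2, matching the totals in the question
```

The 8 steps recorded for ID 2 at 8:36 fall after that ID's 8:35 assessment, so in the full data they would count toward the 12:05 assessment; here they are dropped by the filter because no later assessment is included.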

Answer 2

Here's a solution using rolling joins from data.table. The basic idea is to roll each time from the pedometer data up to the next time in the EMA data (while still matching on ID). Once the next EMA time is found, all that's left is to isolate the X and Y values and sum up Steps.

Data creation and prep:

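library(data.table)

# Simulated pedometer readings: 500 per ID, evenly spaced over the period.
# No seed is set, so the random draws will differ from the output shown below.
pedometer <- data.table(ID = sort(rep(1:2, 500)), 
                        Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 09:35:00 EST"), 
                                              as.POSIXct("2020-02-08 17:00:00 EST"), length.out = 500), 2),
                        Steps = rpois(1000, 25))

# Simulated EMA assessments: one every 6 hours per ID
EMA <- data.table(ID = sort(rep(1:2, 4*5)),
                  Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 05:00:00 EST"), 
                                        as.POSIXct("2020-02-08 23:59:59 EST"), by = '6 hours'), 2),
                  X = sample(1:8, 2*4*5, replace = TRUE),
                  Y = sample(1:8, 2*4*5, replace = TRUE))
setkey(pedometer, Time)
setkey(EMA, Time)
# Copy the EMA time: after the join, Time holds the pedometer times
EMA[,next_ema_time := Time]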

And now the actual join and summation:

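# roll = -Inf matches each pedometer time to the next EMA time of the same ID
joined <- EMA[pedometer, 
              on = .(ID, Time), 
              roll = -Inf, 
              j = .(ID, Time, Steps, next_ema_time, X, Y)]
# X and Y are constant within each group, so min() simply picks that value
result <- joined[,.('X' = min(X),
                    'Y' = min(Y),
                    'Steps' = sum(Steps)),
                 .(ID, next_ema_time)]
result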
#>     ID       next_ema_time X Y Steps
#>  1:  1 2020-02-04 11:00:00 1 2   167
#>  2:  2 2020-02-04 11:00:00 8 5   169
#>  3:  1 2020-02-04 17:00:00 3 6   740
#>  4:  2 2020-02-04 17:00:00 4 6   747
#>  5:  1 2020-02-04 23:00:00 2 2   679
#>  6:  2 2020-02-04 23:00:00 3 2   732
#>  7:  1 2020-02-05 05:00:00 7 5   720
#>  8:  2 2020-02-05 05:00:00 6 8   692
#>  9:  1 2020-02-05 11:00:00 2 4   731
#> 10:  2 2020-02-05 11:00:00 4 5   773
#> 11:  1 2020-02-05 17:00:00 1 5   757
#> 12:  2 2020-02-05 17:00:00 3 5   743
#> 13:  1 2020-02-05 23:00:00 3 8   693
#> 14:  2 2020-02-05 23:00:00 1 8   740
#> 15:  1 2020-02-06 05:00:00 8 8   710
#> 16:  2 2020-02-06 05:00:00 3 2   760
#> 17:  1 2020-02-06 11:00:00 8 4   716
#> 18:  2 2020-02-06 11:00:00 1 2   688
#> 19:  1 2020-02-06 17:00:00 5 2   738
#> 20:  2 2020-02-06 17:00:00 4 6   724
#> 21:  1 2020-02-06 23:00:00 7 8   737
#> 22:  2 2020-02-06 23:00:00 6 3   672
#> 23:  1 2020-02-07 05:00:00 2 6   726
#> 24:  2 2020-02-07 05:00:00 7 7   759
#> 25:  1 2020-02-07 11:00:00 1 4   737
#> 26:  2 2020-02-07 11:00:00 5 2   737
#> 27:  1 2020-02-07 17:00:00 3 5   766
#> 28:  2 2020-02-07 17:00:00 4 4   745
#> 29:  1 2020-02-07 23:00:00 3 3   714
#> 30:  2 2020-02-07 23:00:00 2 1   741
#> 31:  1 2020-02-08 05:00:00 4 6   751
#> 32:  2 2020-02-08 05:00:00 8 2   723
#> 33:  1 2020-02-08 11:00:00 3 3   716
#> 34:  2 2020-02-08 11:00:00 3 6   735
#> 35:  1 2020-02-08 17:00:00 1 5   696
#> 36:  2 2020-02-08 17:00:00 7 7   741
#>     ID       next_ema_time X Y Steps

Created on 2020-02-04 by the reprex package (v0.3.0)
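To see the rolling join on the asker's own numbers rather than simulated data, here is a minimal sketch using the rows shown in the question. It assumes the time strings parse with the format "%m/%d/%Y %H:%M" in UTC and includes only the first assessment per ID; X[1] and Y[1] are used in place of min() purely for illustration.

```r
library(data.table)

# Assumed format of the time strings shown in the question
fmt <- "%m/%d/%Y %H:%M"

# Pedometer rows copied from the question (8:32 to 8:36 for both IDs)
pedometer <- data.table(
  ID    = rep(1:2, each = 5),
  Time  = as.POSIXct(rep(paste0("2/4/2020 8:", 32:36), 2), format = fmt, tz = "UTC"),
  Steps = c(15, 23, 76, 32, 45, 16, 17, 0, 5, 8)
)

# First EMA assessment of each ID, copied from the question
EMA <- data.table(
  ID   = 1:2,
  Time = as.POSIXct(c("2/4/2020 8:36", "2/4/2020 8:35"), format = fmt, tz = "UTC"),
  X    = c(3, 4),
  Y    = c(4, 6)
)
EMA[, next_ema_time := Time]

# Each pedometer row is matched to the next EMA row of the same ID;
# rows with no later assessment get NA in the EMA columns
joined <- EMA[pedometer, on = .(ID, Time), roll = -Inf]

# Drop unmatched rows, then sum steps per assessment
result <- joined[!is.na(next_ema_time),
                 .(X = X[1], Y = Y[1], Steps = sum(Steps)),
                 by = .(ID, next_ema_time)]

result$Steps
# 191 for ID 1 and 38 for ID 2, matching the totals in the question
```

The reading for ID 2 at 8:36 has no later assessment in this toy data, so it ends up with NA in next_ema_time and is filtered out; with all four daily assessments present it would roll to the 12:05 row instead.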

Answer 1
0 2020-02-04 14:23:55

Answer 2
0 Accepted 2020-02-04 14:59:56
