
How to apply a function to a subset of data, where the subset is specified by another data.table?

I have a data.table l1 with three columns: Posixct and Minute for the time, and group_cor for my value. I would like to calculate the number of unique values of group_cor within the time intervals given by another data.table, s1. My original dataset has about 1,500,000 rows covering approximately 12 days (structured like l1), so I am looking for a fast method to go through all this data.

               Posixct group_cor              Minute
1: 2017-08-11 13:31:36       185 2017-08-11 13:31:00
2: 2017-08-11 13:31:36       185 2017-08-11 13:31:00
3: 2017-08-11 13:31:36       185 2017-08-11 13:31:00
4: 2017-08-11 13:31:37       186 2017-08-11 13:31:00
5: 2017-08-11 13:31:37       186 2017-08-11 13:31:00
6: 2017-08-11 13:31:37       187 2017-08-11 13:31:00
7: 2017-08-11 13:31:37       187 2017-08-11 13:31:00
8: 2017-08-11 13:31:37       187 2017-08-11 13:31:00
9: 2017-08-11 13:31:37       187 2017-08-11 13:31:00

This is s1, where start marks the beginning of each time interval and end marks its end. Each interval is one minute long, and the window is moved along one second at a time.

                     start                 end
  1: 2017-08-11 13:31:36 2017-08-11 13:32:36
  2: 2017-08-11 13:31:37 2017-08-11 13:32:37
  3: 2017-08-11 13:31:38 2017-08-11 13:32:38
  4: 2017-08-11 13:31:39 2017-08-11 13:32:39
  5: 2017-08-11 13:31:40 2017-08-11 13:32:40   

I have tried using a data.table join to add a column No to s1, with the "on" argument specifying the time window.

oma <- function(x) length(unique(x))
s1[ l1, No:=oma(group_cor), on=c('start<Posixct','end>=Posixct')]

However, this gives:

> s1
               start                 end      No
  1: 2017-08-11 13:31:36 2017-08-11 13:32:36 188
  2: 2017-08-11 13:31:37 2017-08-11 13:32:37 188
  3: 2017-08-11 13:31:38 2017-08-11 13:32:38 188
  4: 2017-08-11 13:31:39 2017-08-11 13:32:39 188
  5: 2017-08-11 13:31:40 2017-08-11 13:32:40 188 

The No column is 188 for all the time windows, which is not correct (and I don't know where this value comes from).

> range(s1$No)
 [1] 188 188   

I know the number of unique values for each minute, and the new No values should be similar to them:

> tapply(l1$group_cor, l1$Minute,oma)
2017-08-11 13:31:00 2017-08-11 13:32:00 2017-08-11 13:33:00 2017-08-11 13:34:00 
                 11                  17                  18                  17 
2017-08-11 13:35:00 2017-08-11 13:36:00 2017-08-11 13:37:00 2017-08-11 13:38:00 
                 21                  22                  23                  22 
2017-08-11 13:39:00 2017-08-11 13:40:00 
                 20                  22 

What am I doing wrong? Any help would be highly appreciated, as would suggestions for doing this another way. Thank you very much.

If I understand you correctly, and as Frank mentioned in the comments, you are looking for:

intvl[dat, cnt := uniqueN(group_cor), by=.EACHI, on=c('start<Posixct','end>=Posixct')][, 
   cnt := replace(cnt, is.na(cnt), 0L)]

output:

                 start                 end cnt
1: 2017-08-11 13:31:36 2017-08-11 13:32:36   1
2: 2017-08-11 13:31:37 2017-08-11 13:32:37   0
3: 2017-08-11 13:31:38 2017-08-11 13:32:38   0
4: 2017-08-11 13:31:39 2017-08-11 13:32:39   0
5: 2017-08-11 13:31:40 2017-08-11 13:32:40   0
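
One thing worth checking against the output above: in x[i, ..., by=.EACHI] the grouping is per row of i, which here is dat, so uniqueN(group_cor) only ever sees a single value. If the goal is the number of distinct group_cor values per interval, a possible variant is to flip the join so that the intervals are i. The following is a minimal sketch under that assumption, not a tested solution for the full 1.5 million rows; the name per_window and the UTC time zone are illustrative choices, not part of the original post.

```r
library(data.table)

# Sample data from the question, rebuilt so this block runs on its own
dat <- data.table(
  Posixct = as.POSIXct("2017-08-11 13:31:36", tz = "UTC") + c(0, 0, 0, 1, 1, 1, 1, 1, 1),
  group_cor = c(185L, 185L, 185L, 186L, 186L, 187L, 187L, 187L, 187L)
)
starts <- as.POSIXct("2017-08-11 13:31:36", tz = "UTC") + 0:4
intvl <- data.table(start = starts, end = starts + 60)

# dat is x and intvl is i, so by = .EACHI evaluates j once per interval.
# Unmatched intervals see a single NA, which na.rm = TRUE turns into a count of 0.
per_window <- dat[intvl,
                  on = .(Posixct > start, Posixct <= end),
                  .(cnt = uniqueN(group_cor, na.rm = TRUE)),
                  by = .EACHI]

# Results come back in the row order of intvl, so they can be attached directly
intvl[, cnt := per_window$cnt]

# Brute-force check of the same counts, one interval at a time
check <- vapply(seq_len(nrow(intvl)), function(k) {
  uniqueN(dat[Posixct > intvl$start[k] & Posixct <= intvl$end[k], group_cor])
}, integer(1))
stopifnot(identical(intvl$cnt, check))
```

On this sample the first interval contains the values 186 and 187, so its count is 2, and the remaining intervals contain no rows. The brute-force loop is only there as a sanity check and would be too slow for the full dataset.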

data:

library(data.table)
dat <- fread("Posixct,group_cor,Minute
2017-08-11 13:31:36,185,2017-08-11 13:31:00
2017-08-11 13:31:36,185,2017-08-11 13:31:00
2017-08-11 13:31:36,185,2017-08-11 13:31:00
2017-08-11 13:31:37,186,2017-08-11 13:31:00
2017-08-11 13:31:37,186,2017-08-11 13:31:00
2017-08-11 13:31:37,187,2017-08-11 13:31:00
2017-08-11 13:31:37,187,2017-08-11 13:31:00
2017-08-11 13:31:37,187,2017-08-11 13:31:00
2017-08-11 13:31:37,187,2017-08-11 13:31:00")
cols <- c("Posixct", "Minute")
dat[, (cols) := lapply(.SD, as.POSIXct, format="%Y-%m-%d %H:%M:%S"), .SDcols=cols]

intvl <- fread("start,end
2017-08-11 13:31:36,2017-08-11 13:32:36
2017-08-11 13:31:37,2017-08-11 13:32:37
2017-08-11 13:31:38,2017-08-11 13:32:38
2017-08-11 13:31:39,2017-08-11 13:32:39
2017-08-11 13:31:40,2017-08-11 13:32:40")
cols <- c("start", "end")
intvl[, (cols) := lapply(.SD, as.POSIXct, format="%Y-%m-%d %H:%M:%S"), .SDcols=cols]
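
The five intervals above follow the pattern described in the question: one-minute windows whose start advances by one second. For a longer period they could be generated rather than typed out. This is a small sketch of that idea; the names first_start, starts and intvl_gen, the window count of 5, and the UTC time zone are assumptions made for illustration.

```r
library(data.table)

# Hypothetical starting point; in practice this could be min(dat$Posixct)
first_start <- as.POSIXct("2017-08-11 13:31:36", tz = "UTC")

# One window start per second
starts <- seq(first_start, by = "1 sec", length.out = 5)

# Each window ends 60 seconds after it starts
intvl_gen <- data.table(start = starts, end = starts + 60)
print(intvl_gen)
```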

I think the reason you couldn't get this to work before is that you had too many different variables in your R session. It would help to restart the session and use clean data and intervals.
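
Another possible explanation for the constant value in the question is the missing by argument: without by, j is evaluated once for the whole join, so a function that returns a single number produces one value that is recycled into every matched row. The sketch below reproduces that behaviour on the sample data; it assumes the UTC time zone and uses uniqueN in place of the question's oma helper.

```r
library(data.table)

# Sample data from the question, rebuilt so this block runs on its own
dat <- data.table(
  Posixct = as.POSIXct("2017-08-11 13:31:36", tz = "UTC") + c(0, 0, 0, 1, 1, 1, 1, 1, 1),
  group_cor = c(185L, 185L, 185L, 186L, 186L, 187L, 187L, 187L, 187L)
)
starts <- as.POSIXct("2017-08-11 13:31:36", tz = "UTC") + 0:4
s1 <- data.table(start = starts, end = starts + 60)

# No `by` here: j runs once, and its single result fills every matched row of s1
s1[dat, No := uniqueN(group_cor), on = .(start < Posixct, end >= Posixct)]

# Number of distinct non-missing values in No; it cannot exceed 1
n_distinct_no <- uniqueN(s1$No, na.rm = TRUE)
print(s1)
```

Rows of s1 that match nothing stay NA, and every matched row carries the same number, which is consistent with the question's output where all windows show the same value.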
