
How to apply a function to a subset of data, where the subset is specified by another data.table?

I have a data.table l1 with three columns: Posixct (the timestamp), group_cor (my value) and Minute (the minute each timestamp falls in). I would like to count the unique values of group_cor within the time intervals listed in a second data.table, s1. My original dataset has about 1,500,000 rows spanning approximately 12 days (structured like l1), so I am looking for a fast way to go through all of it.

       Posixct            group_cor   Minute
 1: 2017-08-11 13:31:36       185     2017-08-11 13:31:00
 2: 2017-08-11 13:31:36       185     2017-08-11 13:31:00
 3: 2017-08-11 13:31:36       185     2017-08-11 13:31:00
 4: 2017-08-11 13:31:37       186     2017-08-11 13:31:00
 5: 2017-08-11 13:31:37       186     2017-08-11 13:31:00
 6: 2017-08-11 13:31:37       187     2017-08-11 13:31:00
 7: 2017-08-11 13:31:37       187     2017-08-11 13:31:00
 8: 2017-08-11 13:31:37       187     2017-08-11 13:31:00
 9: 2017-08-11 13:31:37       187     2017-08-11 13:31:00

This is s1, where start marks the beginning of each time interval and end marks its end. Each interval is one minute long, and the window is moved along one second at a time.

                     start                 end
  1: 2017-08-11 13:31:36 2017-08-11 13:32:36
  2: 2017-08-11 13:31:37 2017-08-11 13:32:37
  3: 2017-08-11 13:31:38 2017-08-11 13:32:38
  4: 2017-08-11 13:31:39 2017-08-11 13:32:39
  5: 2017-08-11 13:31:40 2017-08-11 13:32:40   
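
For reference, a table shaped like s1 could be generated along the following lines. This is only a sketch: make_windows and s1_sketch are hypothetical names that do not appear in the original post, and the UTC time zone is an assumption.

```r
library(data.table)

# Hypothetical helper (not from the original post): build fixed-width windows
# whose start advances by a fixed step, here 60-second windows moved by 1 second.
make_windows <- function(first_start, n_windows, width_secs = 60, step_secs = 1) {
  start <- first_start + step_secs * (seq_len(n_windows) - 1)
  data.table(start = start, end = start + width_secs)
}

# Five windows starting at the first timestamp shown in the sample (UTC assumed).
s1_sketch <- make_windows(as.POSIXct("2017-08-11 13:31:36", tz = "UTC"), 5)
print(s1_sketch)
```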

I tried adding a column No to s1 with a data.table join, using the "on" argument to specify the time window.

oma <- function(x) length(unique(x))
s1[ l1, No:=oma(group_cor), on=c('start<Posixct','end>=Posixct')]

However, this gives

> s1
               start                 end      No
  1: 2017-08-11 13:31:36 2017-08-11 13:32:36 188
  2: 2017-08-11 13:31:37 2017-08-11 13:32:37 188
  3: 2017-08-11 13:31:38 2017-08-11 13:32:38 188
  4: 2017-08-11 13:31:39 2017-08-11 13:32:39 188
  5: 2017-08-11 13:31:40 2017-08-11 13:32:40 188 

The No column is 188 for every time window, which cannot be right (and I don't know where this value comes from).

> range(s1$No)
 [1] 188 188   

I know the number of unique values for each minute, and the new No column should be similar to these:

> tapply(l1$group_cor, l1$Minute,oma)
2017-08-11 13:31:00 2017-08-11 13:32:00 2017-08-11 13:33:00 2017-08-11 13:34:00 
             11                  17                  18                  17 
2017-08-11 13:35:00 2017-08-11 13:36:00 2017-08-11 13:37:00 2017-08-11 13:38:00 
             21                  22                  23                  22 
2017-08-11 13:39:00 2017-08-11 13:40:00 
             20                  22     

What am I doing wrong? Any help would be highly appreciated, as would suggestions for other ways to do this. Thank you very much.

If I understand you correctly, and as Frank mentioned in the comments, you are looking for

intvl[dat, cnt := uniqueN(group_cor), by=.EACHI, on=c('start<Posixct','end>=Posixct')][, 
   cnt := replace(cnt, is.na(cnt), 0L)]

output:

                 start                 end cnt
1: 2017-08-11 13:31:36 2017-08-11 13:32:36   1
2: 2017-08-11 13:31:37 2017-08-11 13:32:37   0
3: 2017-08-11 13:31:38 2017-08-11 13:32:38   0
4: 2017-08-11 13:31:39 2017-08-11 13:32:39   0
5: 2017-08-11 13:31:40 2017-08-11 13:32:40   0
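
A note on this result: with intvl[dat, ..., by=.EACHI] the groups are the rows of dat, so uniqueN(group_cor) only ever sees one value per group. That would explain a count of 1 for the first window even though two distinct values (186 and 187) fall inside it. If the goal is the number of distinct values per window, one possible alternative is to join in the other direction, so that each group is a row of intvl. The sketch below rebuilds the sample data with an assumed UTC time zone; res and check are names introduced here for illustration only.

```r
library(data.table)

# Sample rows from the question, rebuilt directly (UTC is an assumption).
t0 <- as.POSIXct("2017-08-11 13:31:36", tz = "UTC")
dat <- data.table(
  Posixct   = t0 + c(0, 0, 0, 1, 1, 1, 1, 1, 1),
  group_cor = c(185L, 185L, 185L, 186L, 186L, 187L, 187L, 187L, 187L)
)
intvl <- data.table(start = t0 + 0:4)
intvl[, end := start + 60]

# i = intvl, so by = .EACHI forms one group per window.
# na.rm = TRUE makes a window with no matching rows count as 0 instead of 1.
res <- dat[intvl,
           .(cnt = uniqueN(group_cor, na.rm = TRUE)),
           on = .(Posixct > start, Posixct <= end),
           by = .EACHI]
intvl[, cnt := res$cnt]

# Slow base-R cross-check; only sensible on small data.
check <- vapply(seq_len(nrow(intvl)), function(i) {
  in_window <- dat$Posixct > intvl$start[i] & dat$Posixct <= intvl$end[i]
  length(unique(dat$group_cor[in_window]))
}, integer(1))
stopifnot(all(intvl$cnt == check))
print(intvl)
```

The window bounds are kept as in the original call (start exclusive, end inclusive). The base-R loop is there only to confirm the join on a toy example and would be far too slow for 1.5 million rows.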

data:

library(data.table)
dat <- fread("Posixct,group_cor,Minute
2017-08-11 13:31:36,185,2017-08-11 13:31:00
2017-08-11 13:31:36,185,2017-08-11 13:31:00
2017-08-11 13:31:36,185,2017-08-11 13:31:00
2017-08-11 13:31:37,186,2017-08-11 13:31:00
2017-08-11 13:31:37,186,2017-08-11 13:31:00
2017-08-11 13:31:37,187,2017-08-11 13:31:00
2017-08-11 13:31:37,187,2017-08-11 13:31:00
2017-08-11 13:31:37,187,2017-08-11 13:31:00
2017-08-11 13:31:37,187,2017-08-11 13:31:00")
cols <- c("Posixct", "Minute")
dat[, (cols) := lapply(.SD, as.POSIXct, format="%Y-%m-%d %H:%M:%S"), .SDcols=cols]

intvl <- fread("start,end
2017-08-11 13:31:36,2017-08-11 13:32:36
2017-08-11 13:31:37,2017-08-11 13:32:37
2017-08-11 13:31:38,2017-08-11 13:32:38
2017-08-11 13:31:39,2017-08-11 13:32:39
2017-08-11 13:31:40,2017-08-11 13:32:40")
cols <- c("start", "end")
intvl[, (cols) := lapply(.SD, as.POSIXct, format="%Y-%m-%d %H:%M:%S"), .SDcols=cols]

I think the reason it did not work for you before is that you had too many different variables in your R session. It may help to restart the session and start from clean data and interval tables.
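
Another possible explanation for the constant value in the question, independent of session state, is the missing by=.EACHI: without it, j is evaluated a single time for the whole join, so oma(group_cor) returns one number that is then recycled to every matched row. The sketch below reproduces that pattern on toy data; l1_toy, s1_toy and the 2-second windows are assumptions made for illustration and are not the asker's data.

```r
library(data.table)

# Toy data (an assumption for illustration, not the asker's data).
t0 <- as.POSIXct("2017-08-11 13:31:36", tz = "UTC")
l1_toy <- data.table(Posixct   = t0 + 0:5,
                     group_cor = c(185L, 186L, 187L, 187L, 188L, 189L))
s1_toy <- data.table(start = t0 + 0:2)
s1_toy[, end := start + 2]   # short 2-second windows keep the example small

oma <- function(x) length(unique(x))

# Same call shape as in the question: no `by`, so j is evaluated only once
# and the single result is assigned to every matched window.
s1_toy[l1_toy, No := oma(group_cor), on = c("start<Posixct", "end>=Posixct")]
print(s1_toy)
```

Every window in s1_toy ends up with the same No, which mirrors the identical 188 values reported in the question.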


 