
Grouping a data.table by running intervals

I am using R with the data.table package and would like to group a data.table by running (time) intervals, i.e. overlapping bins. Within each of these running intervals I want to count the occurrences of equal pairs of data. Furthermore, these "equal pairs" need not be exactly equal; they only have to fall within some range of each other.

The simple version of the question is as follows:

#Time   X   Y counts
# ... ... ...      1
#I would like to do:
DT[, sum(counts), by = list(Time, X, Y)]
#with Time, X and Y being in overlapping intervals.

findInterval() would give me bins with "hard borders", not overlapping ones.
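To illustrate the "hard borders" point, here is a minimal sketch using base R's `findInterval()`. The `breaks` vector is an arbitrary choice made for this illustration only; it is not part of the question.

```r
# Time values from the example data below
Time <- c(1, 1, 2, 4, 5, 5, 6, 7, 8, 8, 8, 8, 12, 13)

# Hypothetical, fixed bin edges chosen only for illustration
breaks <- c(0, 5, 10, 15)

# Each value is assigned to exactly one bin: breaks[i] <= x < breaks[i + 1]
bins <- findInterval(Time, breaks)
bins
```

Time values 4 and 5 are only one unit apart but land in different bins, which is exactly what a running window centered on each row is meant to avoid.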

The problem in more detail: let's say I have a data.table like this:

library(data.table)
Time    <- c(1,1,2,4,5,5,6,7,8,8,8,8,12,13)
#repeated Time values are allowed.
X       <- c(6,6,7,10,5,7,6,3,9,10,6,3,3,6)
Y       <- c(2,6,10,3,4,6,6,9,4,9,6,6,9,9)
DT      <- data.table(Time, X, Y)

    Time  X  Y
 1:    1  6  2
 2:    1  6  6
 3:    2  7 10
 4:    4 10  3
 5:    5  5  4
 6:    5  7  6
 7:    6  6  6
 8:    7  3  9
 9:    8  9  4
10:    8 10  9
11:    8  6  6
12:    8  3  6
13:   12  3  9
14:   13  6  9

And some predefined interval sizes:

Timeinterval      <- 5
#for a time value of 10 this means to look from 10-5 to 10+5
RangeX.percentage <- 0.5 
RangeY.percentage <- 0.5

The result should be an additional column, let's call it "count", holding the number of occurrences of equal pairs of X and Y, where "equal" takes the ranges for X and Y into account.

I thought about some kind of grouping by time intervals like

c(1, 1, 2, 4, 5, 5, 6) # for the first and second items: (1-5):(1+5)
c(1, 1, 2, 4, 5, 5, 6, 7) # for the third item: (2-5):(2+5)
c(1, 1, 2, 4, 5, 5, 6, 7, 8, 8, 8, 8) # for the fourth item: (4-5):(4+5)
#...
c(8, 8, 8, 8, 12, 13) # for the last item: (13-5):(13+5)
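The per-row windows can be sketched in base R as follows. This is only an illustration of the grouping idea, using inclusive bounds as in the listing above; `windows` is a name introduced here, and building one vector per row like this would not scale to a million rows.

```r
Time <- c(1, 1, 2, 4, 5, 5, 6, 7, 8, 8, 8, 8, 12, 13)
Timeinterval <- 5

# For every row, collect all Time values within +/- Timeinterval (inclusive)
windows <- lapply(Time, function(t) {
  Time[Time >= t - Timeinterval & Time <= t + Timeinterval]
})

windows[[1]]   # window around the first item (Time = 1)
windows[[14]]  # window around the last item (Time = 13)
```

Because each row gets its own window, a row can belong to many windows at once, which is why an ordinary `by =` grouping does not apply directly.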

and the following conditions for the data (but maybe there is a simpler version for that part too):

EDIT: To clarify what the result should look like:

Ranges <- DT[ , list(
             X * (1 + RangeX.percentage), X * (1 - RangeX.percentage),
             Y * (1 + RangeY.percentage), Y * (1 - RangeY.percentage))]
DT2 <- cbind(DT, Ranges, count = rep(1, nrow(DT)))
setnames(DT2, c("Time","X","Y","XR1","XR2","YR1","YR2","count"))
for (i in seq_len(nrow(DT2))) {
  #main part of the question: how to get this done within data.table?
  DT2.subset <- DT2[abs(Time - DT2$Time[i]) < Timeinterval]
  #subsequent comparison of X and Y (XR1/YR1 are upper, XR2/YR2 lower bounds):
  set(DT2, i = i, j = "count",
      value = sum(DT2.subset$X < DT2$XR1[i] &
                  DT2.subset$X > DT2$XR2[i] &
                  DT2.subset$Y < DT2$YR1[i] &
                  DT2.subset$Y > DT2$YR2[i]))
}
DT2
    Time  X  Y  XR1 XR2  YR1 YR2 count
 1:    1  6  2  9.0 3.0  3.0 1.0     1
 2:    1  6  6  9.0 3.0  9.0 3.0     3
 3:    2  7 10 10.5 3.5 15.0 5.0     4
 4:    4 10  3 15.0 5.0  4.5 1.5     3
 5:    5  5  4  7.5 2.5  6.0 2.0     1
 6:    5  7  6 10.5 3.5  9.0 3.0     6
 7:    6  6  6  9.0 3.0  9.0 3.0     4
 8:    7  3  9  4.5 1.5 13.5 4.5     2
 9:    8  9  4 13.5 4.5  6.0 2.0     3
10:    8 10  9 15.0 5.0 13.5 4.5     4
11:    8  6  6  9.0 3.0  9.0 3.0     4
12:    8  3  6  4.5 1.5  9.0 3.0     1
13:   12  3  9  4.5 1.5 13.5 4.5     2
14:   13  6  9  9.0 3.0 13.5 4.5     1

As my complete data.table contains more than a million rows, scanning all of DT$Time for each row is far too expensive in terms of computation time.

You could try data.table::foverlaps. We will create Ranges pretty much as you did, adding Time ranges and a row index (for later aggregation). The main issue here is that foverlaps matches intervals inclusively (<= and >=), whereas you want strict < and >, so we shrink each Time interval by 1 at both ends; this works because Time holds whole numbers here. Then we add a second Time column to DT so that every row is a zero-length interval, set the key, and run foverlaps. The final stage is to count the matching observations per row.

DT[, Time2 := Time] ## Add higher interval to DT
setkey(DT, Time, Time2) ## key (for foverlaps)

Ranges <- 
  DT[ , .(Time = Time - Timeinterval + 1, ## Add lower Time interval
          Time2 = Time + Timeinterval - 1, ## Add higher Time interval
          XR1 = X* (1 - RangeX.percentage), 
          XR2 = X* (1 + RangeX.percentage),
          YR1 = Y* (1 - RangeY.percentage), 
          YR2 = Y* (1 + RangeY.percentage),
          indx = .I)] ## Add row index

# Run foverlaps and count incidences by condition while updating DT by reference
DT[, 
   count := foverlaps(Ranges, DT)[X > XR1 & X < XR2 & Y > YR1 & Y < YR2,
                                   .N, 
                                   keyby = indx]$N]  
DT
#     Time  X  Y Time2  count
#  1:    1  6  2     1      1
#  2:    1  6  6     1      3
#  3:    2  7 10     2      4
#  4:    4 10  3     4      3
#  5:    5  5  4     5      1
#  6:    5  7  6     5      6
#  7:    6  6  6     6      4
#  8:    7  3  9     7      2
#  9:    8  9  4     8      3
# 10:    8 10  9     8      4
# 11:    8  6  6     8      4
# 12:    8  3  6     8      1
# 13:   12  3  9    12      2
# 14:   13  6  9    13      1
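As a possible alternative, and not part of the answer above, the same counts can be sketched with a non-equi self-join, which expresses the strict `<` and `>` conditions directly and so would not depend on Time being whole numbers. The helper columns `Tlo`, `Thi`, `Xlo`, `Xhi`, `Ylo`, `Yhi` and the name `res` are introduced here purely for illustration, and this assumes a data.table version that supports non-equi joins.

```r
library(data.table)

DT <- data.table(
  Time = c(1, 1, 2, 4, 5, 5, 6, 7, 8, 8, 8, 8, 12, 13),
  X    = c(6, 6, 7, 10, 5, 7, 6, 3, 9, 10, 6, 3, 3, 6),
  Y    = c(2, 6, 10, 3, 4, 6, 6, 9, 4, 9, 6, 6, 9, 9)
)
Timeinterval      <- 5
RangeX.percentage <- 0.5
RangeY.percentage <- 0.5

# Per-row bounds (illustrative column names, not from the original post)
DT[, `:=`(
  Tlo = Time - Timeinterval,           Thi = Time + Timeinterval,
  Xlo = X * (1 - RangeX.percentage),   Xhi = X * (1 + RangeX.percentage),
  Ylo = Y * (1 - RangeY.percentage),   Yhi = Y * (1 + RangeY.percentage)
)]

# Self-join: for each row of the inner DT, count the rows of the outer DT
# whose Time, X and Y fall strictly inside that row's bounds
res <- DT[DT,
          on = .(Time > Tlo, Time < Thi, X > Xlo, X < Xhi, Y > Ylo, Y < Yhi),
          .N, by = .EACHI]

# by = .EACHI returns one row per row of the inner DT, in the same order
DT[, count := res$N]
DT[, .(Time, X, Y, count)]
```

On the sample data this should give the same count column as the loop in the question; whether it is faster than foverlaps on a million rows would need to be measured.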
