简体   繁体   中英

Check if a value in one table (X) is between the values in two columns in another table (Y) with R data.table

Horrible title question, but this is what I am trying to achieve. For Table1 I want to add the Column "BETWEEN", verifying if the "POSITION" falls between any of the "START" and "STOP" values for the corresponding "BIN" in Table2.

Table1 . BIN names (character) and POSITION in BIN (numeric):

  BIN    POSITION
    1          12
    1          52
    1          86
    7           6
    7          22
    X         112
    X         139
   MT           3
   MT          26

Table2 : BIN name (character) and START and STOP positions (numeric)

  BIN    START    STOP
    1        2      64
    1       90     110
    7       20     100
    7      105     200
    X        1       5
   MT        1    1000

And the desired result - Table 1 with "BETWEEN":

CHROM    POSITION      BETWEEN
    1          12         TRUE
    1          52         TRUE
    1          86        FALSE
    7           6        FALSE
    7          22         TRUE
    X         112        FALSE
    X         139        FALSE
   MT           3         TRUE
   MT          26         TRUE

My Table 1 has about 4,000,000 rows, and Table 2 about 500,000 rows, and anything I came up with was very slow.

As an example of bigger tables, use the following:

positions <- seq(1,100000,10)
bins <- c("A","B","C","D","E","F","G","H","I","J")

tab1 <- data.table(bin = rep(bins,1,each=length(positions)), pos = rep(positions,10))

tab2 <- data.table(bin = rep(bins,1,each=2000), start = seq(5,100000,50), stop = start+25)

The desired output would be:

tab1
        bin   pos    between
     1:   A     1    FALSE
     2:   A    11    TRUE
     3:   A    21    TRUE
     4:   A    31    FALSE
     5:   A    41    FALSE

The following method requires that for a given bin, the bins are mutually exclusive. (eg you cant have bin A with bounds 1-5 and another bin A with bounds 4-8.) Also, I modified your example a bit.

positions <- seq(1,100000,10)
bins <- c("A","B","C","D","E","F","G","H","I","J")
tab1 <- data.table(bin = rep(bins,1,each=length(positions)), pos = rep(positions,10))
setkey(tab1,"bin","pos")

tab2 <- data.table(bin = rep(bins,1,each=2000), start = seq(5,100000,50))
tab2[, end := start+25]

tab2[,pos:=start]
setkey(tab2,"bin","pos")
x<-tab2[tab1, roll=TRUE, nomatch=0]

tab2[,pos:=end]
setkey(tab2,"bin","pos")
y<-tab2[tab1, roll=-Inf, nomatch=0]

setkey(x,"bin","pos","start")
setkey(y,"bin","pos","start")
inBin<-x[y,nomatch=0]
inBin[, between:=TRUE]

setkey(tab1,"bin","pos")
setkey(inBin,"bin","pos")

result<-inBin[,list(bin,pos,between)][tab1]
result[is.na(between), between:=FALSE]

I don't have the time to explain my solution in depth right now. Instead I'll take the cheap way out and refer you to research the roll parameter of data.table. The basic methodology above is that I'm joining tab1 and tab2, rolling pos forward to the nearest end bound. Then I join tab1 and tab2, rolling pos backward to the nearest start bound. Then I do an inner join on the those two sets, giving me all rows in tab1 which fall inside the bounds of a bin. From that point, it's just grunt work.

Most straightforward approach is to nest the matching loops I think. You may be need to handle factors slightly differently. I haven't tested to see what happens if it does not find a bin match.

BIN <- c("1","1","1","7","7","X","X","MT","MT")
POSITION <- c(12,52,86,6,22,112,139,3,26)
npos <- length(POSITION)
BETWEEN <- vector(mode="logical",length=npos)
tab1 <- as.data.frame(cbind(BIN,POSITION))

BIN2 <- c("1","1","7","7","X","MT")
START <- c(2,90,20,105,1,1)
STOP <- c(64,110,100,200,5,1000)
tab2 <- as.data.frame(cbind(BIN2,START,STOP))

bins <- unique(tab1$BIN)

for(bin in bins){
  #print(paste("bin=",bin))
  t1.bin.matches <- which(tab1$BIN==bin)
  t2.bin.compares <- which(tab2$BIN2==bin)
  #print(t1.bin.matches)
  #print(t2.bin.compares)
  for(match in t1.bin.matches){
    between = FALSE
    candidate = as.numeric(as.vector(tab1$POSITION)[match])
    for(compare in t2.bin.compares){
      comp.start <- as.numeric(as.vector(tab2$START)[compare])
      comp.stop <- as.numeric(as.vector(tab2$STOP)[compare])
      if(candidate>=comp.start&&candidate<=comp.stop){
        between = TRUE
        break
      }
    }
    #print(paste(comp.start,candidate,comp.stop,between))
    BETWEEN[match] = between
  }
}
tab1 <- as.data.frame(cbind(tab1,BETWEEN))
tab1

Make sure your BIN columns are character, POSITION, START, END are numeric.

Table1$BIN = as.character(Table1$BIN)
Table1$POSITION = as.numeric(Table1$POSITION)
Table2$BIN = as.character(Table2$BIN)
Table2$START = as.numeric(Table2$START)
Table2$STOP = as.numeric(Table2$STOP)

Convert your data.frame to library(data.table) because the code below might be slow.

Table1 = as.data.table(Table1)
Table2 = as.data.table(Table2)

Generate desired output

z = apply(Table1, 1, function(x) {nrow(Table2[(as.numeric(x[2])>START) & (as.numeric(x[2])<STOP) & (BIN == as.character(x[1])),])>0})
cbind(Table1, z)

Old function is z(), new is y(). With the sample Table1, Table2, the new function is 30% faster. I don't know how this advantage will scale as nrow increases, but I'm guessing this scaling will be very positive. Let me know.

z = function(a){apply(Table1, 1, function(x) {z = subset(Table2, Table2$BIN == as.character(x[1])) 
                                                  any(as.numeric(x[2])>z$START & as.numeric(x[2])<z$STOP)})}

y = function(a){apply(Table1, 1, function(x) {nrow(Table2[(as.numeric(x[2])>START) & (as.numeric(x[2])<STOP) & (BIN == as.character(x[1])),])>0})}


microbenchmark(z(), y(), times = 1000L)

 expr      min       lq   median       uq      max neval
  z() 1168.283 1219.793 1237.791 1276.267 3481.576  1000
  y()  809.575  848.052  863.257  885.909 1683.383  1000

edit: you might need to muck with the as.numeric, and as.character in the subsetting. I lost the data.table I created earlier and directly used the answer above's data.frame.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM