Finding the overlap between two data frames in R, how can I make my code more efficient?

I have two data frames in R. The first has two columns, "chr" and "position"; the second has three columns, "chr", "start" and "end". I want to select the rows of the first data frame whose chr value matches a row of the second data frame and whose "position" also falls within that row's start-end interval.

I have written a function in R that gives me the desired output, but it is very slow when I run it on huge data frames.

# My DataFrames are:

bed <- data.frame(Chr = c(rep("chr1",4),rep("chr2",3),rep("chr3",1)),
                  x1 = c(5,20,44,67,5,20,44,20),
                  x3=c(12,43,64,94,12,43,64,63))

snv <- data.frame(Chr = c(rep("chr1",6),rep("chr3",6)),
                  position = c(5,18,46,60,80,90,21,60,75,80,84,87))

# My function is:

get_overlap <- function(df, position, chrom){
  overlap <- FALSE
  for (row in 1:nrow(df)){
    chr = df[row, 1]
    start = df[row, 2]
    end = df[row, 3]
    if(chr == chrom & position %in% seq(start, end)){
      overlap <- TRUE
    }
  }
  return(overlap)
}

# The code is:

overlap_vector = c()
for (row in 1:nrow(snv)){
  chrom = snv[row, 1]
  position = snv[row, 2]
  overlap <- get_overlap(bed, position, chrom)
  overlap_vector <- c(overlap_vector, overlap)
}

print(snv[overlap_vector,])

How can I make this more efficient? I have never worked with hash tables; could they be the answer?

I'm sure there's a more elegant solution, but this works. First I load the package.

# Load package
library(data.table)

Then, I define the data tables.

# Define data tables
bed <- data.table(Chr = c(rep("chr1",4),rep("chr2",3),rep("chr3",1)),
                  start = c(5,20,44,67,5,20,44,20),
                  end = c(12,43,64,94,12,43,64,63))

snv <- data.table(Chr = c(rep("chr1",6),rep("chr3",6)),
                  position = c(5,18,46,60,80,90,21,60,75,80,84,87))

Here, I do a non-equi join on position against start/end and an equi join on Chr. I assume you want to keep all columns, so I specified them in the j argument, and I used na.omit() to drop the rows without matches.

na.omit(bed[snv, 
            .(Chr, start = x.start, end = x.end, position = i.position), 
            on = c("start <= position", "end >= position", "Chr == Chr")])
#>     Chr start end position
#> 1: chr1     5  12        5
#> 2: chr1    44  64       46
#> 3: chr1    44  64       60
#> 4: chr1    67  94       80
#> 5: chr1    67  94       90
#> 6: chr3    20  63       21
#> 7: chr3    20  63       60

Created on 2019-08-21 by the reprex package (v0.3.0)
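
The join above returns one row per matching interval, with the interval columns attached. If only the overlapping snv rows are wanted, as in the question's printed output, a minimal sketch of the same non-equi join could look like this. It assumes data.table 1.12.0 or later for nomatch = NULL, and the name snv_overlap is introduced here only for illustration.

```r
library(data.table)

bed <- data.table(Chr = c(rep("chr1", 4), rep("chr2", 3), rep("chr3", 1)),
                  start = c(5, 20, 44, 67, 5, 20, 44, 20),
                  end = c(12, 43, 64, 94, 12, 43, 64, 63))

snv <- data.table(Chr = c(rep("chr1", 6), rep("chr3", 6)),
                  position = c(5, 18, 46, 60, 80, 90, 21, 60, 75, 80, 84, 87))

# Same non-equi join, but drop non-matching snv rows inside the join
# (nomatch = NULL) and keep only the snv columns in j.
hits <- bed[snv,
            .(Chr, position = i.position),
            on = .(Chr, start <= position, end >= position),
            nomatch = NULL]

# A position covered by several intervals would appear once per interval,
# so collapse duplicates. Caveat: this also collapses rows that were
# already duplicated in snv itself.
snv_overlap <- unique(hits)
snv_overlap
```

With the sample data the intervals within a chromosome do not overlap each other, so unique() changes nothing here; it only matters for overlapping intervals.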


Edit

A quick benchmark shows that Nathan's solution is about twice as fast!

Unit: milliseconds
         expr      min       lq     mean   median       uq      max neval
 NathanWren() 1.684392 1.729557 1.819263 1.751520 1.787829 5.138546   100
   Lyngbakr() 3.336902 3.395528 3.603376 3.441933 3.496131 7.720925   100
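
Timings in this format are what the microbenchmark package prints. The benchmark code itself was not posted, so the following is only a sketch of how such a comparison could be set up. It assumes that NathanWren() and Lyngbakr() are thin wrappers around the two answers and that both use the start/end column names; the wrapper bodies are a guess, not the original code.

```r
library(data.table)
library(microbenchmark)

bed <- data.table(Chr = c(rep("chr1", 4), rep("chr2", 3), rep("chr3", 1)),
                  start = c(5, 20, 44, 67, 5, 20, 44, 20),
                  end = c(12, 43, 64, 94, 12, 43, 64, 63))

snv <- data.table(Chr = c(rep("chr1", 6), rep("chr3", 6)),
                  position = c(5, 18, 46, 60, 80, 90, 21, 60, 75, 80, 84, 87))

# Assumed wrapper: non-equi join, then drop unmatched rows
Lyngbakr <- function() {
  na.omit(bed[snv,
              .(Chr, start = x.start, end = x.end, position = i.position),
              on = .(Chr, start <= position, end >= position)])
}

# Assumed wrapper: join on Chr only, then filter with between()
NathanWren <- function() {
  bed[snv, on = "Chr", allow.cartesian = TRUE][
    between(position, lower = start, upper = end)
  ]
}

# Run each expression 100 times and summarise the timings
timings <- microbenchmark(NathanWren(), Lyngbakr(), times = 100)
print(timings)
```

Timings on a 12-row table mostly measure call overhead, so the ranking may differ on large inputs; it is worth repeating the comparison on data of realistic size.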

The data.table package is great for fast merging of tables. It also comes with a vectorized between function for just this type of task.

library(data.table)

# Convert the data.frames to data.tables
setDT(bed)
setDT(snv)

# Use the join syntax for data.table, then filter for the desired rows
overlap_dt <- bed[
  snv,
  on = "Chr",
  allow.cartesian = TRUE # many-to-many matching
][
  between(position, lower = x1, upper = x3)
]

overlap_dt
#     Chr x1 x3 position
# 1: chr1  5 12        5
# 2: chr1 44 64       46
# 3: chr1 44 64       60
# 4: chr1 67 94       80
# 5: chr1 67 94       90
# 6: chr3 20 63       21
# 7: chr3 20 63       60
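
One thing to keep in mind: the join on Chr builds every position-by-interval pair within a chromosome before filtering, which can get large on big inputs. As an alternative sketch (a different technique, not part of the answer above), data.table's foverlaps() can do the interval lookup directly if each position is treated as a zero-width interval. The helper columns pos_start and pos_end and the names snv_iv and overlap_fo are introduced here only for illustration.

```r
library(data.table)

bed <- data.table(Chr = c(rep("chr1", 4), rep("chr2", 3), rep("chr3", 1)),
                  x1 = c(5, 20, 44, 67, 5, 20, 44, 20),
                  x3 = c(12, 43, 64, 94, 12, 43, 64, 63))

snv <- data.table(Chr = c(rep("chr1", 6), rep("chr3", 6)),
                  position = c(5, 18, 46, 60, 80, 90, 21, 60, 75, 80, 84, 87))

# Treat each position as a zero-width interval [position, position]
snv_iv <- copy(snv)[, `:=`(pos_start = position, pos_end = position)]

# foverlaps() requires the lookup table to be keyed, with the two
# interval columns as the last key columns
setkey(bed, Chr, x1, x3)

hits <- foverlaps(snv_iv, bed,
                  by.x = c("Chr", "pos_start", "pos_end"),
                  type = "within")

# With the default nomatch = NA, unmatched positions carry NA in the
# bed columns, so filter them out and keep the same columns as above
overlap_fo <- hits[!is.na(x1), .(Chr, x1, x3, position)]
overlap_fo
```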
