I have two data frames in R. The first has two columns, "chr" and "position"; the second has three columns, "chr", "start" and "end" (in the example below they are named Chr, x1 and x3). I want to select the rows of the first data frame whose chr value matches a row of the second data frame and whose "position" falls within that row's start-end interval.
I have written a function in R that gives me the desired output, but it is very slow when I run it on huge data frames.
# My DataFrames are:
bed <- data.frame(Chr = c(rep("chr1",4),rep("chr2",3),rep("chr3",1)),
                  x1 = c(5,20,44,67,5,20,44,20),
                  x3 = c(12,43,64,94,12,43,64,63))
snv <- data.frame(Chr = c(rep("chr1",6),rep("chr3",6)),
                  position = c(5,18,46,60,80,90,21,60,75,80,84,87))
# My function is:
get_overlap <- function(df, position, chrom){
  overlap <- FALSE
  for (row in 1:nrow(df)){
    chr = df[row, 1]
    start = df[row, 2]
    end = df[row, 3]
    if(chr == chrom & position %in% seq(start, end)){
      overlap <- TRUE
    }
  }
  return(overlap)
}
# The code is:
overlap_vector = c()
for (row in 1:nrow(snv)){
  chrom = snv[row, 1]
  position = snv[row, 2]
  overlap <- get_overlap(bed, position, chrom)
  overlap_vector <- c(overlap_vector, overlap)
}
print(snv[overlap_vector,])
How can I make this more efficient? I have never worked with hash tables; could they be the answer?
I'm sure there's a more elegant data.table solution, but this works. First I load the package.
# Load package
library(data.table)
Then I define the data tables (note that I renamed x1 and x3 to start and end):
# Define data tables
bed <- data.table(Chr = c(rep("chr1",4),rep("chr2",3),rep("chr3",1)),
                  start = c(5,20,44,67,5,20,44,20),
                  end = c(12,43,64,94,12,43,64,63))
snv <- data.table(Chr = c(rep("chr1",6),rep("chr3",6)),
                  position = c(5,18,46,60,80,90,21,60,75,80,84,87))
Here, I do a non-equi join on position and start / end, and an equi join on Chr. I assume you want to keep all columns, so I specified them in the j argument and omitted the rows without matches.
na.omit(bed[snv,
            .(Chr, start = x.start, end = x.end, position = i.position),
            on = c("start <= position", "end >= position", "Chr == Chr")])
#> Chr start end position
#> 1: chr1 5 12 5
#> 2: chr1 44 64 46
#> 3: chr1 44 64 60
#> 4: chr1 67 94 80
#> 5: chr1 67 94 90
#> 6: chr3 20 63 21
#> 7: chr3 20 63 60
Created on 2019-08-21 by the reprex package (v0.3.0)
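If only the matching rows of snv are needed (as the question asks), without the interval columns, a possible variation on the join above is to ask data.table for row numbers instead of joined rows. This is a minimal sketch, assuming the same toy data as above; the names hits and snv_in_bed are made up for illustration.

```r
library(data.table)

# Same toy data as above
bed <- data.table(Chr = c(rep("chr1",4),rep("chr2",3),rep("chr3",1)),
                  start = c(5,20,44,67,5,20,44,20),
                  end = c(12,43,64,94,12,43,64,63))
snv <- data.table(Chr = c(rep("chr1",6),rep("chr3",6)),
                  position = c(5,18,46,60,80,90,21,60,75,80,84,87))

# which = TRUE returns the row numbers of snv matched by each bed interval;
# nomatch = 0L drops intervals that contain no snv position.
hits <- snv[bed,
            on = .(Chr, position >= start, position <= end),
            which = TRUE,
            nomatch = 0L]

# A position covered by several intervals would appear more than once,
# so deduplicate before subsetting.
snv_in_bed <- snv[sort(unique(hits))]
snv_in_bed
```

With this data, seven rows of snv are kept (positions 5, 46, 60, 80, 90 on chr1 and 21, 60 on chr3), each appearing once even if several intervals were to cover it.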
A quick benchmark shows that Nathan's solution is about twice as fast!
Unit: milliseconds
         expr      min       lq     mean   median       uq      max neval
 NathanWren() 1.684392 1.729557 1.819263 1.751520 1.787829 5.138546   100
   Lyngbakr() 3.336902 3.395528 3.603376 3.441933 3.496131 7.720925   100
The data.table package is great for fast merging of tables. It also comes with a vectorized between function for just this type of task.
library(data.table)
# Convert the data.frames to data.tables
setDT(bed)
setDT(snv)
# Use the join syntax for data.table, then filter for the desired rows
overlap_dt <- bed[
  snv,
  on = "Chr",
  allow.cartesian = TRUE # many-to-many matching
][
  between(position, lower = x1, upper = x3)
]
overlap_dt
# Chr x1 x3 position
# 1: chr1 5 12 5
# 2: chr1 44 64 46
# 3: chr1 44 64 60
# 4: chr1 67 94 80
# 5: chr1 67 94 90
# 6: chr3 20 63 21
# 7: chr3 20 63 60
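For comparison, the same filter can be sketched in base R without data.table, replacing the row-by-row loops from the question with one vectorized comparison per chromosome. This is only an illustration under the question's column names (Chr, x1, x3, position); the helper in_any_interval is a made-up name, not part of any package.

```r
# Same toy data as in the question
bed <- data.frame(Chr = c(rep("chr1",4),rep("chr2",3),rep("chr3",1)),
                  x1 = c(5,20,44,67,5,20,44,20),
                  x3 = c(12,43,64,94,12,43,64,63))
snv <- data.frame(Chr = c(rep("chr1",6),rep("chr3",6)),
                  position = c(5,18,46,60,80,90,21,60,75,80,84,87))

# Hypothetical helper: TRUE for each snv row whose position lies inside
# at least one bed interval on the same chromosome (bounds inclusive).
in_any_interval <- function(snv, bed) {
  snv_chr <- as.character(snv$Chr)
  bed_chr <- as.character(bed$Chr)
  keep <- logical(nrow(snv))
  for (chrom in intersect(unique(snv_chr), unique(bed_chr))) {
    s_idx <- which(snv_chr == chrom)
    b <- bed[bed_chr == chrom, ]
    # outer() builds a positions-by-intervals logical matrix
    inside <- outer(snv$position[s_idx], b$x1, ">=") &
              outer(snv$position[s_idx], b$x3, "<=")
    keep[s_idx] <- rowSums(inside) > 0
  }
  keep
}

snv[in_any_interval(snv, bed), ]
```

Note that the matrix built by outer() has one cell per position-interval pair within a chromosome, so for very large inputs its memory use may become the limiting factor, much like the cartesian join above.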