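Finding the overlap between two data frames in R: how can I make my code more efficient?
I have two data frames in R. The first has two columns, one called "chr" and the other "position". The second has three columns: "chr", "start" and "end". I want to select the rows of the first data frame whose chr value matches a row of the second data frame and whose "position" falls within that row's start-to-end interval.
To do this, I wrote a function in R that gives me the output I need, but it runs very slowly on large data frames.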
# My DataFrames are:
bed <- data.frame(Chr = c(rep("chr1",4),rep("chr2",3),rep("chr3",1)),
                  x1 = c(5,20,44,67,5,20,44,20),
                  x3 = c(12,43,64,94,12,43,64,63))
snv <- data.frame(Chr = c(rep("chr1",6),rep("chr3",6)),
                  position = c(5,18,46,60,80,90,21,60,75,80,84,87))
# My function is:
get_overlap <- function(df, position, chrom){
  overlap <- FALSE
  for (row in 1:nrow(df)){
    chr = df[row, 1]
    start = df[row, 2]
    end = df[row, 3]
    if(chr == chrom & position %in% seq(start, end)){
      overlap <- TRUE
    }
  }
  return(overlap)
}
# The code is:
overlap_vector = c()
for (row in 1:nrow(snv)){
  chrom = snv[row, 1]
  position = snv[row, 2]
  overlap <- get_overlap(bed, position, chrom)
  overlap_vector <- c(overlap_vector, overlap)
}
print(snv[overlap_vector,])
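How can I make this more efficient? I have never used hash tables; could they be the solution?
I'm sure there is a more elegant data.table solution, but this works. First, I load the package.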
# Load package
library(data.table)
Then, I define the data tables.
# Define data tables
bed <- data.table(Chr = c(rep("chr1",4),rep("chr2",3),rep("chr3",1)),
                  start = c(5,20,44,67,5,20,44,20),
                  end = c(12,43,64,94,12,43,64,63))
snv <- data.table(Chr = c(rep("chr1",6),rep("chr3",6)),
                  position = c(5,18,46,60,80,90,21,60,75,80,84,87))
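Here, I perform a non-equi join on position and start / end, and an equi join on Chr. I assume you want to keep all the columns, so I specify them in the j argument and omit the rows that have no match.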
na.omit(bed[snv,
            .(Chr, start = x.start, end = x.end, position = i.position),
            on = c("start <= position", "end >= position", "Chr == Chr")])
#> Chr start end position
#> 1: chr1 5 12 5
#> 2: chr1 44 64 46
#> 3: chr1 44 64 60
#> 4: chr1 67 94 80
#> 5: chr1 67 94 90
#> 6: chr3 20 63 21
#> 7: chr3 20 63 60
Created on 2019-08-21 by the reprex package (v0.3.0)
A quick benchmark shows that Nathan's solution is about twice as fast!
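If only the matching rows of snv are wanted (which is what the question prints), the same non-equi join can return row numbers instead of joined columns. The following is a minimal sketch rather than part of the original answer: it assumes the bed and snv tables defined above, the names hit and snv_in_bed are illustrative, and it relies on the which and nomatch arguments of a data.table join.

```r
library(data.table)

bed <- data.table(Chr   = c(rep("chr1", 4), rep("chr2", 3), rep("chr3", 1)),
                  start = c(5, 20, 44, 67, 5, 20, 44, 20),
                  end   = c(12, 43, 64, 94, 12, 43, 64, 63))
snv <- data.table(Chr      = c(rep("chr1", 6), rep("chr3", 6)),
                  position = c(5, 18, 46, 60, 80, 90, 21, 60, 75, 80, 84, 87))

# For every interval in bed, find the snv rows on the same chromosome whose
# position lies inside it. which = TRUE returns snv row numbers instead of
# joined columns; nomatch = NULL drops intervals that contain no position.
hit <- snv[bed,
           on = .(Chr, position >= start, position <= end),
           which = TRUE, nomatch = NULL]

# A position inside several intervals would be reported several times,
# so de-duplicate and restore the original row order before subsetting.
snv_in_bed <- snv[sort(unique(hit))]
snv_in_bed
```

Subsetting by row number keeps snv unchanged, so no NA rows need to be removed afterwards.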
Unit: milliseconds
expr min lq mean median uq max neval
NathanWren() 1.684392 1.729557 1.819263 1.751520 1.787829 5.138546 100
Lyngbakr() 3.336902 3.395528 3.603376 3.441933 3.496131 7.720925 100
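The code behind this benchmark is not shown in the answer. As a rough sketch of how such a comparison could be reproduced, the snippet below assumes the microbenchmark package (whose output layout matches the table above) and wraps the two answers in functions named after the expr column. The wrapper bodies are reconstructions from the two answers in this thread, using the start / end column names, and are not the original benchmarking code.

```r
library(data.table)
library(microbenchmark)  # assumption: the original benchmark tool is not stated

bed <- data.table(Chr   = c(rep("chr1", 4), rep("chr2", 3), rep("chr3", 1)),
                  start = c(5, 20, 44, 67, 5, 20, 44, 20),
                  end   = c(12, 43, 64, 94, 12, 43, 64, 63))
snv <- data.table(Chr      = c(rep("chr1", 6), rep("chr3", 6)),
                  position = c(5, 18, 46, 60, 80, 90, 21, 60, 75, 80, 84, 87))

# Non-equi join, then drop the snv rows that matched no interval
Lyngbakr <- function() {
  na.omit(bed[snv,
              .(Chr, start = x.start, end = x.end, position = i.position),
              on = c("start <= position", "end >= position", "Chr == Chr")])
}

# Join on Chr only (all interval/position pairs per chromosome), then filter
NathanWren <- function() {
  bed[snv, on = "Chr", allow.cartesian = TRUE][
    between(position, lower = start, upper = end)
  ]
}

bm <- microbenchmark(NathanWren(), Lyngbakr(), times = 100)
print(bm)
```

Timings on a 12-row table mostly measure fixed overhead. On large inputs the join on Chr alone materialises every interval/position pair within a chromosome before filtering, so the ranking may well differ; it is worth rerunning the comparison on data of realistic size.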
The data.table package is great for merging tables quickly. It also provides a vectorized between function for tasks like this.
library(data.table)
# Convert the data.frames to data.tables
setDT(bed)
setDT(snv)
# Use the join syntax for data.table, then filter for the desired rows
overlap_dt <- bed[
  snv,
  on = "Chr",
  allow.cartesian = TRUE # many-to-many matching
][
  between(position, lower = x1, upper = x3)
]
overlap_dt
# Chr x1 x3 position
# 1: chr1 5 12 5
# 2: chr1 44 64 46
# 3: chr1 44 64 60
# 4: chr1 67 94 80
# 5: chr1 67 94 90
# 6: chr3 20 63 21
# 7: chr3 20 63 60
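For readers who prefer to stay in base R, much of the slowdown in the question's code comes from the row-by-row inner loop, from seq(start, end) building every integer in each interval, and from growing overlap_vector one element at a time. The sketch below is an illustrative alternative, not taken from either answer: it keeps the question's column names (Chr, x1, x3), replaces the inner loop with vectorized comparisons, and uses a hypothetical helper called in_any_interval.

```r
bed <- data.frame(Chr = c(rep("chr1", 4), rep("chr2", 3), rep("chr3", 1)),
                  x1  = c(5, 20, 44, 67, 5, 20, 44, 20),
                  x3  = c(12, 43, 64, 94, 12, 43, 64, 63),
                  stringsAsFactors = FALSE)
snv <- data.frame(Chr      = c(rep("chr1", 6), rep("chr3", 6)),
                  position = c(5, 18, 46, 60, 80, 90, 21, 60, 75, 80, 84, 87),
                  stringsAsFactors = FALSE)

# TRUE if the position lies inside at least one interval on the same
# chromosome. All rows of bed are compared at once, with no inner loop
# and no call to seq().
in_any_interval <- function(chrom, position) {
  any(bed$Chr == chrom & bed$x1 <= position & position <= bed$x3)
}

# One logical value per snv row, computed without growing a vector in a loop
keep <- mapply(in_any_interval, snv$Chr, snv$position, USE.NAMES = FALSE)

snv[keep, ]
```

This still does work proportional to nrow(snv) times nrow(bed), so for very large inputs the data.table joins above are likely to scale better.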