How to filter rows by column value ranges in R?
I have two genetic datasets. The first defines ranges in the genome, one per row. The second has one gene per row with its start and end positions, and I want to keep only the genes whose ranges do not overlap any range in the first dataset.
For example, my data looks like this:
#df1:
Chromosome Min Max
1 10 500
1 450 550
2 20 100
2 900 1500
2 200 210
3 5 15
4 10 20
#df2:
Gene Gene.Start Gene.End Chromosome
Gene1 10 60 1
Gene2 950 990 1
Gene3 8 14 3
I want to select the rows in df2 whose Gene.Start to Gene.End range does not overlap any Min to Max range in df1. Importantly, a df1 range only counts when its Chromosome number also matches.
The expected output from the example would look like:
Gene Gene.Start Gene.End Chromosome
Gene2 950 990 1
Gene2 is the only gene/row whose start-to-end range does not fall in any df1 range with a matching Chromosome (here, the ranges on Chromosome 1).
To code this I am trying data.table, but I'm not sure how to get the ranges to be considered the way I want. This is my attempt so far, but I'm not sure what I'm doing:
df2[df1, match := i.Gene,
on = .(Chromosome, (df2$Gene.Start > & < df2$Gene.End) > Min, (df2$Gene.Start > & < df2$Gene.End) < Max)]
Error: unexpected '&'
What can I do to filter a dataframe by its ranges depending on ranges in another dataframe?
Example input data:
df1 <- structure(list(Chromosome = c(1L, 1L, 2L, 2L, 2L, 3L, 4L), Min = c(10L,
450L, 20L, 900L, 200L, 5L, 10L), Max = c(500L, 550L, 100L, 1500L,
210L, 15L, 20L)), row.names = c(NA, -7L), class = c("data.table",
"data.frame"))
df2 <- structure(list(Gene = c("Gene1", "Gene2", "Gene3"), Gene.Start = c(10L,
950L, 8L), Gene.End = c(60L, 990L, 14L), Chromosome = c(1L, 1L,
3L)), row.names = c(NA, -3L), class = c("data.table", "data.frame"
))
Here is a data.table approach:
library(data.table)
# keep the genes that are NOT matched by the non-equi join on df1 below
# (a match means same Chromosome, Gene.Start >= Min and Gene.End <= Max)
df2[!Gene %in% df2[df1, on = .(Chromosome, Gene.Start >= Min, Gene.End <= Max)]$Gene, ]
# Gene Gene.Start Gene.End Chromosome
# 1: Gene2 950 990 1
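Note that the join above matches a gene only when it lies entirely inside a df1 range (Gene.Start >= Min and Gene.End <= Max). If partial overlaps should also exclude a gene, the join conditions can be flipped to the usual interval-overlap test. The following is a minimal sketch, assuming closed intervals; Gene4 and the shortened range table are hypothetical rows added for illustration and are not part of the original data.

```r
library(data.table)

# Hypothetical data: a subset of the df1 ranges plus an extra Gene4
# that only partially overlaps the 450-550 range on chromosome 1.
ranges <- data.table(
  Chromosome = c(1L, 1L, 3L),
  Min = c(10L, 450L, 5L),
  Max = c(500L, 550L, 15L)
)
genes <- data.table(
  Gene = c("Gene1", "Gene2", "Gene3", "Gene4"),
  Gene.Start = c(10L, 950L, 8L, 540L),
  Gene.End = c(60L, 990L, 14L, 600L),
  Chromosome = c(1L, 1L, 3L, 1L)
)

# Two closed intervals overlap when each one starts before the other ends:
# Gene.Start <= Max and Gene.End >= Min (on the same chromosome).
overlapping <- genes[ranges,
                     on = .(Chromosome, Gene.Start <= Max, Gene.End >= Min),
                     nomatch = NULL]$Gene

# Keep only the genes that never appeared in the overlap join.
keep <- !(genes$Gene %in% overlapping)
no_overlap <- genes[keep]
print(no_overlap)
```

With these rows, Gene4 would slip through a containment-only join, while the overlap test excludes it along with Gene1 and Gene3.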
Here is my attempt with a dplyr approach. Please let me know if it works for you.
library(dplyr)
# pair each gene with every df1 range on the same chromosome,
# then keep only the pairs whose intervals overlap
overlapping <- df2 %>%
  inner_join(df1, by = "Chromosome") %>%
  filter(Gene.Start <= Max, Gene.End >= Min)
# drop every gene that overlaps at least one range
df2 %>%
  anti_join(overlapping, by = "Gene") %>%
  select(Gene, Gene.Start, Gene.End, Chromosome)
Output:
Gene Gene.Start Gene.End Chromosome
1 Gene2 950 990 1
The data.table solution works best, as it is the fastest on my much larger real data, but I did end up finding another solution with GenomicRanges, so I thought I'd share it for anyone else's future reference:
library(GenomicRanges)
gr1 <- makeGRangesFromDataFrame(
data.frame(
chr=df1$Chromosome,
start=df1$Min,
end=df1$Max),
keep.extra.columns=TRUE)
gr2 <- makeGRangesFromDataFrame(
data.frame(
chr=df2$Chromosome,
start=df2$Gene.Start,
end=df2$Gene.End,
Gene = df2$Gene),
keep.extra.columns=TRUE)
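# invert = TRUE keeps the ranges in gr2 that overlap nothing in gr1; unlike negative
# indexing with queryHits(), it still works when there are no overlaps at all
no_overlaps <- subsetByOverlaps(gr2, gr1, type="any", invert=TRUE)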
no_overlap_genes <- unique(no_overlaps$Gene)
gene_matches <- df2[Gene %in% no_overlap_genes]
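For anyone staying within data.table, the package also ships foverlaps() for interval overlap joins. The sketch below is only an illustration under stated assumptions: the table names ranges and genes are hypothetical stand-ins for df1 and df2, intervals are treated as closed, and every start is assumed to be less than or equal to its end.

```r
library(data.table)

# Hypothetical copies of the example data (a subset of the df1 ranges).
ranges <- data.table(
  Chromosome = c(1L, 1L, 3L),
  Min = c(10L, 450L, 5L),
  Max = c(500L, 550L, 15L)
)
genes <- data.table(
  Gene = c("Gene1", "Gene2", "Gene3"),
  Gene.Start = c(10L, 950L, 8L),
  Gene.End = c(60L, 990L, 14L),
  Chromosome = c(1L, 1L, 3L)
)

# foverlaps() needs the second table keyed, with the interval
# columns (start, end) as the last two key columns.
setkey(ranges, Chromosome, Min, Max)

# which = TRUE returns the matching row numbers (xid for genes, yid for ranges).
hits <- foverlaps(genes, ranges,
                  by.x = c("Chromosome", "Gene.Start", "Gene.End"),
                  type = "any", nomatch = NULL, which = TRUE)

# Keep the genes whose row number never shows up among the hits.
keep <- !(seq_len(nrow(genes)) %in% hits$xid)
no_overlap_fo <- genes[keep]
print(no_overlap_fo)
```

Building a logical keep vector, rather than indexing with negative row numbers, avoids returning zero rows when there are no hits at all.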