Find matching elements based on multiple arguments in R
I have a large data frame that looks like this. I want to find which genes match other genes, based on an overlap between their start and end positions.
library(tidyverse)
data <- data.frame(group = c(1, 1, 1, 2, 2, 2),
                   genes = c("A", "B", "C", "D", "E", "F"),
                   start = c(1000, 2000, 3000, 800, 400, 2000),
                   end = c(1500, 2500, 3500, 1200, 500, 10000))
data
#>   group genes start   end
#> 1     1     A  1000  1500
#> 2     1     B  2000  2500
#> 3     1     C  3000  3500
#> 4     2     D   800  1200
#> 5     2     E   400   500
#> 6     2     F  2000 10000
Created on 2022-12-05 with reprex v2.0.2
I want something like this.
data
#>   group genes start   end match
#> 1     1     A  1000  1500   A-D
#> 2     1     B  2000  2500   B-F
#> 3     1     C  3000  3500   C-F
#> 4     2     D   800  1200   A-D
#> 5     2     E   400   500    NA
#> 6     2     F  2000 10000 F-C-B
I am a bit lost on how to start. Any help is appreciated.
With the devel version of dplyr (join_by() and overlaps() have since been released in dplyr 1.1.0), we can use
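library(dplyr)
library(stringr)
# Self-join condition: keep pairs of rows whose [start, end] ranges overlap
by <- join_by(overlaps(x$start, x$end, y$start, y$end))
full_join(data, data, by) %>%
  group_by(genes = genes.x) %>%
  # A gene that only overlaps itself has a single row, so it gets NA
  summarise(match = if (n() == 1) NA_character_ else
    str_c(genes.y, collapse = '-')) %>%
  left_join(data, ., by = "genes")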
-output
  group genes start   end match
1     1     A  1000  1500   A-D
2     1     B  2000  2500   B-F
3     1     C  3000  3500   C-F
4     2     D   800  1200   A-D
5     2     E   400   500  <NA>
6     2     F  2000 10000 B-C-F
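If upgrading dplyr is not an option, the same overlap test can be written out by hand in base R. This is only a minimal sketch, not part of the original answer. It assumes closed intervals, so two ranges that merely touch at an endpoint count as overlapping (which I believe matches the default bounds of overlaps()). It also assumes the data frame is small enough for an n-by-n logical matrix to fit in memory. The name `ov` is introduced here purely for illustration.

```r
# Example data from the question
data <- data.frame(group = c(1, 1, 1, 2, 2, 2),
                   genes = c("A", "B", "C", "D", "E", "F"),
                   start = c(1000, 2000, 3000, 800, 400, 2000),
                   end = c(1500, 2500, 3500, 1200, 500, 10000))

# ov[i, j] is TRUE when row i and row j overlap, i.e.
# start_i <= end_j and end_i >= start_j (closed bounds, an assumption)
ov <- outer(data$start, data$end, `<=`) & outer(data$end, data$start, `>=`)

# For each gene, collect every gene it overlaps (itself included);
# a gene that overlaps only itself gets NA
data$match <- vapply(seq_len(nrow(data)), function(i) {
  hits <- data$genes[ov[i, ]]
  if (length(hits) == 1) NA_character_ else paste(hits, collapse = "-")
}, character(1))

data
```

On the example data this should give the same match column as the dplyr version. For a genuinely large data frame the quadratic matrix becomes the bottleneck, and a proper interval join is likely the better choice.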