Find matching elements based on multiple arguments in R

Question

I have a large data frame that looks like this. I want to find which genes match the others based on an overlap between their start and end positions.

library(tidyverse)

data <- data.frame(group = c(1, 1, 1, 2, 2, 2),
                   genes = c("A", "B", "C", "D", "E", "F"),
                   start = c(1000, 2000, 3000, 800, 400, 2000),
                   end   = c(1500, 2500, 3500, 1200, 500, 10000))

data
#>   group genes start   end
#> 1     1     A  1000  1500
#> 2     1     B  2000  2500
#> 3     1     C  3000  3500
#> 4     2     D   800  1200
#> 5     2     E   400   500
#> 6     2     F  2000 10000

Created on 2022-12-05 with reprex v2.0.2

I want something like this.

data
#>   group genes start   end   match
#> 1     1     A  1000  1500    A-D
#> 2     1     B  2000  2500    B-F
#> 3     1     C  3000  3500    C-F
#> 4     2     D   800  1200    A-D
#> 5     2     E   400   500    NA
#> 6     2     F  2000 10000    F-C-B

I am a bit lost on how to start. Any help is appreciated.

Answer 1

With the development version of dplyr, we can use join_by() with overlaps() to do an overlap join (both were later released in dplyr 1.1.0):

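library(dplyr)
library(stringr)

# Overlap join: pair each row with every row whose [start, end] range overlaps its own
by <- join_by(overlaps(x$start, x$end, y$start, y$end))

full_join(data, data, by) %>%
  group_by(genes = genes.x) %>%
  # Every gene overlaps itself, so a single row means there is no other match
  summarise(match = if (n() == 1) NA_character_ else
    str_c(genes.y, collapse = '-')) %>%
  left_join(data, ., by = "genes")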

Output

  group genes start   end match
1     1     A  1000  1500   A-D
2     1     B  2000  2500   B-F
3     1     C  3000  3500   C-F
4     2     D   800  1200   A-D
5     2     E   400   500  <NA>
6     2     F  2000 10000 B-C-F
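
If the overlap join is not available in the installed dplyr, the same overlap test can be sketched in base R. This is only a minimal illustration, not part of the answer above: it assumes closed intervals (touching endpoints count as an overlap), ignores the group column just as the join does, and builds an n x n logical matrix, so it is only practical for a moderate number of rows.

```r
# Same toy data as in the question.
data <- data.frame(group = c(1, 1, 1, 2, 2, 2),
                   genes = c("A", "B", "C", "D", "E", "F"),
                   start = c(1000, 2000, 3000, 800, 400, 2000),
                   end   = c(1500, 2500, 3500, 1200, 500, 10000))

# Two closed intervals overlap when each one starts no later than the other ends.
# outer() evaluates that test for every pair of rows (row i against row j).
overlap <- outer(data$start, data$end, `<=`) & outer(data$end, data$start, `>=`)

# For each row, paste the names of all overlapping genes (itself included);
# a gene that overlaps only itself gets NA.
data$match <- apply(overlap, 1, function(hit) {
  if (sum(hit) == 1) NA_character_ else paste(data$genes[hit], collapse = "-")
})

data
```

The matched genes are pasted in row order, so gene F ends up with B-C-F, the same as in the output above, rather than the F-C-B ordering shown in the question.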

Answer 1: accepted, score 2, posted 2022-12-04 23:27:55
