
Find which column ranges overlap after grouping in R

I have a huge data frame that looks like this.

I want to group_by(chr) and then, for each chr, find out:

  • Does any range1 (start1, end1) within the chr group overlap with any range2 (start2, end2)?

library(dplyr)

df1 <- tibble(chr=c(1,1,2,2),
               start1=c(100,200,100,200),
               end1=c(150,400,150,400),
       species=c("Penguin"), 
       start2=c(200,200,500,1000), 
       end2=c(250,240,1000,2000)
       )

df1
#> # A tibble: 4 × 6
#>     chr start1  end1 species start2  end2
#>   <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl>
#> 1     1    100   150 Penguin    200   250
#> 2     1    200   400 Penguin    200   240
#> 3     2    100   150 Penguin    500  1000
#> 4     2    200   400 Penguin   1000  2000

Created on 2023-01-05 with reprex v2.0.2

I want my data to look like this. Essentially, I want to check whether any range2 overlaps with any range1 within the same chr. The new data added further down does not change the question; it only serves to test the proposed code.

# A tibble: 4 × 7
        chr start1  end1 species start2  end2 OVERLAP
         1    100   150 Penguin    200   250    TRUE
         1    200   400 Penguin    200   240    TRUE
         2    100   150 Penguin    500  1000    FALSE
         2    200   400 Penguin   1000  2000    FALSE

I have fought a lot with the ivs package and iv_overlaps with no success in getting what I want.

Major EDIT:


When I apply any of the suggested code to my real data, I do not get the results I want, and I am confused as to why. The new dataset does not change the question; it only serves to test the proposed code.

data <- tibble::tribble(
  ~chr, ~start1, ~end1, ~strand, ~gene, ~start2, ~end2,
  "Chr2",   2739,   2840, "+", "A",    740,   1739,
  "Chr2",  12577,  12678, "+", "B",  10578,  11577,
  "Chr2",  22431,  22532, "+", "C",  20432,  21431,
  "Chr2",  32202,  32303, "+", "D",  30203,  31202,
  "Chr2",  42024,  42125, "+", "E",  40025,  41024,
  "Chr2",  51830,  51931, "+", "F",  49831,  50830,
  "Chr2",  82061,  84742, "+", "G",  80062,  81061,
  "Chr2",  84811,  86692, "+", "H",  82812,  83811,
  "Chr2",  86782,  88106, "-", "I",  88107,  89106,
  "Chr2", 139454, 139555, "+", "J", 137455, 138454,
  )

library(ivs)

data %>% 
  group_by(chr) %>% 
  mutate(overlap = any(iv_overlaps(iv(start1, end1), iv(start2, end2))))

This gives the following output:

   chr   start1   end1 strand gene  start2   end2 overlap
   <chr>  <dbl>  <dbl> <chr>  <chr>  <dbl>  <dbl> <lgl>  
 1 Chr2    2739   2840 +      A        740   1739 TRUE   
 2 Chr2   12577  12678 +      B      10578  11577 TRUE   
 3 Chr2   22431  22532 +      C      20432  21431 TRUE   
 4 Chr2   32202  32303 +      D      30203  31202 TRUE   
 5 Chr2   42024  42125 +      E      40025  41024 TRUE   
 6 Chr2   51830  51931 +      F      49831  50830 TRUE   
 7 Chr2   82061  84742 +      G      80062  81061 TRUE   
 8 Chr2   84811  86692 +      H      82812  83811 TRUE   
 9 Chr2   86782  88106 -      I      88107  89106 TRUE   
10 Chr2  139454 139555 +      J     137455 138454 TRUE

This is wrong. There might be indirect matches, but there is no direct overlap.

Scenario 1: Element-wise detection for overlapping

library(dplyr)

df1 %>%
  group_by(chr) %>%
  mutate(OVERLAP = any(start1 <= end2 & end1 >= start2)) %>%
  ungroup()

# # A tibble: 4 × 7
#     chr start1  end1 species start2  end2 OVERLAP
#   <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
# 1     1    100   150 Penguin    200   250 TRUE   
# 2     1    200   400 Penguin    200   240 TRUE   
# 3     2    100   150 Penguin    500  1000 FALSE  
# 4     2    200   400 Penguin   1000  2000 FALSE

Scenario 2: Element-wise detection for overlapping with sorting

If the intervals are directed, i.e. end can be less than start, then you need to sort each interval's endpoints before testing for overlaps.

df1 %>%
  group_by(chr) %>%
  mutate(OVERLAP = any(pmin(start1, end1) <= pmax(start2, end2) &
                       pmax(start1, end1) >= pmin(start2, end2)))
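
To see why the endpoint sorting matters, here is a minimal base R sketch. The coordinates are made up for illustration (they are not from the question), and some intervals are deliberately written with start > end.

```r
# Hypothetical coordinates: some intervals are written "backwards" (start > end).
start1 <- c(400, 100); end1 <- c(200, 150)
start2 <- c(250, 300); end2 <- c(300, 200)

# Naive test: assumes start <= end, so it misses the overlap in row 1.
naive <- start1 <= end2 & end1 >= start2

# Normalised test: order each pair of endpoints with pmin()/pmax() first.
normalised <- pmin(start1, end1) <= pmax(start2, end2) &
  pmax(start1, end1) >= pmin(start2, end2)

naive       # FALSE FALSE
normalised  # TRUE FALSE
```

Row 1 compares (400, 200) with (250, 300): once the first interval is read as [200, 400], the overlap becomes visible.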

Scenario 3: Cross detection for overlapping with sorting

Furthermore, if you want to check whether an interval (start1, end1) overlaps any of the intervals (start2, end2), which is how ivs::iv_overlaps() works, you can implement it with purrr::map2_lgl().

df1 %>%
  group_by(chr) %>%
  mutate(OVERLAP = any(
    purrr::map2_lgl(start1, end1,
                    ~ any(min(.x, .y) <= pmax(start2, end2) &
                          max(.x, .y) >= pmin(start2, end2)))
  ))

There are several interpretations of your question, so here are three possible cases:

  1. Within a group, detect for each [start1, end1] whether it overlaps with any of [start2, end2].
  2. Within a group, detect whether any of [start1, end1] overlaps with any of [start2, end2].
  3. Within a group, detect whether each [start1, end1] overlaps with its corresponding [start2, end2] (the one on the same row).

In all three cases, you can use ivs::iv_overlaps.
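
As a rough illustration of how the three readings differ, here is a base R sketch for a single group. It uses the chr == 1 rows from the Data section of this answer and assumes intervals are half-open, [start, end), hence the strict inequalities. The matrix `m` and the `case*` names are only for illustration.

```r
# One chr group: the chr == 1 rows of the Data section.
start1 <- c(100, 200); end1 <- c(150, 400)
start2 <- c(200, 160); end2 <- c(250, 170)

# m[i, j] is TRUE when interval1 i overlaps interval2 j
# (strict inequalities: intervals treated as half-open [start, end)).
m <- outer(seq_along(start1), seq_along(start2),
           function(i, j) start1[i] < end2[j] & start2[j] < end1[i])

case1 <- apply(m, 1, any)  # each interval1 vs any interval2
case2 <- any(m)            # any interval1 vs any interval2
case3 <- diag(m)           # each interval1 vs the interval2 on the same row

case1  # FALSE TRUE
case2  # TRUE
case3  # FALSE FALSE
```

The full pairwise matrix is only practical for small groups; it is meant to make the three definitions concrete, not to replace iv_overlaps.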


Case 1

iv_overlaps detects, within each group, whether each interval defined by [start1, end1] overlaps in any way with any of the intervals [start2, end2]. It returns a logical vector with one element per [start1, end1] interval.

library(ivs)
library(dplyr)
df1 %>% 
  group_by(chr) %>% 
  mutate(overlap = iv_overlaps(iv(start1, end1), iv(start2, end2)))

# A tibble: 4 × 7
# Groups:   chr [2]
    chr start1  end1 species start2  end2 overlap
  <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
1     1    100   150 Penguin    200   250 FALSE  
2     1    200   400 Penguin    160   170 TRUE   
3     2    100   150 Penguin    500  1000 FALSE  
4     2    200   400 Penguin   1000  2000 FALSE  

Case 2

If you want to know whether any (not each) of the intervals 1 overlaps with any of the intervals 2 (so a single value per group), wrap the call in any():

df1 %>% 
  group_by(chr) %>% 
  mutate(overlap = any(iv_overlaps(iv(start1, end1), iv(start2, end2))))

# A tibble: 4 × 7
# Groups:   chr [2]
    chr start1  end1 species start2  end2 overlap
  <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
1     1    100   150 Penguin    200   250 TRUE   
2     1    200   400 Penguin    160   170 TRUE   
3     2    100   150 Penguin    500  1000 FALSE  
4     2    200   400 Penguin   1000  2000 FALSE  

Case 3

If you want row-wise overlap detection, use purrr::map2_lgl() with iv_overlaps:

df1 %>% 
  group_by(chr) %>% 
  mutate(overlap = purrr::map2_lgl(iv(start1, end1), iv(start2, end2), iv_overlaps))

# A tibble: 4 × 7
# Groups:   chr [2]
    chr start1  end1 species start2  end2 overlap
  <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
1     1    100   150 Penguin    200   250 FALSE  
2     1    200   400 Penguin    160   170 FALSE  
3     2    100   150 Penguin    500  1000 FALSE  
4     2    200   400 Penguin   1000  2000 FALSE  

Order of the comparison

The order of the arguments matters: to test each second interval against the first intervals, use iv_overlaps(interval2, interval1):
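
The call behind the table that follows is not shown. A sketch of what it presumably looks like is Case 1 with the two iv() arguments swapped, run on the df1 from the Data section (`res` is just an illustrative name):

```r
library(ivs)
library(dplyr)

# df1 as defined in the Data section of this answer.
df1 <- tibble(chr = c(1, 1, 2, 2),
              start1 = c(100, 200, 100, 200),
              end1 = c(150, 400, 150, 400),
              species = c("Penguin"),
              start2 = c(200, 160, 500, 1000),
              end2 = c(250, 170, 1000, 2000))

# Test each [start2, end2] against all [start1, end1] of the same chr.
res <- df1 %>%
  group_by(chr) %>%
  mutate(overlap = iv_overlaps(iv(start2, end2), iv(start1, end1)))

res
```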

# A tibble: 4 × 7
# Groups:   chr [2]
    chr start1  end1 species start2  end2 overlap
  <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
1     1    100   150 Penguin    200   250 TRUE   
2     1    200   400 Penguin    160   170 FALSE  
3     2    100   150 Penguin    500  1000 FALSE  
4     2    200   400 Penguin   1000  2000 FALSE  

Data

df1 <- tibble(chr=c(1,1,2,2),
              start1=c(100,200,100,200),
              end1=c(150,400,150,400),
              species=c("Penguin"),
              start2=c(200,160,500,1000),
              end2=c(250,170,1000,2000))

If you want to check whether the overlap occurs in either direction, you need:

df1 %>%
  group_by(chr) %>%
  mutate(overlap = (max(end1) > min(start2) & min(start2) > min(start1))|
                   (max(end2) > min(start1) & min(start1) > min(start2))) 
#> # A tibble: 4 x 7
#> # Groups:   chr [2]
#>     chr start1  end1 species start2  end2 overlap
#>   <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
#> 1     1    100   150 Penguin    200   250 TRUE   
#> 2     1    200   400 Penguin    200   240 TRUE   
#> 3     2    100   150 Penguin    500  1000 FALSE  
#> 4     2    200   400 Penguin   1000  2000 FALSE

Created on 2023-01-05 with reprex v2.0.2

If your definition of overlap is not overlap as in Darren's answer (https://stackoverflow.com/a/75021631/11732165) but containment, i.e. ((start1 >= start2 & end1 <= end2) | (start2 >= start1 & end2 <= end1)), then take that answer and substitute the logic you want.

I use a cross join to make sure that all rows under the same chr are compared with each other.

Unfortunately, there undeniably IS a full containment in your test data:

 chr   start1   end1 strand gene  start2   end2 overlap
 7 Chr2   82061  84742 +      G      80062  81061 TRUE   
 8 Chr2   84811  86692 +      H      82812  83811 TRUE   

[start2, end2] for H is contained in [start1, end1] for G.
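
The two definitions can be checked directly on these numbers. The sketch below is base R only; `overlaps_any` and `contains_either` are hypothetical helper names, and the second pair of intervals is made up to show a partial overlap.

```r
# Closed-interval predicates for a range1 [s1, e1] and a range2 [s2, e2].
overlaps_any <- function(s1, e1, s2, e2) s1 <= e2 & e1 >= s2
contains_either <- function(s1, e1, s2, e2) {
  (s1 >= s2 & e1 <= e2) | (s2 >= s1 & e2 <= e1)
}

# range1 of gene G against range2 of gene H (values from the question's data).
overlaps_any(82061, 84742, 82812, 83811)     # TRUE
contains_either(82061, 84742, 82812, 83811)  # TRUE

# Hypothetical pair that overlaps only partially: [100, 200] and [150, 250].
overlaps_any(100, 200, 150, 250)     # TRUE
contains_either(100, 200, 150, 250)  # FALSE
```

Containment implies overlap, but not the other way round, so the containment check is the stricter of the two.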

Code follows. Note that performance will degrade rapidly if there are a lot of records under a single chr: over 200 is likely to be intolerable, and at that point you'll want an implementation that doesn't involve a self-cross.

# Flag every chr in which some row's range1 overlaps another row's range2.
check_overlap <- function(df) {
  df %>%
    mutate(temp_id = seq_len(nrow(df))) %>%   # row id, used to skip self-pairs
    inner_join(., ., by = 'chr') %>%          # pair up all rows within each chr
    filter(temp_id.x != temp_id.y) %>%
    mutate(overlaps = start1.x <= end2.y & end1.x >= start2.y) %>%
    group_by(chr) %>%
    summarise(OVERLAP = any(overlaps)) %>%
    inner_join(df, by = 'chr')                # attach the flag to the original rows
}

# Same as above, but a pair only counts if one range fully contains the other.
check_containment <- function(df) {
  df %>%
    mutate(temp_id = seq_len(nrow(df))) %>%
    inner_join(., ., by = 'chr') %>%
    filter(temp_id.x != temp_id.y) %>%
    mutate(overlaps = (start1.x >= start2.y & end1.x <= end2.y) |
                      (start2.y >= start1.x & end2.y <= end1.x)) %>%
    group_by(chr) %>%
    summarise(OVERLAP = any(overlaps)) %>%
    inner_join(df, by = 'chr')
}
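
For larger groups, one possible way around the explicit self-cross is a non-equi join. The sketch below assumes dplyr >= 1.1.0, which provides join_by() and its overlaps() helper, and an ungrouped input. `check_overlap_join` is a hypothetical name, and the function is an untuned illustration rather than a drop-in replacement: it keeps every chr (FALSE when no pair overlaps) and should avoid building every pair of rows before filtering.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for join_by() and overlaps()

check_overlap_join <- function(df) {
  # Row id so that a row is never matched against itself.
  indexed <- df %>% mutate(temp_id = row_number())

  # Keep only pairs within a chr whose range1 (x) overlaps range2 (y);
  # overlaps() uses closed bounds by default.
  hits <- indexed %>%
    inner_join(
      indexed,
      by = join_by(chr, overlaps(x$start1, x$end1, y$start2, y$end2)),
      suffix = c(".x", ".y")
    ) %>%
    filter(temp_id.x != temp_id.y)

  # A chr is flagged if at least one such pair exists.
  df %>% mutate(OVERLAP = chr %in% hits$chr)
}

# The question's example data.
df1 <- tibble(chr = c(1, 1, 2, 2),
              start1 = c(100, 200, 100, 200),
              end1 = c(150, 400, 150, 400),
              species = c("Penguin"),
              start2 = c(200, 200, 500, 1000),
              end2 = c(250, 240, 1000, 2000))

check_overlap_join(df1)
```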
