Find value of overlapping ranges of integers in R that are NOT times or genomes

I'm trying to calculate overlapping depth ranges for marine species and human activities. For each species there is a minimum and maximum depth at which it occurs, and I want to efficiently calculate how much of that depth range overlaps with the depth range of each of 4 different activities. I think this can be done with data.table::foverlaps() or IRanges::findOverlaps(), but I can't figure out how to calculate the size of the overlap, not just whether there is one. So if species d is found between 40 and 100 m depth, and activity 1 occurs at 0-50 m depth, the overlap is 10 m.

For example,

min_1 <- 0 
max_1 <- 50
min_2 <- 0 
max_2 <- 70
min_3 <- 0
max_3 <- 200
min_4 <- 0
max_4 <- 500

activities <- data.frame(min_1, max_1, min_2, max_2, min_3, max_3, min_4, max_4)

spp_id <- c("a", "b", "c", "d")
spp_depth_min <- c(0, 20, 30, 40)
spp_depth_max <- c(200, 500, 50, 100)

species <- data.frame(spp_id, spp_depth_min, spp_depth_max)

## data.table approach?

library(data.table)
setDT(activities)
setDT(species)

foverlaps(species, activities, ...) ## Or do I need to subset each activity and do separate calculations? 

Would it be easier to write a function? I'm really unfamiliar with that. This seems like it should be a common, easy thing to do, and I don't know why it's confusing me so much.

I restructured your activities table into long format so that all 4 calculations can be done at once. The overlap join is then done with foverlaps(), and the overlap length is calculated from the result.

library(data.table)

# min_1 ... max_4 are the scalars defined in the question
activities <- data.table(
  act = c('act_1','act_2','act_3','act_4'),
  a_min = c(min_1, min_2, min_3, min_4),
  a_max = c(max_1, max_2, max_3, max_4)
  )

spp_id <- c("a", "b", "c", "d")
spp_depth_min <- c(0, 20, 30, 40)
spp_depth_max <- c(200, 500, 50, 100)

species <- data.table(spp_id, spp_depth_min, spp_depth_max)

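# foverlaps() requires the second table (y) to be keyed on its interval columns
setkey(activities, a_min, a_max)

# overlap join: one row per overlapping species/activity pair
ol <- foverlaps(species, activities, 
  by.x = c('spp_depth_min','spp_depth_max'), 
  by.y = c('a_min','a_max')
  )
# overlap length = smaller end point minus larger start point
ol[, ol_length := pmin(spp_depth_max, a_max) - pmax(spp_depth_min, a_min)]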
ol
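If one row per species with one column per activity is easier to read, the long result can be reshaped afterwards. The following is only a sketch under the assumption that the toy data from the question are used; the name ol_wide is made up for illustration, and dcast() is data.table's long-to-wide reshaping function.

```r
library(data.table)

# Same toy data as in the question and the answer above.
activities <- data.table(
  act   = c("act_1", "act_2", "act_3", "act_4"),
  a_min = c(0, 0, 0, 0),
  a_max = c(50, 70, 200, 500)
)
species <- data.table(
  spp_id        = c("a", "b", "c", "d"),
  spp_depth_min = c(0, 20, 30, 40),
  spp_depth_max = c(200, 500, 50, 100)
)

# Overlap join and overlap length, as in the answer above.
setkey(activities, a_min, a_max)
ol <- foverlaps(species, activities,
  by.x = c("spp_depth_min", "spp_depth_max"),
  by.y = c("a_min", "a_max")
)
ol[, ol_length := pmin(spp_depth_max, a_max) - pmax(spp_depth_min, a_min)]

# ol_wide is a hypothetical name: one row per species, one column per activity.
ol_wide <- dcast(ol, spp_id ~ act, value.var = "ol_length")
ol_wide
```

Species that overlap with no activity at all would show up with NA in the long result under the default nomatch setting, so they may need separate handling before reshaping.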

For the sake of completeness, here is a version that uses a non-equi join instead of foverlaps() to find the overlaps. It also uses the original data as provided by the OP, i.e., with activities in wide format.

library(data.table)
vals <- c("min", "max")
melt(setDT(activities), measure.vars = patterns(vals), variable.name = "activity", value.name = vals)[
  setDT(species), on = .(max >= spp_depth_min, min <= spp_depth_max), 
    .(activity, spp_id, overlap = pmin(x.max, spp_depth_max) - pmax(x.min, spp_depth_min))]
    activity spp_id overlap
 1:        1      a      50
 2:        2      a      70
 3:        3      a     200
 4:        4      a     200
 5:        1      b      30
 6:        2      b      50
 7:        3      b     180
 8:        4      b     480
 9:        1      c      20
10:        2      c      20
11:        3      c      20
12:        4      c      20
13:        1      d      10
14:        2      d      30
15:        3      d      60
16:        4      d      60

Explanation

melt() is used to reshape activities from wide to long format with multiple measure variables simultaneously.
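To see what the reshaping step produces on its own, it can be run separately from the join. This is a sketch assuming the wide activities data from the question; the name long is chosen here for illustration only.

```r
library(data.table)

# Wide-format activities, as defined in the question.
activities <- data.table(
  min_1 = 0, max_1 = 50,
  min_2 = 0, max_2 = 70,
  min_3 = 0, max_3 = 200,
  min_4 = 0, max_4 = 500
)

# patterns() collects all columns matching "min" into one value column
# and all columns matching "max" into another; "activity" indexes the pairs.
long <- melt(activities,
  measure.vars  = patterns("min", "max"),
  variable.name = "activity",
  value.name    = c("min", "max")
)
long
```

The result should have one row per activity with columns activity, min and max, which is the shape the non-equi join expects.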

The condition for the non-equi join can be derived by some Boolean algebra as follows:

Two closed intervals $[a_1, a_2]$ and $[b_1, b_2]$ do not overlap if

$$b_2 < a_1 \ \text{OR}\ a_2 < b_1$$

The two intervals do overlap if this Boolean expression is negated:

$$
\begin{aligned}
&\text{NOT}(\, b_2 < a_1 \ \text{OR}\ a_2 < b_1 \,) \\
&= \text{NOT}(\, b_2 < a_1 \,) \ \text{AND}\ \text{NOT}(\, a_2 < b_1 \,) \\
&= b_2 \ge a_1 \ \text{AND}\ a_2 \ge b_1
\end{aligned}
$$

The join identifies all pairs of intervals which do overlap (which are 4 x 4 = 16 cases for the given dataset). The start point of the overlapping area of each pair is given by the larger start point of the two intervals and the end point is given by the smaller end point of the two intervals. The length of the overlap is the difference between the limit points.
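Since the question also asks whether writing a function would be easier, the rule described above can be sketched in base R without any join. The helper overlap_length() is hypothetical (it is not part of data.table or IRanges), and clamping at zero for intervals that do not overlap is an added assumption, not something the answers above do.

```r
# Hypothetical helper: length of the overlap of [a_min, a_max] and [b_min, b_max].
# Vectorised over its arguments via pmax()/pmin().
overlap_length <- function(a_min, a_max, b_min, b_max) {
  start <- pmax(a_min, b_min)  # larger of the two start points
  end   <- pmin(a_max, b_max)  # smaller of the two end points
  pmax(end - start, 0)         # assumption: report 0 when there is no overlap
}

# Species d (40-100 m) against activity 1 (0-50 m)
overlap_length(40, 100, 0, 50)
# 10

# All four species from the question against activity 1 (0-50 m)
spp_depth_min <- c(0, 20, 30, 40)
spp_depth_max <- c(200, 500, 50, 100)
res_act_1 <- overlap_length(spp_depth_min, spp_depth_max, 0, 50)
res_act_1
# 50 30 20 10
```

For a handful of activities this can simply be called once per activity; the join-based versions scale better when both tables are large.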
