Imagine I have this dataframe:
position coverage
1 30
2 2
3 1
4 8
5 2
6 3
7 20
8 40
. .
. .
100 0
101 2
102 3
103 4
104 1
105 40
I would like to find the regions where the coverage stays below 10 for at least 4 consecutive positions.
In this case, my desired output is:
start end
2 6
100 104
I tried a for loop, but I don't know how to make it work on a group of rows instead of row by row. How can I achieve this output?
We can use rleid from data.table. Create a grouping index from the runs of 'coverage' values less than 10, keep the 'position' values of the groups that have at least 4 rows and where all 'coverage' values are less than 10, then use 'grp' to get the first and last element of 'position' in each group.
library(data.table)
setDT(df1)[, position[.N >= 4 & all(coverage < 10)],
           .(grp = rleid(coverage < 10))][,
           .(start = first(V1), end = last(V1)), grp][, grp := NULL][]
# start end
#1: 2 6
#2: 100 104
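To make the grouping step more concrete, here is a minimal sketch that rebuilds the run index in base R for the sample rows. `run_id` is a hypothetical helper written for illustration, not a data.table function; it is only meant to mimic what rleid returns for a single vector.

```r
# Sample rows from the question (the elided rows are left out)
df1 <- data.frame(
  position = c(1:8, 100:105),
  coverage = c(30, 2, 1, 8, 2, 3, 20, 40, 0, 2, 3, 4, 1, 40)
)

# TRUE where the coverage is below the threshold
low <- df1$coverage < 10

# Hypothetical helper: start a new id every time the value changes
# (assumes a non-empty vector without NA)
run_id <- function(x) cumsum(c(TRUE, x[-1] != x[-length(x)]))

grp <- run_id(low)

# Rows that share a 'grp' value form one consecutive run
data.frame(df1, low = low, grp = grp)
```

Groups 2 and 4 are the two runs where `low` is TRUE, and each has 5 rows, so both pass the "at least 4 rows" filter.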
Or with dplyr (rleid still comes from data.table)
library(dplyr)
df1 %>%
  group_by(grp = data.table::rleid(coverage < 10)) %>%
  filter(all(coverage < 10), n() >= 4) %>%
  summarise(start = first(position), end = last(position)) %>%
  select(-grp)
Or with rle from base R
rl <- rle(df1$coverage < 10)
do.call(rbind, lapply(split(df1$position,
    rep(seq_along(rl$values), rl$lengths)), range)[rl$values & rl$lengths >= 4])
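If the threshold and the minimum run length need to vary, the same rle idea could be wrapped in a small function. This is only a sketch: `find_low_runs` and its argument names are hypothetical, it assumes the rows are already sorted by position and that coverage has no missing values, and, like the answers above, it counts consecutive rows rather than checking for gaps in position.

```r
# Hypothetical helper: start/end positions of runs where 'coverage'
# stays below 'threshold' for at least 'min_len' consecutive rows
find_low_runs <- function(position, coverage, threshold = 10, min_len = 4) {
  rl <- rle(coverage < threshold)
  ends <- cumsum(rl$lengths)        # row index where each run ends
  starts <- ends - rl$lengths + 1   # row index where each run starts
  keep <- rl$values & rl$lengths >= min_len
  data.frame(start = position[starts[keep]], end = position[ends[keep]])
}

# Sample rows from the question (the elided rows are left out)
position <- c(1:8, 100:105)
coverage <- c(30, 2, 1, 8, 2, 3, 20, 40, 0, 2, 3, 4, 1, 40)

res <- find_low_runs(position, coverage)
res
#   start end
# 1     2   6
# 2   100 104
```

Unlike the do.call(rbind, ...) version, this returns a data frame with named start and end columns.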
Data
df1 <- structure(list(
    position = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 100L, 101L, 102L, 103L, 104L, 105L),
    coverage = c(30L, 2L, 1L, 8L, 2L, 3L, 20L, 40L, 0L, 2L, 3L, 4L, 1L, 40L)),
    class = "data.frame", row.names = c(NA, -14L))