简体   繁体   中英

Find the highest value in an interval defined by an other file

I have these two dataset: Before that contains 5 columns (chromsome, start, end, line number, score)

chrI         861         870    87     5
chrI         871         880    88    11
chrI         881         890    89    11
chrI         891         900    90    19
chrI         901         910    91    19
chrI         911         920    92    20
chrI         921         930    93    20
chrI         931         940    94    20
chrI         941         950    95    19
chrI         951         960    96    19
chrI         961         970    97    19
chrI         971         980    98    19
chrI         981         990    99    25
chrI         991        1000    100   20
chrI        1001        1010    101   20
chrI        1011        1020    102   20
chrI        1021        1030    103   20
chrI        1031        1040    104   15
chrI        1041        1050    105   14
chrI        1051        1060    106   14
chrI        1061        1070    107   13
chrI        1071        1080    108   13
chrI        1081        1090    109   13
chrI        1091        1100    110    7
chrI        1101        1110    111    7

Peaks that contains 4 columns (chromsome, start, end, value)

"chrI"  880     1091    383
"chrI"  1350    1601    302
"chrI"  1680    1921    241
"chrI"  2220    2561    322
"chrI"  2750    2761    18
"chrI"  3100    3481    420
"chrI"  3660    4211    793
"chrI"  4480    4491    20
"chrI"  4710    4871    195
"chrI"  5010    5261    238

For each lines of Peaks I would like to extract the corresponding lines (eg all the lines between 880 and 1091 for the first line) in Before , find the highest score value and write it on a new file. Output

  chrI       981         990    99    25

To this end, I've written this function:

summit <- function(x,y,output){
    y<- Before
    chrom <- x[1]
    start <-x[2]
    end <-x[3]
    startLine <- y[which((y$V1 == chrom) & (y$V2==start)),]
    endLine <- y[which((y$V1 == chrom) & (y$V3==end)),]
    Subset <- y[which((y$V2 >= startLine$V2) & (y$V3 <= endLine$V2))]
    maximum <- Subset[which(Subset$V4 == max(Subset$V4))]
    output <- print(maximum)
}

apply(Peaks,1,summit,output = 'peaks_list.bed')

I don't have an error message but It runs during the entire night without giving me results so I guess something is wrong with my code but I don't know what.

I also try this:

Peaks_Range <- GRanges(seqnames=Peaks$V1, ranges=IRanges(start=Peaks$V2, end=Peaks$V3))
Before_Range <- GRanges(seqnames=Before$V1, ranges=IRanges(start=Before$V2, end=Before$V3),score=Before$V5)
Merged <- mergeByOverlaps(Peaks_Range,Before_Range)
Merged <- as.data.frame(Merged)


for (i in 1:nrow(Peaks)){
        start <-Peaks[i,2]
        end <-Peaks[i,3]
        Subset <- subset(Merged,Merged$Peaks_Range.start == start)
        maximum <- as.numeric(max(Subset$score))
        summit <- Subset[which(Subset$score == maximum),]
        write.table(summit,'peaks_list.bed', sep="\t", append=T, col.name = FALSE, row.names = FALSE, quote = FALSE)

}

It works (I think) but this is very very slow so I search an alternative way to do it.

Does anyone have any idea?

You can use cut to help you get the index.

setwd("/home/wang/Downloads")

before <- read.table("before.txt", header = F, stringsAsFactors = F)
colnames(before) <- c("chromosome", "start", "end", "line number", "score")
peaks <- read.table("peaks.txt", header = F, stringsAsFactors = F, quote = "\"")
colnames(peaks) <- c("chromosome", "start", "end", "value")

summit <- function(peaks_vec){
  chromosome = peaks_vec[1]
  start = as.numeric(peaks_vec[2])
  end = as.numeric(peaks_vec[3])
  filter_before = subset(before, chromosome == chromosome)
  up_index = cut(end, filter_before[,2], labels = F) +1
  down_index = cut(start, filter_before[,2], labels = F) +1
  if(!is.na(down_index) & !is.na(up_index)){
    new_filter_before = filter_before[down_index : up_index, ]
    max_index = which.max(new_filter_before[,5])
    return(unlist(new_filter_before[max_index,]))
  }else {
    return(rep(NA, 5)) # you can input what you want.
  }
}
result <- t(apply(as.matrix(peaks), 1, summit))
remove_na_result <- as.data.frame(na.omit(result))
colnames(remove_na_result) <- colnames(before)

And the final output is :

  chromsome start end line number score
1      chrI   981 990          99    25

Wish my answer is helpful.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM