Map start position in a vector to stop position in another vector

I have derived all the start and stop positions within a DNA string, each stored as a vector. I would now like to map each start position to its stop position and then use these pairs to extract the corresponding substrings from the DNA sequence. However, I am unable to loop through both vectors efficiently to achieve this, especially as they are not of the same length.

I have tried different versions of loops (for, ifelse), but I have not been able to wrap my head around a solution yet.

Here is an example of one of my several attempts at solving this problem.

new = data.frame()
for (i in start_pos){
  for (j in stop_pos){
    while (j>i){
      new[j,1]=i
      new[j,2]=j
    }
  }
}

Here is an example of my desired result, given start = c(1, 5, 7, 9, 15) and stop = c(4, 13, 20, 30, 40, 50). Ideally, the result would be a data frame of two columns mapping each start to its stop position. I only want to add rows to df where the stop value is greater than its corresponding start value (multiple start values can share the same stop value as long as this criterion is fulfilled), as shown in my example below.

i.e. first row df  = (1, 4)
     second row df = (5, 13)
     third row df  = (7, 13)
     fourth row df = (9, 13)
     fifth row df  = (15, 20)

Here is a possible tidyverse solution:

library(purrr)
library(plyr)
library(dplyr)

map2 is used to map over the values of the two vectors (start and stop) in parallel. We then make one vector out of each pair, followed by unlisting and combining the results into a data.frame object.

EDIT: With the updated condition, we can do something like:

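start1 <- c(118, 220, 255)
stop1 <- c(115, 210, 260)
# Pair each start with its stop, keeping the stop only if it is greater than the start
res <- purrr::map2(start1[1:length(stop1)], stop1, function(x, y) c(x, y[y > x]))
# Keep only the pairs that still have a stop attached
res[unlist(lapply(res, function(x) length(x) > 1))]
# [[1]]
# [1] 255 260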
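The same filter can also be written without purrr. This is only a base R sketch, assuming both vectors have the same length and are already paired position by position; the names `keep` and `filtered` are made up for illustration.

```r
start1 <- c(118, 220, 255)
stop1  <- c(115, 210, 260)

# Logical mask: TRUE where the stop lies after its paired start
keep <- stop1 > start1

# Keep only the valid pairs, as a two-column data frame
filtered <- data.frame(start = start1[keep], stop = stop1[keep])
filtered
```

Because the comparison is vectorised, no explicit loop or list handling is needed, and the result is already a data frame rather than a list.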

ORIGINAL:

plyr::ldply(purrr::map2(start[1:length(stop)], stop, function(x, y) c(x, y)), unlist) %>%
  setNames(nm = c("start", "stop")) %>%
  mutate(newCol = paste0("(", start, ",", stop, ")"))
#  start stop  newCol
#1     1    4   (1,4)
#2     5   13  (5,13)
#3    15   20 (15,20)
#4    NA   30 (NA,30)
#5    NA   40 (NA,40)
#6    NA   50 (NA,50)

Alternative: A clever way is shown by @Marius. The key is to have vectors of corresponding lengths.

plyr::ldply(purrr::map2(start, stop[1:length(start)], function(x, y) c(x, y)), unlist) %>%
  setNames(nm = c("start", "stop")) %>%
  mutate(newCol = paste0("(", start, ",", stop, ")"))
#  start stop  newCol
#1     1    4   (1,4)
#2     5   13  (5,13)
#3    15   20 (15,20)

Here's a fairly simple solution - it's probably good not to over-complicate things unless you're sure you need the extra complexity. The starts and stops already seem to be matched up; you might just have more of one than the other, so you can find the length of the shorter vector and use only that many items from start and stop:

start = c(1, 5, 15) 
stop = c(4, 13, 20, 30, 40, 50)

min_length = min(length(start), length(stop))

df = data.frame(
    start = start[1:min_length],
    stop = stop[1:min_length]
)

EDIT: After reading some of your comments here, it looks like your problem is actually more complicated than it first seemed (coming up with examples that demonstrate the level of complexity you need, without being overly complex, is always tricky). If you want to match each start with the next stop that's greater than the start, you can do:

# Slightly modified example: multiple starts
#   that can be matched with one stop
start = c(1, 5, 8)
stop = c(4, 13, 20, 30, 40, 50)

df2 = data.frame(
    start = start,
    stop = sapply(start, function(s) { min(stop[stop > s]) })
)
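If some start has no later stop, `min()` on the empty selection returns `Inf` with a warning. Below is a sketch of one way to guard against that and then extract the substrings the question asks for. The DNA string, the extra start at 51, and the names `dna`, `start_pos`, `stop_pos` and `next_stop` are all made up for illustration, and the positions are assumed to be 1-based and inclusive.

```r
# Hypothetical data: a 52-character DNA string and the question's positions,
# plus one extra start (51) that has no later stop
dna <- paste(rep("ACGT", 13), collapse = "")
start_pos <- c(1, 5, 7, 9, 15, 51)
stop_pos  <- c(4, 13, 20, 30, 40, 50)

# Smallest stop strictly greater than s, or NA when there is none
next_stop <- function(s, stops) {
  later <- stops[stops > s]
  if (length(later) == 0) NA_real_ else min(later)
}

pairs <- data.frame(
  start = start_pos,
  stop  = vapply(start_pos, next_stop, numeric(1), stops = stop_pos)
)

# Drop starts that could not be matched, then cut out each substring
pairs <- pairs[!is.na(pairs$stop), ]
pairs$seq <- substring(dna, pairs$start, pairs$stop)
pairs
```

The vectors are named `start_pos` and `stop_pos` here so that they do not share a name with R's built-in `stop()` function. `substring()` is vectorised over its start and end arguments, so no loop is needed for the extraction.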
