R data frame manipulation

Suppose I have a data frame that looks like this:

#   start  end  motif
#       2    6      a
#      10   15      b
#      30   35      c
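
For reproducibility, the example above can be built as a data frame. The name `dat` is an assumption, chosen to match the code further down the thread:

```r
# Example input from the question; `dat` is an assumed name
dat <- data.frame(
  start = c(2, 10, 30),
  end   = c(6, 15, 35),
  motif = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
dat
```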

How would I create a data frame that fills in the remaining start and end locations, up to a certain number Max_end, like so:

Max_end <- 33

#   start  end  motif
#       0    2     na          # <- 0-2 is filled in because it is not in the original data frame
#       2    6      a          # <- 2-6 is in the original
#       6   10     na          # <- 6-10 is not
#      10   15      b          # <- 10-15 is
#      15   30     na          # and so on
#      30   33      c

Further, how would I calculate the distance between each start and end location and collect the results in a data frame of lengths, like this:

#   Length  motif
#        2     na
#        4      a
#        4     na
#        5      b
#       15     na
#        3      c
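
Once the filled-in table exists, each length is simply `end - start`. A minimal sketch, assuming the filled table is stored under the hypothetical name `filled` and holds the values shown above:

```r
# `filled` and `lengths_df` are hypothetical names; the values are the desired output above
filled <- data.frame(
  start = c(0, 2, 6, 10, 15, 30),
  end   = c(2, 6, 10, 15, 30, 33),
  motif = c(NA, "a", NA, "b", NA, "c"),
  stringsAsFactors = FALSE
)

# Vectorised subtraction gives one length per interval, no loop needed
lengths_df <- data.frame(Length = filled$end - filled$start,
                         motif  = filled$motif,
                         stringsAsFactors = FALSE)
lengths_df
```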

Currently this is how I am doing it, and it is very inefficient:

library(data.table)
library(stringi)

# Read the file (skipping its first line) and join the lines into one string
f <- fread('ABC.txt', header = FALSE, skip = 1)$V1
f <- paste(f, collapse = "")

motifs <- c('GATC', 'CTGCAG', 'ACCACC', 'CC(A|T)GG', 'CCAC.{8}TGA(C|T)')

# Start/end positions of every motif match, sorted by start position
v <- na.omit(data.frame(do.call(rbind, lapply(stri_locate_all_regex(f, motifs), unlist))))
v <- v[order(v[,1]),]
v2difference <- "blah"  # placeholder first element, overwritten after the loop

# Gap between each match and the previous one
for(i in 2:nrow(v)){
  if(v[i,1] > v[i-1,2]+2){v2difference[i] <- v[i,1]-v[i-1,2]-2}
}
v2difference[1] <- v[1,1]
# Gaps get odd Order values, matches get even ones, so sorting interleaves them
v2 <- data.frame(Order=seq(1, 2*nrow(v), 2),Lengths=matrix(v2difference, ncol = 1),Motifs="na")
v1 <- data.frame(Order=seq(2, 2*nrow(v), 2),Lengths=(v$end-v$start+1),Motifs=na.omit(unlist(stri_extract_all_regex(f,motifs))))
V <- data.frame(Track=1,rbind(v1,v2))
V <- V[order(V$Order),]
B <- V[,!(names(V) %in% "Order")]
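
One approach is to collect the interval endpoints from the existing start and end columns and merge them back onto the original data frame (called dat here):

# dat is the example data frame from the question
Max_end <- 33

breaks <- c(0, t(as.matrix(dat[,1:2])), Max_end)  # get endpoints
breaks <- breaks[breaks <= Max_end]               # drop endpoints past Max_end
merge(dat, data.frame(start=breaks[-length(breaks)], end=breaks[-1]), all=TRUE)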

#   start end motif
# 1     0   2  <NA>
# 2     2   6     a
# 3     6  10  <NA>
# 4    10  15     b
# 5    15  30  <NA>
# 6    30  33  <NA>
# 7    30  35     c
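
The endpoint step in the code above relies on how R flattens a matrix. A small sketch to check it in isolation, assuming the example data frame is named `dat` (`endpoints` is a hypothetical name):

```r
dat <- data.frame(start = c(2, 10, 30), end = c(6, 15, 35),
                  motif = c("a", "b", "c"), stringsAsFactors = FALSE)

# R flattens matrices column by column. After transposing, each column holds
# one row's (start, end) pair, so the result is start1, end1, start2, end2, ...
endpoints <- as.vector(t(as.matrix(dat[, 1:2])))
endpoints
# 2 6 10 15 30 35
```

Without the transpose, the vector would list all the starts first and then all the ends, and adjacent values would no longer form consecutive intervals.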

To specify both a start point and an endpoint, you could do:

Max_end <- 33
Max_start <- 10
breaks <- unique(c(Max_start, t(as.matrix(dat[,1:2])), Max_end))
breaks <- breaks[breaks <= Max_end & breaks >= Max_start]

merge(dat, data.frame(start=breaks[-length(breaks)], end=breaks[-1]), all.y=TRUE)

#   start end motif
# 1    10  15     b
# 2    15  30  <NA>
# 3    30  33  <NA>

Note: this doesn't include "c" in the shortened final interval; you would need to decide whether that value is carried over when the interval changes.
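
If the motif should be kept when an interval is shortened, one option is to clip the original intervals to the requested range before building the breaks. This is a sketch rather than part of the answer above; it assumes the intervals in `dat` do not overlap, and `clipped` and `filled` are hypothetical names:

```r
dat <- data.frame(start = c(2, 10, 30), end = c(6, 15, 35),
                  motif = c("a", "b", "c"), stringsAsFactors = FALSE)
Max_start <- 0
Max_end   <- 33

# Keep intervals that overlap [Max_start, Max_end] and trim them to that range
clipped <- dat[dat$end > Max_start & dat$start < Max_end, ]
clipped$start <- pmax(clipped$start, Max_start)
clipped$end   <- pmin(clipped$end, Max_end)

# Consecutive sorted endpoints define every interval, gaps included
breaks <- sort(unique(c(Max_start, clipped$start, clipped$end, Max_end)))
filled <- data.frame(start = breaks[-length(breaks)], end = breaks[-1])

# Left join: gap intervals get NA, clipped intervals keep their motif
filled <- merge(filled, clipped, all.x = TRUE)
filled$Length <- filled$end - filled$start
filled
```

With `Max_start <- 0` this should reproduce the table requested in the question, including the shortened 30-33 interval labelled "c", along with the lengths.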
