
How to convert single column data into two-column matrix using conditional/for loop in R

I have a single-column data frame. Example data:

1                          >PROKKA_00002 Alpha-ketoglutarate permease
2        MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT
3        QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG
4                                          >PROKKA_00003 lipoprotein
5       MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG

Each sequence of letters is associated with the ">" line above it. I need a two-column data frame with the lines starting with ">" in the first column and the respective lines of letters concatenated into one sequence in the second column. This is what I've tried so far:

 y <- matrix(0,5836,2) #empty matrix with 5836 rows and two columns
 z <- 0
 for(i in 1:nrow(df)){
   if((grepl(pattern = "^>", x = df)) == TRUE){ #tried to set the conditional "if a line starts with ">", execute code"
     z <- z + 1
     y[z,1] <- paste(df[i])
     } else{
     y[z,2] <- paste(df[i], collapse = "")
     }
 }

I would eventually convert the matrix y back to a data frame using as.data.frame, but my loop keeps failing with Error: unexpected '}' in "}". I'm also not sure if my conditional is right. Can anyone help? It would be greatly appreciated!
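On the conditional itself: in the attempt above, grepl() is applied to the whole data frame rather than to row i, df[i] selects a column rather than a row, and the else branch overwrites the stored sequence instead of appending to it. A minimal sketch of how the loop could look with those points addressed is below. The column name x and the shortened sequences are assumptions made for illustration, not taken from the original data.

```r
# Hypothetical stand-in for the asker's data: one character column named x.
df <- data.frame(
  x = c(">PROKKA_00002 Alpha-ketoglutarate permease",
        "MTESSITERGAPEL", "QLLQTAGVFAAGFL",
        ">PROKKA_00003 lipoprotein",
        "MRTIIVIASLLLT"),
  stringsAsFactors = FALSE
)

lines <- df$x                                 # work on the column as a character vector
n_headers <- sum(grepl("^>", lines))          # one output row per ">" line
y <- matrix("", nrow = n_headers, ncol = 2)   # character matrix sized from the data
z <- 0
for (i in seq_along(lines)) {
  if (grepl("^>", lines[i])) {                # test a single element, so if() sees one TRUE/FALSE
    z <- z + 1
    y[z, 1] <- lines[i]
  } else {
    y[z, 2] <- paste0(y[z, 2], lines[i])      # append to the sequence collected so far
  }
}

result <- as.data.frame(y, stringsAsFactors = FALSE)
names(result) <- c("name", "sequence")
```

This sketch assumes the first line of the data is a ">" line, so that z is at least 1 before any sequence line is stored.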

Although I would stick with packages for this, here is a base R solution.

Initialize data:

mydf <- data.frame(x=c(">PROKKA_00002 Alpha-ketoglutarate","MTESSITERGAPEL", "MTESSITERGAPEL",">PROKKA_00003 lipoprotein", "MTESSITERGAPEL"   ,"MRTIIVIASLLLT"), stringsAsFactors = FALSE)

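Process:

ind <- grep(">", mydf$x)   # row numbers of the header lines
# for each header: first and last row of its sequence lines
temp<-data.frame(ind=ind, from=ind+1, to=c((ind-1)[-1], nrow(mydf)))

seqs<-rep(NA, length(ind))
for(i in seq_along(ind)) {
  # concatenate all sequence lines that belong to header i
  seqs[i]<-paste(mydf$x[temp$from[i]:temp$to[i]], collapse="")
}

# strip the ">" from the header lines and pair them with the sequences
fastatable<-data.frame(name=gsub(">", "", mydf[ind,1]), sequence=seqs)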


> fastatable
                              name                     sequence
1 PROKKA_00002 Alpha-ketoglutarate MTESSITERGAPELMTESSITERGAPEL
2         PROKKA_00003 lipoprotein  MTESSITERGAPELMRTIIVIASLLLT
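The same grouping can also be sketched without an explicit loop. The version below is an alternative, not part of the answer above: it numbers the rows by header with cumsum() and collapses each group with tapply(). The names is_header, group and fastatable2 are made up for this example, and it assumes every header is followed by at least one sequence line.

```r
mydf <- data.frame(
  x = c(">PROKKA_00002 Alpha-ketoglutarate", "MTESSITERGAPEL", "MTESSITERGAPEL",
        ">PROKKA_00003 lipoprotein", "MTESSITERGAPEL", "MRTIIVIASLLLT"),
  stringsAsFactors = FALSE
)

is_header <- grepl("^>", mydf$x)   # TRUE on the ">" lines
group <- cumsum(is_header)         # running header count: 1 1 1 2 2 2

# Paste together the non-header lines of each group.
seqs <- tapply(mydf$x[!is_header], group[!is_header], paste, collapse = "")

fastatable2 <- data.frame(
  name = sub("^>", "", mydf$x[is_header]),
  sequence = as.vector(seqs),      # drop the array attributes left by tapply()
  stringsAsFactors = FALSE
)
```

If a header could appear with no sequence lines after it, tapply() would return fewer groups than there are headers, and the two columns would no longer line up.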

Try creating a logical index of the header rows, that is, the rows containing the target symbol ">". Then split the data on that index. The call cumsum(ind1)[!ind1] first creates a group id for each row by coercing the logical vector to numeric, then eliminates the header rows.

ind1 <- grepl(">", mydf$x)

#split data on the index created
newdf <- data.frame(mydf$x[ind1][cumsum(ind1)], mydf$x)[!ind1,]

#Add names
names(newdf) <- c("Name", "Value")
newdf
#            Name               Value
# 2 >PROKKA_00002 Alpha-ketoglutarate
# 3 >PROKKA_00002      MTESSITERGAPEL
# 5 >PROKKA_00003         lipoprotein
# 6 >PROKKA_00003       MRTIIVIASLLLT

Data

mydf <- data.frame(x=c(">PROKKA_00002","Alpha-ketoglutarate","MTESSITERGAPEL", ">PROKKA_00003", "lipoprotein"   ,"MRTIIVIASLLLT"))
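To make the indexing step easier to follow, here is a small sketch that prints the intermediate values for the data above. The variable names ids and header_per_row are introduced only for this illustration.

```r
x <- c(">PROKKA_00002", "Alpha-ketoglutarate", "MTESSITERGAPEL",
       ">PROKKA_00003", "lipoprotein", "MRTIIVIASLLLT")

ind1 <- grepl(">", x)
ind1
# TRUE FALSE FALSE TRUE FALSE FALSE

ids <- cumsum(ind1)                 # TRUE counts as 1, FALSE as 0
ids
# 1 1 1 2 2 2

header_per_row <- x[ind1][ids]      # look up the header that each row belongs to
header_per_row
# ">PROKKA_00002" three times, then ">PROKKA_00003" three times

ids[!ind1]                          # group ids of the non-header rows only
# 1 1 2 2
```

Each row gets the number of headers seen so far, so all rows under one header share an id until the next ">" line increments it.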

You can use plyr to accomplish this if you are able to assign a section number to your rows appropriately:

library(plyr)
df <- data.frame(v1=c(">PROKKA_00002 Alpha-ketoglutarate permease",
                   "MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT",
                   "QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG",
                   ">PROKKA_00003 lipoprotein",
                   "MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG"))
df$hasMark <- ifelse(grepl(">",df$v1,fixed=TRUE),1, 0)
df$section <- cumsum(df$hasMark)

t <- ddply(df, "section", function(x){
  data.frame(v2=head(x,1),v3=paste(x$v1[2:nrow(x)], collapse=''))
})

t <- subset(t, select=-c(section,v2.hasMark,v2.section)) #drop the extra columns

If you then view t, I believe this is what you were looking for in your original post.
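If plyr is not installed, the same section-number idea can be sketched in base R with split() and vapply(). This is an alternative written for illustration, not the code from the answer above; the names parts and res are made up here.

```r
v1 <- c(">PROKKA_00002 Alpha-ketoglutarate permease",
        "MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT",
        "QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG",
        ">PROKKA_00003 lipoprotein",
        "MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG")

# Section number: increases by one at every ">" line.
section <- cumsum(grepl(">", v1, fixed = TRUE))

# One character vector per section; its first element is the header line.
parts <- split(v1, section)

res <- data.frame(
  header   = vapply(parts, function(p) p[1], character(1), USE.NAMES = FALSE),
  sequence = vapply(parts, function(p) paste(p[-1], collapse = ""),
                    character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
```

Using p[-1] instead of an explicit 2:n range means a header with no sequence lines simply yields an empty string for its sequence.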
