I am designing a function that translates a DNA sequence into binary code, using a four-dimensional vector for each base, e.g. "A" -> (1,0,0,0), "G" -> (0,1,0,0), ...
We also found that the parentheses in the index expression inside the for loop influence the result, and we would like to understand why. For example, 4-1:7-1 and (4-1):7-1 give completely different results.
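As a small illustration of the parentheses question: in R the `:` operator binds more tightly than binary `+` and `-`, so the two expressions are parsed differently. The variable names `x1`, `x2` and `x3` below are only for illustration.

```r
# `:` has higher precedence than binary minus in R.
x1 <- 4 - 1:7 - 1      # parsed as (4 - (1:7)) - 1
x2 <- (4 - 1):7 - 1    # parsed as ((4 - 1):7) - 1
x3 <- (4 - 1):(7 - 1)  # both endpoints parenthesised explicitly

print(x1)  # 2 1 0 -1 -2 -3 -4
print(x2)  # 2 3 4 5 6
print(x3)  # 3 4 5 6
```

When a range is built from arithmetic, wrapping both endpoints in parentheses, as in `x3`, avoids this surprise.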
NC1 <- function(data){
  for(i in 1:length(data)){
    if(i==1){
      DCfirst <- unlist(as.vector(strsplit(data[1],"",fixed = TRUE)))
      DCsecond <- matrix(0,nrow = length(data),ncol = length(DCfirst))
      DCsecond[1,] <- DCfirst
    }else{
      DCsecond[i,] <- unlist(as.vector(strsplit(data[i],"",fixed = TRUE)))
    }
  }
  return(DCsecond)
}
binary <- function(data){
  sequence_X <- NC1(data)
  N = ncol(sequence_X)
  X2 <- matrix(NA,nrow=length(data),ncol=4*N)
  for (i in 1:N){
    L1 <- which(sequence_X[,i]=="A")
    L2 <- which(sequence_X[,i]=="G")
    L3 <- which(sequence_X[,i]=="C")
    L4 <- which(sequence_X[,i]=="U")
    for (j in L1){
      X2[j, (4i-3):4i-1] <- unlist(c(1,0,0,0))
    }
    for (j in L2){
      X2[j, (4i-3):4i-1] <- unlist(c(1,0,0,0))
    }
    for (j in L3){
      X2[j, (4i-3):4i-1] <- unlist(c(1,0,0,0))
    }
    for (j in L4){
      X2[j, (4i-3):4i-1] <- unlist(c(1,0,0,0))
    }
  }
  return(X2)
}
TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
binary(TEST)
The final result is shown below:
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
[1,]   NA   NA   NA   NA    1    0    0    0    1     0     0     0     1     0     0     0     1
[2,]   NA   NA   NA   NA    1    0    0    0    1     0     0     0     1     0     0     0     1
[3,]   NA   NA   NA   NA    1    0    0    0    1     0     0     0     1     0     0     0     1
[4,]   NA   NA   NA   NA    1    0    0    0    1     0     0     0     1     0     0     0     1
[5,]   NA   NA   NA   NA    1    0    0    0    1     0     0     0     1     0     0     0     1
     [,18] [,19] [,20]
[1,]     0     0     0
[2,]     0     0     0
[3,]     0     0     0
[4,]     0     0     0
[5,]     0     0     0
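One likely reading of this output is that R parses `4i` as a complex literal, not as `4*i`. The `:` operator then discards the imaginary parts with a warning, and the index collapses to negative numbers that drop columns 1 to 4. The sketch below reproduces that index in isolation. `idx` and `m` are illustrative names, and the warning is suppressed only to keep the demonstration quiet.

```r
# `4i` is the complex number 0+4i, so the loop variable i is never used here.
# `:` keeps only the real parts: (-3):0, and the trailing "- 1" shifts it.
idx <- suppressWarnings((4i - 3):4i - 1)
print(idx)  # -4 -3 -2 -1

# Negative indices mean "every column except these", so the assignment hits
# columns 5..20 and the length-4 value is recycled across those 16 columns.
m <- matrix(NA, nrow = 1, ncol = 20)
m[1, idx] <- c(1, 0, 0, 0)
print(m)
```

Because the index does not depend on `i` and every loop assigns the same vector, each row ends up with the same pattern.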
I hope every base in my sequences can be translated to the vector format. As the results show, this does not happen: the first four columns stay NA, and every remaining position is coded as 1 0 0 0 regardless of the base.
This is the correct answer I hope to achieve:
This is my first time asking a question here, so I apologize if I have not conveyed it clearly.
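A minimal corrected sketch of the encoding, under some assumptions: the function name `one_hot_rna` is hypothetical, all sequences are assumed to have the same length, and the codes for "C" and "U" are assumed to continue the A, G, C, U order used in the question, which only spells out "A" and "G". The key changes are writing `4 * i` explicitly, parenthesising both ends of the column range, and giving each base its own column.

```r
# Hypothetical helper: one row per sequence, four 0/1 columns per position.
one_hot_rna <- function(seqs, alphabet = c("A", "G", "C", "U")) {
  # Character matrix: rows are sequences, columns are positions.
  chars <- do.call(rbind, strsplit(seqs, "", fixed = TRUE))
  n <- ncol(chars)
  out <- matrix(0, nrow = length(seqs), ncol = 4 * n)
  for (i in seq_len(n)) {
    cols <- (4 * i - 3):(4 * i)  # the four columns belonging to position i
    for (k in seq_along(alphabet)) {
      rows <- which(chars[, i] == alphabet[k])
      out[rows, cols[k]] <- 1    # k-th base gets a 1 in the k-th of the four columns
    }
  }
  out
}

TEST <- c("ACGUC", "ACUAU", "UCGUA", "CGUCG", "UAGUG")
res <- one_hot_rna(TEST)
print(res[1, 1:8])  # "A" then "C": 1 0 0 0 0 0 1 0
```

Starting from a matrix of zeros instead of NA means positions with unexpected characters would show up as four zeros.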
Here is an option in base R with outer and ==. We split 'TEST' by "", then do the elementwise comparison to give a list of logical matrices.
TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
f1 <- function(x, y) outer(x, y, FUN = `==`)
lapply(strsplit(TEST, ""), f1, c("A", "G", "C", "U"))
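If a single 0/1 matrix with four columns per position is wanted, as in the layout the question describes, the list of logical matrices could be flattened roughly as follows. The names `mats` and `flat` are illustrative, and the sketch assumes all sequences have the same length.

```r
TEST <- c("ACGUC", "ACUAU", "UCGUA", "CGUCG", "UAGUG")
f1 <- function(x, y) outer(x, y, FUN = `==`)
mats <- lapply(strsplit(TEST, ""), f1, c("A", "G", "C", "U"))

# Each element of mats has one row per position and one column per letter.
# Transposing before flattening keeps the four values of a position together.
flat <- t(sapply(mats, function(m) as.integer(t(m))))

dim(flat)     # 5 rows (sequences), 20 columns (5 positions x 4 letters)
flat[1, 1:8]  # "A" then "C": 1 0 0 0 0 0 1 0
```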
I think I would do this with an lapply-like operation.
TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
# split a sequence into single characters
vecDNA <- function(x){unlist(strsplit(x = x, split = ""))}
# one 0/1 indicator column per base
binDNA <- function(x){
  data.frame(
    code=x,
    G=as.numeric(x=="G"),
    C=as.numeric(x=="C"),
    A=as.numeric(x=="A"),
    U=as.numeric(x=="U")
  )
}
T2 <- lapply(as.list(TEST),vecDNA)
T3 <- lapply(T2, binDNA)
T3
[[1]]
  code G C A U
1    A 0 0 1 0
2    C 0 1 0 0
3    G 1 0 0 0
4    U 0 0 0 1
5    C 0 1 0 0
[[2]]
  code G C A U
1    A 0 0 1 0
2    C 0 1 0 0
3    U 0 0 0 1
4    A 0 0 1 0
5    U 0 0 0 1
[[3]]
  code G C A U
1    U 0 0 0 1
2    C 0 1 0 0
3    G 1 0 0 0
4    U 0 0 0 1
5    A 0 0 1 0
[[4]]
  code G C A U
1    C 0 1 0 0
2    G 1 0 0 0
3    U 0 0 0 1
4    C 0 1 0 0
5    G 1 0 0 0
[[5]]
  code G C A U
1    U 0 0 0 1
2    A 0 0 1 0
3    G 1 0 0 0
4    U 0 0 0 1
5    G 1 0 0 0
Here's a different approach: I created a multilevel list for each of your sequences, coding the letters with stringr::str_locate_all():
library(dplyr)
library(stringr)
TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
coder <- function(string) {
  lapply(c("A","G","C","U"), function(x, y) {
    tmp <- rep(FALSE, str_length(y))
    tmp[str_locate_all(y, x)[[1]][,1]] <- TRUE
    tmp
  }, y = string) %>%
    setNames(c("A","G","C","U"))
}
dat <- lapply(TEST, coder) %>%
  setNames(TEST)
You can extract specific letters from a sequence with:
dat$ACGUC$G
[1] FALSE FALSE TRUE FALSE FALSE
or a data frame with:
dat$ACGUC %>%
  bind_rows()
# A tibble: 5 x 4
  A     G     C     U
  <lgl> <lgl> <lgl> <lgl>
1 TRUE  FALSE FALSE FALSE
2 FALSE FALSE TRUE  FALSE
3 FALSE TRUE  FALSE FALSE
4 FALSE FALSE FALSE TRUE
5 FALSE FALSE TRUE  FALSE