A function that can translate DNA sequence to binary code

Question

I am designing a function that translates a DNA sequence into binary code, mapping each base to a four-dimensional vector, e.g. "A" -> (1,0,0,0), "G" -> (0,1,0,0), ...

We also found that the parentheses in the index range inside the for loop change the result, and we would like to understand why. For example, 4-1:7-1 and (4-1):7-1 give completely different results.
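The difference comes from R's operator precedence: `:` binds more tightly than binary `+` and `-` (see ?Syntax), so `4-1:7-1` is parsed as `4 - (1:7) - 1`, while `(4-1):7-1` is parsed as `(3:7) - 1`. A small check, where the names `r1`, `r2` and `r3` are chosen only for this illustration:

```r
r1 <- 4-1:7-1        # parsed as 4 - (1:7) - 1
r2 <- (4-1):7-1      # parsed as ((4-1):7) - 1, i.e. (3:7) - 1
r3 <- (4-1):(7-1)    # both ends parenthesised: 3:6, probably what was intended

print(r1)            # seven values, counting down from 2
print(r2)            # five values, counting up from 2
print(r3)
```

Parenthesising both ends of a range, as in `r3`, avoids relying on the precedence rules at all.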

NC1 <- function(data){ 
  for(i in 1:length(data) ){
    if(i==1){ 
      DCfirst <- unlist(as.vector(strsplit(data[1],"",fixed = TRUE)))
      DCsecond <- matrix(0,nrow = length(data),ncol = length(DCfirst))
      DCsecond[1,] <-  DCfirst 
    }else{
      DCsecond[i,] <- unlist(as.vector(strsplit(data[i],"",fixed = TRUE)))
    }
  }
  return(DCsecond)
}

binary<- function(data){
  sequence_X<-NC1(data)
  N=ncol(sequence_X)
  X2<-matrix(NA,nrow=length(data),ncol=4*N)
  for (i in 1 : N){
    L1<-which(sequence_X[,i]=="A")
    L2<-which(sequence_X[,i]=="G")
    L3<-which(sequence_X[,i]=="C")
    L4<-which(sequence_X[,i]=="U")
    for (j in L1){
      X2[j, (4i-3):4i-1]<-unlist(c(1,0,0,0))
    }
    for (j in L2){
      X2[j, (4i-3):4i-1]<-unlist(c(1,0,0,0))
    }
    for (j in L3){
      X2[j, (4i-3):4i-1]<-unlist(c(1,0,0,0))
    }
    for (j in L4){
      X2[j, (4i-3):4i-1]<-unlist(c(1,0,0,0))
    }
  }
    return (X2)
}

TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
binary(TEST)

The result is shown below:

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
[1,]   NA   NA   NA   NA    1    0    0    0    1     0     0     0     1     0     0     0     1
[2,]   NA   NA   NA   NA    1    0    0    0    1     0     0     0     1     0     0     0     1
[3,]   NA   NA   NA   NA    1    0    0    0    1     0     0     0     1     0     0     0     1
[4,]   NA   NA   NA   NA    1    0    0    0    1     0     0     0     1     0     0     0     1
[5,]   NA   NA   NA   NA    1    0    0    0    1     0     0     0     1     0     0     0     1
     [,18] [,19] [,20]
[1,]     0     0     0
[2,]     0     0     0
[3,]     0     0     0
[4,]     0     0     0
[5,]     0     0     0

I want every base in every sequence to be translated into its vector. As the result shows, that does not happen: columns 1 to 4 stay NA, and every other position gets the same pattern 1,0,0,0.
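A likely explanation for this output, offered as a reading of the code rather than a confirmed diagnosis: R has no implicit multiplication, so `4i` is the complex constant 4i, not `4*i`. The range `(4i-3):4i-1` then evaluates to `(-3:0) - 1`, i.e. `-4:-1` (with a warning about discarded imaginary parts), and as a negative index that drops columns 1 to 4 and recycles `c(1,0,0,0)` over the remaining 16. In addition, all four inner loops assign the same vector. A minimal corrected sketch follows; `binary_fixed` and `codes` are names made up for this example, and the A, G, C, U order follows the question:

```r
# Hypothetical corrected version; assumes all sequences have equal length
# and contain only the letters A, G, C, U.
binary_fixed <- function(data) {
  codes <- diag(4)                           # one one-hot row per base
  rownames(codes) <- c("A", "G", "C", "U")
  # character matrix: one row per sequence, one column per position
  chars <- do.call(rbind, strsplit(data, "", fixed = TRUE))
  N <- ncol(chars)
  out <- matrix(NA_real_, nrow = length(data), ncol = 4 * N)
  for (i in seq_len(N)) {
    # explicit multiplication, and parentheses around both ends of the range
    out[, (4 * i - 3):(4 * i)] <- codes[chars[, i], , drop = FALSE]
  }
  out
}

TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
res <- binary_fixed(TEST)
dim(res)       # 5 rows, 20 columns
res[1, 1:8]    # codes for "A" then "C": 1 0 0 0 0 0 1 0
```

Looking the codes up by row name replaces the four `which()` loops, so each base gets its own vector and a letter outside A, G, C, U raises an error instead of silently leaving NA.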

This is the result I hope to achieve:

This is my first time asking a question here. I am sorry if I have not conveyed the question clearly.

Answer 1

Here is an option in base R with outer and ==. We split 'TEST' by "" and then do an elementwise comparison, which gives a list of logical matrices.

f1 <- function(x, y) outer(x, y, FUN = `==`)
lapply(strsplit(TEST, ""), f1, c("A", "G", "C", "U"))

data

TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
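If a single numeric matrix with four columns per position is wanted, as in the question, the list of logical matrices can be flattened. This is only a sketch built on the `f1` approach above; the names `lst` and `flat` are assumptions introduced for the example:

```r
TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
f1 <- function(x, y) outer(x, y, FUN = `==`)
lst <- lapply(strsplit(TEST, ""), f1, c("A", "G", "C", "U"))

# Each element of lst is a positions x bases logical matrix.
# t() puts bases in rows, so as.vector() reads position by position;
# the unary + turns TRUE/FALSE into 1/0.
flat <- t(sapply(lst, function(m) +as.vector(t(m))))

dim(flat)       # 5 rows, 20 columns
flat[1, 1:8]    # codes for "A" then "C": 1 0 0 0 0 0 1 0
```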

Answer 2

I think I would do this with an lapply-style operation.

Example:

TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")

# split = "" is the documented way to split a string into single characters
vecDNA <- function(x){unlist(strsplit(x = x, split = ""))}
binDNA <- function(x){
  data.frame(
    code=x, 
    G=as.numeric(x=="G"), 
    C=as.numeric(x=="C"), 
    A=as.numeric(x=="A"), 
    U=as.numeric(x=="U")
  )
}

T2 <- lapply(as.list(TEST),vecDNA)
T3 <- lapply(T2, binDNA)
T3

Result:

> T3
[[1]]
  code G C A U
1    A 0 0 1 0
2    C 0 1 0 0
3    G 1 0 0 0
4    U 0 0 0 1
5    C 0 1 0 0

[[2]]
  code G C A U
1    A 0 0 1 0
2    C 0 1 0 0
3    U 0 0 0 1
4    A 0 0 1 0
5    U 0 0 0 1

[[3]]
  code G C A U
1    U 0 0 0 1
2    C 0 1 0 0
3    G 1 0 0 0
4    U 0 0 0 1
5    A 0 0 1 0

[[4]]
  code G C A U
1    C 0 1 0 0
2    G 1 0 0 0
3    U 0 0 0 1
4    C 0 1 0 0
5    G 1 0 0 0

[[5]]
  code G C A U
1    U 0 0 0 1
2    A 0 0 1 0
3    G 1 0 0 0
4    U 0 0 0 1
5    G 1 0 0 0

Answer 3

Here's a different approach: I created a nested list for each of your sequences, coding the letters with stringr::str_locate_all():

library(dplyr)
library(stringr)

TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")

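coder <- function(string) {
  # one logical vector per base, TRUE where that base occurs in the string
  lapply(c("A","G","C","U"), function(x, y) {
    tmp <- rep(FALSE, str_length(y))
    tmp[str_locate_all(y, x)[[1]][,1]] <- TRUE   # start positions of each match
    tmp
  }, y = string) %>%
    setNames(c("A","G","C","U"))
}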

dat <- lapply(TEST, coder) %>%
  setNames(TEST)

You can extract specific letters from a sequence with:

dat$ACGUC$G

[1] FALSE FALSE  TRUE FALSE FALSE

or a data frame with:

dat$ACGUC %>%
  bind_rows()

# A tibble: 5 x 4
  A     G     C     U    
  <lgl> <lgl> <lgl> <lgl>
1 TRUE  FALSE FALSE FALSE
2 FALSE FALSE TRUE  FALSE
3 FALSE TRUE  FALSE FALSE
4 FALSE FALSE FALSE TRUE 
5 FALSE FALSE TRUE  FALSE
