
A function that can translate a DNA sequence to binary code

I am designing a function that translates a DNA sequence into binary code, with each base encoded as a four-dimensional vector, e.g. "A" becomes (1,0,0,0), "G" becomes (0,1,0,0), and so on.

We also found that the parentheses in the index expression inside the for loop change the result, and we would like to understand why. For example, 4-1:7-1 and (4-1):7-1 give completely different results.
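The difference most likely comes from R's operator precedence: the sequence operator : is evaluated before binary - and *, so the arithmetic is applied to the whole sequence afterwards. A minimal sketch of the two expressions from the question, plus a fully parenthesised variant (the names r1, r2 and r3 are illustrative only):

```r
# ':' binds more tightly than binary '-' and '*'.
r1 <- 4-1:7-1      # parsed as (4 - (1:7)) - 1
r2 <- (4-1):7-1    # parsed as ((4-1):7) - 1, i.e. (3:7) - 1
r3 <- (4-1):(7-1)  # both ends parenthesised: the range 3 to 6

print(r1)  # 2 1 0 -1 -2 -3 -4
print(r2)  # 2 3 4 5 6
print(r3)  # 3 4 5 6

# Separately, 4i is a complex literal in R, not a shorthand for 4 * i.
print(is.complex(4i))  # TRUE
```

To index a block of columns inside a loop, both ends of the range therefore need their own parentheses, and the multiplication has to be written out, for example (4 * i - 3):(4 * i).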

NC1 <- function(data){ 
  for(i in 1:length(data) ){
    if(i==1){ 
      DCfirst <- unlist(as.vector(strsplit(data[1],"",fixed = TRUE)))
      DCsecond <- matrix(0,nrow = length(data),ncol = length(DCfirst))
      DCsecond[1,] <-  DCfirst 
    }else{
      DCsecond[i,] <- unlist(as.vector(strsplit(data[i],"",fixed = TRUE)))
    }
  }
  return(DCsecond)
}

binary<- function(data){
  sequence_X<-NC1(data)
  N=ncol(sequence_X)
  X2<-matrix(NA,nrow=length(data),ncol=4*N)
  for (i in 1 : N){
    L1<-which(sequence_X[,i]=="A")
    L2<-which(sequence_X[,i]=="G")
    L3<-which(sequence_X[,i]=="C")
    L4<-which(sequence_X[,i]=="U")
    for (j in L1){
      X2[j, (4i-3):4i-1]<-unlist(c(1,0,0,0))
    }
    for (j in L2){
      X2[j, (4i-3):4i-1]<-unlist(c(1,0,0,0))
    }
    for (j in L3){
      X2[j, (4i-3):4i-1]<-unlist(c(1,0,0,0))
    }
    for (j in L4){
      X2[j, (4i-3):4i-1]<-unlist(c(1,0,0,0))
    }
  }
    return (X2)
}

TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
binary(TEST)

The final result is shown below:

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
[1,]   NA   NA   NA   NA    1    0    0    0    1     0     0     0     1     0     0     0     1
[2,]   NA   NA   NA   NA    1    0    0    0    1     0     0     0     1     0     0     0     1
[3,]   NA   NA   NA   NA    1    0    0    0    1     0     0     0     1     0     0     0     1
[4,]   NA   NA   NA   NA    1    0    0    0    1     0     0     0     1     0     0     0     1
[5,]   NA   NA   NA   NA    1    0    0    0    1     0     0     0     1     0     0     0     1
     [,18] [,19] [,20]
[1,]     0     0     0
[2,]     0     0     0
[3,]     0     0     0
[4,]     0     0     0
[5,]     0     0     0

I would like every sequence to be fully translated into the vector format. As can be seen from the results, apart from the first element in each sequence, the elements are not fully translated to the vector format.

This is the correct answer I hope to achieve:

[image of the expected output; not included in the source]

This is my first time asking a question here. I am sorry if I have not conveyed the question clearly.
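For reference, here is one way the loop version could be repaired. It is a sketch, not the asker's final code. In the posted function, 4i is read by R as a complex literal rather than 4*i, the colon is applied before the subtraction, and all four branches assign the same vector. The sketch writes 4 * i explicitly, parenthesises both ends of the column range, and sets a different column for each base. The name binary_fixed and the codes for C and U are assumptions (the question only states A and G), and all sequences are assumed to have the same length.

```r
# Sketch of a repaired encoder; binary_fixed is a hypothetical name.
# Assumed codes: A=(1,0,0,0), G=(0,1,0,0), C=(0,0,1,0), U=(0,0,0,1).
binary_fixed <- function(data) {
  bases <- c("A", "G", "C", "U")
  # One row per sequence, one column per position (equal lengths assumed)
  chars <- do.call(rbind, strsplit(data, "", fixed = TRUE))
  N <- ncol(chars)
  X2 <- matrix(0, nrow = length(data), ncol = 4 * N)
  for (i in seq_len(N)) {
    cols <- (4 * i - 3):(4 * i)        # the four columns for position i
    for (k in seq_along(bases)) {
      rows <- which(chars[, i] == bases[k])
      X2[rows, cols[k]] <- 1           # set the single 1 for this base
    }
  }
  X2
}

TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
res <- binary_fixed(TEST)
res[1, ]  # 1 0 0 0  0 0 1 0  0 1 0 0  0 0 0 1  0 0 1 0
```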

Here is an option in base R with outer and ==. We split 'TEST' by "", then do an elementwise comparison to get a list of logical matrices.

f1 <- function(x, y) outer(x, y, FUN = `==`)
lapply(strsplit(TEST, ""), f1, c("A", "G", "C", "U"))

data

TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
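If a single 0/1 matrix with four columns per position is wanted instead of a list, the logical matrices from f1 can be transposed and flattened. This is only a sketch building on the outer approach; to_row and encoded are hypothetical names, and the column order A, G, C, U is taken from the call above.

```r
TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
bases <- c("A", "G", "C", "U")
f1 <- function(x, y) outer(x, y, FUN = `==`)

# to_row (hypothetical helper): f1() gives one row per position and one
# column per base; transposing first keeps the four flags of each
# position adjacent when the matrix is flattened column by column.
to_row <- function(chars) as.integer(t(f1(chars, bases)))

# sapply() returns one column per sequence, so transpose the result
# to get one row per sequence.
encoded <- t(sapply(strsplit(TEST, ""), to_row))
dim(encoded)  # 5 20
```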

I think I would do this in a lapply-like operation.

Example:

TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")

vecDNA <- function(x){unlist(strsplit(x = x, split = ""))}  # split = "" breaks the string into single characters
binDNA <- function(x){
  data.frame(
    code=x, 
    G=as.numeric(x=="G"), 
    C=as.numeric(x=="C"), 
    A=as.numeric(x=="A"), 
    U=as.numeric(x=="U")
  )
}

T2 <- lapply(as.list(TEST),vecDNA)
T3 <- lapply(T2, binDNA)
T3

Result:

> T3
[[1]]
  code G C A U
1    A 0 0 1 0
2    C 0 1 0 0
3    G 1 0 0 0
4    U 0 0 0 1
5    C 0 1 0 0

[[2]]
  code G C A U
1    A 0 0 1 0
2    C 0 1 0 0
3    U 0 0 0 1
4    A 0 0 1 0
5    U 0 0 0 1

[[3]]
  code G C A U
1    U 0 0 0 1
2    C 0 1 0 0
3    G 1 0 0 0
4    U 0 0 0 1
5    A 0 0 1 0

[[4]]
  code G C A U
1    C 0 1 0 0
2    G 1 0 0 0
3    U 0 0 0 1
4    C 0 1 0 0
5    G 1 0 0 0

[[5]]
  code G C A U
1    U 0 0 0 1
2    A 0 0 1 0
3    G 1 0 0 0
4    U 0 0 0 1
5    G 1 0 0 0
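If one table is more convenient than a list of data frames, the pieces can be stacked with an identifying column. This is a sketch under the same definitions as above; the sequence column and the name stacked are assumptions added for illustration.

```r
TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")

vecDNA <- function(x) unlist(strsplit(x, split = ""))
binDNA <- function(x) {
  data.frame(
    code = x,
    G = as.numeric(x == "G"),
    C = as.numeric(x == "C"),
    A = as.numeric(x == "A"),
    U = as.numeric(x == "U")
  )
}

T3 <- lapply(lapply(TEST, vecDNA), binDNA)

# Prepend the source sequence to each data frame, then bind all rows.
stacked <- do.call(
  rbind,
  Map(function(df, id) cbind(sequence = id, df), T3, TEST)
)
rownames(stacked) <- NULL
head(stacked, 5)
```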

Here's a different approach: I created a multilevel list for each of your sequences, coding the letters with stringr::str_locate_all():

library(dplyr)
library(stringr)

TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")

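coder <- function(string) {
  # For each base, build a logical vector marking where it occurs in 'string'
  lapply(c("A","G","C","U"), function(x, y) {
    tmp <- rep(FALSE, str_length(y))
    # the first column of the str_locate_all() result holds the match start positions
    tmp[str_locate_all(y, x)[[1]][,1]] <- TRUE
    tmp
  }, y = string) %>%
    setNames(c("A","G","C","U"))
}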

dat <- lapply(TEST, coder) %>%
  setNames(TEST)

You can extract specific letters from a sequence with:

dat$ACGUC$G

[1] FALSE FALSE  TRUE FALSE FALSE

or a data frame with:

dat$ACGUC %>%
  bind_rows()

# A tibble: 5 x 4
  A     G     C     U    
  <lgl> <lgl> <lgl> <lgl>
1 TRUE  FALSE FALSE FALSE
2 FALSE FALSE TRUE  FALSE
3 FALSE TRUE  FALSE FALSE
4 FALSE FALSE FALSE TRUE 
5 FALSE FALSE TRUE  FALSE
