A function that can translate a DNA sequence to binary code
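I am designing a function that converts DNA sequences into four-dimensional binary vectors. For example, "A" -> (1,0,0,0), "G" -> (0,1,0,0), ...
We also found that the parentheses inside the for loop actually affect the result, and we would like to understand why. For example, 4-1:7-1 and (4-1):7-1 give completely different results, and we want to know the rule behind this.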
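The difference comes from R's operator precedence: `:` binds more tightly than binary `+`, `-`, `*` and `/` (see `?Syntax`). A short illustration using the expressions from the question; the variable names are only for this sketch:

```r
# `:` is evaluated before binary arithmetic operators (see ?Syntax),
# so 4-1:7-1 is parsed as 4 - (1:7) - 1.
a  <- 4-1:7-1        # 2 1 0 -1 -2 -3 -4
b  <- (4-1):7-1      # (3:7) - 1, i.e. 2 3 4 5 6
c_ <- (4-1):(7-1)    # 3:6, i.e. 3 4 5 6

# The same rule applies to an index range built inside a loop:
i <- 2
wrong <- (4*i-3):4*i     # ((4*i-3):4) * i = (5:4) * 2, i.e. 10 8
right <- (4*i-3):(4*i)   # 5 6 7 8
```

Wrapping both ends of the range in parentheses removes the ambiguity.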
NC1 <- function(data){
  for(i in 1:length(data)){
    if(i==1){
      DCfirst <- unlist(as.vector(strsplit(data[1],"",fixed = TRUE)))
      DCsecond <- matrix(0,nrow = length(data),ncol = length(DCfirst))
      DCsecond[1,] <- DCfirst
    }else{
      DCsecond[i,] <- unlist(as.vector(strsplit(data[i],"",fixed = TRUE)))
    }
  }
  return(DCsecond)
}
binary <- function(data){
  sequence_X <- NC1(data)
  N <- ncol(sequence_X)
  X2 <- matrix(NA,nrow=length(data),ncol=4*N)
  for (i in 1:N){
    L1 <- which(sequence_X[,i]=="A")
    L2 <- which(sequence_X[,i]=="G")
    L3 <- which(sequence_X[,i]=="C")
    L4 <- which(sequence_X[,i]=="U")
    for (j in L1){
      X2[j, (4*i-3):4*i-1] <- unlist(c(1,0,0,0))
    }
    for (j in L2){
      X2[j, (4*i-3):4*i-1] <- unlist(c(1,0,0,0))
    }
    for (j in L3){
      X2[j, (4*i-3):4*i-1] <- unlist(c(1,0,0,0))
    }
    for (j in L4){
      X2[j, (4*i-3):4*i-1] <- unlist(c(1,0,0,0))
    }
  }
  return(X2)
}
TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
binary(TEST)
The final result looks like this:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
[1,] NA NA NA NA 1 0 0 0 1 0 0 0 1 0 0 0 1
[2,] NA NA NA NA 1 0 0 0 1 0 0 0 1 0 0 0 1
[3,] NA NA NA NA 1 0 0 0 1 0 0 0 1 0 0 0 1
[4,] NA NA NA NA 1 0 0 0 1 0 0 0 1 0 0 0 1
[5,] NA NA NA NA 1 0 0 0 1 0 0 0 1 0 0 0 1
[,18] [,19] [,20]
[1,] 0 0 0
[2,] 0 0 0
[3,] 0 0 0
[4,] 0 0 0
[5,] 0 0 0
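I want every element of my sequences to be converted to vector form. As the result shows, apart from the first element of each sequence, none of the elements is converted to vector form correctly.
This is the correct answer I am hoping for:
This is my first time asking a question here. I am very sorry that I could not describe the problem clearly.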
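A minimal sketch of what the question appears to be after, under some assumptions: the column range is meant to be (4*i-3):(4*i), each base gets its own indicator in the order A, G, C, U, and all sequences have the same length. The helper name `one_hot` is hypothetical and not part of the original post.

```r
# Hypothetical helper (not from the original post): one-hot encode RNA strings.
# Assumes equal-length strings that contain only A, G, C and U.
one_hot <- function(data) {
  bases <- c("A", "G", "C", "U")
  chars <- do.call(rbind, strsplit(data, "", fixed = TRUE))  # one row per sequence
  n <- ncol(chars)
  out <- matrix(0, nrow = length(data), ncol = 4 * n)
  for (i in seq_len(n)) {
    cols <- (4 * i - 3):(4 * i)   # parentheses on both sides of `:`
    for (k in seq_along(bases)) {
      # set the k-th indicator column for every sequence with bases[k] at position i
      out[chars[, i] == bases[k], cols[k]] <- 1
    }
  }
  out
}

TEST <- c("ACGUC", "ACUAU", "UCGUA", "CGUCG", "UAGUG")
res <- one_hot(TEST)
dim(res)   # 5 20
```

Each block of four columns then holds exactly one 1 per row, so the first sequence starts with 1,0,0,0 for "A" followed by 0,0,1,0 for "C".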
Here is a base R option with outer and ==. We split TEST with "" and do the elementwise comparison, which gives a list of logical matrices.
f1 <- function(x, y) outer(x, y, FUN = `==`)
lapply(strsplit(TEST, ""), f1, c("A", "G", "C", "U"))
TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
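If a single matrix with four columns per position is wanted instead of a list, the logical matrices can be flattened row by row. This is a sketch on top of the answer above; the names `bases`, `lst` and `wide` are introduced here for illustration only.

```r
TEST <- c("ACGUC", "ACUAU", "UCGUA", "CGUCG", "UAGUG")
bases <- c("A", "G", "C", "U")

f1 <- function(x, y) outer(x, y, FUN = `==`)
lst <- lapply(strsplit(TEST, ""), f1, bases)   # list of 5 x 4 logical matrices

# t(m) puts the four indicator values of each position in one column, so
# as.integer() (which reads column by column) emits them position by position.
wide <- t(sapply(lst, function(m) as.integer(t(m))))
dim(wide)   # 5 20
```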
I think I would do this with lapply-style operations.
TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
# split = "" splits a string into single characters
vecDNA <- function(x){unlist(strsplit(x = x, split = ""))}
binDNA <- function(x){
  data.frame(
    code=x,
    G=as.numeric(x=="G"),
    C=as.numeric(x=="C"),
    A=as.numeric(x=="A"),
    U=as.numeric(x=="U")
  )
}
T2 <- lapply(as.list(TEST),vecDNA)
T3 <- lapply(T2, binDNA)
T3
> T3
[[1]]
code G C A U
1 A 0 0 1 0
2 C 0 1 0 0
3 G 1 0 0 0
4 U 0 0 0 1
5 C 0 1 0 0
[[2]]
code G C A U
1 A 0 0 1 0
2 C 0 1 0 0
3 U 0 0 0 1
4 A 0 0 1 0
5 U 0 0 0 1
[[3]]
code G C A U
1 U 0 0 0 1
2 C 0 1 0 0
3 G 1 0 0 0
4 U 0 0 0 1
5 A 0 0 1 0
[[4]]
code G C A U
1 C 0 1 0 0
2 G 1 0 0 0
3 U 0 0 0 1
4 C 0 1 0 0
5 G 1 0 0 0
[[5]]
code G C A U
1 U 0 0 0 1
2 A 0 0 1 0
3 G 1 0 0 0
4 U 0 0 0 1
5 G 1 0 0 0
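The per-sequence data frames can also be stacked into one long table that records which sequence and position each row came from. A sketch only: the column names `seq` and `pos` and the object `long` are illustrative and not part of the answer above.

```r
TEST <- c("ACGUC", "ACUAU", "UCGUA", "CGUCG", "UAGUG")
vecDNA <- function(x) unlist(strsplit(x, split = ""))
binDNA <- function(x) {
  data.frame(code = x,
             G = as.numeric(x == "G"), C = as.numeric(x == "C"),
             A = as.numeric(x == "A"), U = as.numeric(x == "U"))
}
T3 <- lapply(lapply(TEST, vecDNA), binDNA)

# Add the source sequence and the position to each data frame, then stack them.
long <- do.call(rbind, Map(function(d, s) cbind(seq = s, pos = seq_len(nrow(d)), d),
                           T3, TEST))
nrow(long)   # 25
```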
Here is a different approach: using stringr::str_locate_all(), I create a nested list for each sequence, with one element per encoded letter:
library(dplyr)
library(stringr)
TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
coder <- function(string) {
  lapply(c("A","G","C","U"), function(x, y) {
    tmp <- rep(FALSE, str_length(y))
    tmp[str_locate_all(y, x)[[1]][,1]] <- TRUE
    tmp
  }, y = string) %>%
    setNames(c("A","G","C","U"))
}
dat <- lapply(TEST, coder) %>%
  setNames(TEST)
You can extract a specific letter from a sequence:
dat$ACGUC$G
[1] FALSE FALSE TRUE FALSE FALSE
Or get a data frame:
dat$ACGUC %>%
bind_rows()
# A tibble: 5 x 4
A G C U
<lgl> <lgl> <lgl> <lgl>
1 TRUE FALSE FALSE FALSE
2 FALSE FALSE TRUE FALSE
3 FALSE TRUE FALSE FALSE
4 FALSE FALSE FALSE TRUE
5 FALSE FALSE TRUE FALSE
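If 0/1 values are preferred over TRUE/FALSE, the logical vectors can be converted with as.integer. The sketch below rebuilds a stand-in for dat$ACGUC in base R so that it runs without stringr or dplyr; `chars`, `flags` and `num` are illustrative names.

```r
# Stand-in for dat$ACGUC, built with base R so the sketch has no dependencies.
chars <- strsplit("ACGUC", "")[[1]]
flags <- lapply(c(A = "A", G = "G", C = "C", U = "U"), function(b) chars == b)

# One column per letter, one row per position, coded as 0/1.
num <- sapply(flags, as.integer)
dim(num)   # 5 4
```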