A function that can translate a DNA sequence to binary code
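I am designing a function that converts a DNA sequence into a binary code of four-dimensional vectors, for example "A" -> (1,0,0,0), "G" -> (0,1,0,0), ...
We also found that the parentheses inside the for loop actually affect the result, and we would like to understand why. For example, 4-1:7-1 and (4-1):7-1 give completely different results, and we want to know the reason behind this.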
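The difference between the two expressions comes from R's operator precedence: `:` binds more tightly than binary `-`, so a subtraction written next to a colon applies to the whole sequence unless parentheses say otherwise. A small sketch (the variable names are only for illustration):

```r
# `:` has higher precedence than binary `-` (see ?Syntax).
a  <- 4-1:7-1       # parsed as (4 - (1:7)) - 1
b  <- (4-1):7-1     # parsed as ((4-1):7) - 1, i.e. (3:7) - 1
c3 <- (4-1):(7-1)   # fully parenthesised: 3:6

print(a)   # 2 1 0 -1 -2 -3 -4
print(b)   # 2 3 4 5 6
print(c3)  # 3 4 5 6
```

When both ends of a range are computed, wrapping each end in its own parentheses, as in the last line, avoids the ambiguity.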
NC1 <- function(data){
  for(i in 1:length(data) ){
    if(i==1){
      DCfirst <- unlist(as.vector(strsplit(data[1],"",fixed = TRUE)))
      DCsecond <- matrix(0,nrow = length(data),ncol = length(DCfirst))
      DCsecond[1,] <- DCfirst
    }else{
      DCsecond[i,] <- unlist(as.vector(strsplit(data[i],"",fixed = TRUE)))
    }
  }
  return(DCsecond)
}
binary<- function(data){
  sequence_X<-NC1(data)
  N=ncol(sequence_X)
  X2<-matrix(NA,nrow=length(data),ncol=4*N)
  for (i in 1 : N){
    L1<-which(sequence_X[,i]=="A")
    L2<-which(sequence_X[,i]=="G")
    L3<-which(sequence_X[,i]=="C")
    L4<-which(sequence_X[,i]=="U")
    for (j in L1){
      X2[j, (4i-3):4i-1]<-unlist(c(1,0,0,0))
    }
    for (j in L2){
      X2[j, (4i-3):4i-1]<-unlist(c(1,0,0,0))
    }
    for (j in L3){
      X2[j, (4i-3):4i-1]<-unlist(c(1,0,0,0))
    }
    for (j in L4){
      X2[j, (4i-3):4i-1]<-unlist(c(1,0,0,0))
    }
  }
  return (X2)
}
TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
binary(TEST)
The final result looks like this:
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
[1,]   NA   NA   NA   NA    1    0    0    0    1     0     0     0     1     0     0     0     1
[2,]   NA   NA   NA   NA    1    0    0    0    1     0     0     0     1     0     0     0     1
[3,]   NA   NA   NA   NA    1    0    0    0    1     0     0     0     1     0     0     0     1
[4,]   NA   NA   NA   NA    1    0    0    0    1     0     0     0     1     0     0     0     1
[5,]   NA   NA   NA   NA    1    0    0    0    1     0     0     0     1     0     0     0     1
     [,18] [,19] [,20]
[1,]     0     0     0
[2,]     0     0     0
[3,]     0     0     0
[4,]     0     0     0
[5,]     0     0     0
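I want every sequence to be converted into this vector format. As the result shows, the first four columns stay NA and every other block of four is (1,0,0,0), so the letters are not encoded correctly.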
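One likely explanation for this output, offered as a sketch rather than a definitive diagnosis: in R, `4i` is not `4*i` but the complex literal 0+4i, so the index `(4i-3):4i-1` never uses the loop variable `i`. The `:` operator discards the imaginary parts (with a warning), and the trailing `-1` is applied to the whole sequence, which yields negative indices that exclude columns 1 to 4. The snippet below reproduces just that index expression on a toy one-row matrix.

```r
# `4i` is a complex literal, not 4 * i.
print(is.complex(4i))  # TRUE

# `:` drops the imaginary parts (warning suppressed here), giving -3:0;
# the trailing `- 1` then shifts the whole sequence.
idx <- suppressWarnings((4i-3):4i-1)
print(idx)  # -4 -3 -2 -1

# Negative indices mean "all columns except these", so the assignment
# recycles c(1,0,0,0) over columns 5..20 and leaves columns 1..4 as NA.
X2 <- matrix(NA, nrow = 1, ncol = 20)
X2[1, idx] <- c(1, 0, 0, 0)
print(X2)
```

Because all four inner loops assign the same vector `c(1,0,0,0)`, every letter also ends up with the same code.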
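This is the correct answer I hope to get:
This is my first time asking a question here. I am very sorry that I could not describe the problem clearly.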
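For reference, here is a minimal sketch of how the loop-based function could be repaired. It assumes the column block for position i is meant to be `(4*i-3):(4*i)`, and that the codes follow the order A, G, C, U; the question only states A and G explicitly, so the codes for C and U are an assumption. The name `binary_fixed` is hypothetical.

```r
binary_fixed <- function(data) {
  bases <- c("A", "G", "C", "U")  # assumed order: A=1000, G=0100, C=0010, U=0001
  stopifnot(length(unique(nchar(data))) == 1)  # all sequences must have equal length

  # One row per sequence, one column per position.
  chars <- do.call(rbind, strsplit(data, "", fixed = TRUE))
  N <- ncol(chars)
  X2 <- matrix(NA_integer_, nrow = length(data), ncol = 4 * N)

  for (i in seq_len(N)) {
    cols <- (4 * i - 3):(4 * i)  # explicit `*` and parentheses on both ends
    for (k in seq_along(bases)) {
      rows <- which(chars[, i] == bases[k])
      if (length(rows) == 0L) next
      code <- as.integer(seq_along(bases) == k)  # one-hot vector for base k
      # Fill the rows-by-4 block so that every selected row receives `code`.
      X2[rows, cols] <- matrix(code, nrow = length(rows), ncol = 4, byrow = TRUE)
    }
  }
  X2
}

TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
res <- binary_fixed(TEST)
res[1, ]  # 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0
```

Letters other than A, G, C and U would leave NA in their block, which makes unexpected input easy to spot.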
Here is a base R option with outer and ==. We split TEST on "" and do an elementwise comparison, which gives a list of logical matrices.
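TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
# compare every character of x with every element of y
f1 <- function(x, y) outer(x, y, FUN = `==`)
# one logical matrix per sequence: rows are positions, columns are A, G, C, U
lapply(strsplit(TEST, ""), f1, c("A", "G", "C", "U"))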
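If a single 0/1 matrix with four columns per position is wanted, as in the question, the logical matrices from outer can be flattened row by row. This is only a sketch: `encode_one` and `flat` are hypothetical names, the base order A, G, C, U is assumed, and all sequences are assumed to have the same length.

```r
TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
bases <- c("A", "G", "C", "U")  # assumed order within each block of four

# Hypothetical helper: one sequence -> integer vector of length 4 * nchar(s).
encode_one <- function(s) {
  m <- outer(strsplit(s, "", fixed = TRUE)[[1]], bases, FUN = `==`)  # positions x bases
  as.integer(t(m))  # transpose so the four flags of each position stay adjacent
}

# vapply returns one column per sequence; transpose to get one row per sequence.
flat <- t(vapply(TEST, encode_one, integer(4 * nchar(TEST[1]))))
flat["ACGUC", 1:8]  # 1 0 0 0 0 0 1 0
```

The rows of `flat` are named after the input sequences, which makes individual encodings easy to look up.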
I think I would do this with lapply-style operations.
TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
# split a sequence into single characters (an empty split pattern does this)
vecDNA <- function(x){unlist(strsplit(x = x, split = ""))}
# one 0/1 indicator column per base
binDNA <- function(x){
  data.frame(
    code=x,
    G=as.numeric(x=="G"),
    C=as.numeric(x=="C"),
    A=as.numeric(x=="A"),
    U=as.numeric(x=="U")
  )
}
T2 <- lapply(as.list(TEST),vecDNA)
T3 <- lapply(T2, binDNA)
T3
[[1]]
  code G C A U
1    A 0 0 1 0
2    C 0 1 0 0
3    G 1 0 0 0
4    U 0 0 0 1
5    C 0 1 0 0

[[2]]
  code G C A U
1    A 0 0 1 0
2    C 0 1 0 0
3    U 0 0 0 1
4    A 0 0 1 0
5    U 0 0 0 1

[[3]]
  code G C A U
1    U 0 0 0 1
2    C 0 1 0 0
3    G 1 0 0 0
4    U 0 0 0 1
5    A 0 0 1 0

[[4]]
  code G C A U
1    C 0 1 0 0
2    G 1 0 0 0
3    U 0 0 0 1
4    C 0 1 0 0
5    G 1 0 0 0

[[5]]
  code G C A U
1    U 0 0 0 1
2    A 0 0 1 0
3    G 1 0 0 0
4    U 0 0 0 1
5    G 1 0 0 0
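If one long table is easier to work with than a list, the per-sequence data frames can be stacked. The sketch below rebuilds the list with the same `binDNA` idea and then tags each table with its sequence and position; `tag_one`, `seq_id`, `pos` and `stacked` are hypothetical names introduced only for this illustration.

```r
TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")

# Same indicator columns as in the answer above.
binDNA <- function(x) {
  data.frame(
    code = x,
    G = as.numeric(x == "G"),
    C = as.numeric(x == "C"),
    A = as.numeric(x == "A"),
    U = as.numeric(x == "U")
  )
}
T3 <- lapply(strsplit(TEST, ""), binDNA)

# Hypothetical helper: add the source sequence and the position to each table.
tag_one <- function(df, id) data.frame(seq_id = id, pos = seq_len(nrow(df)), df)

# Pair each table with its sequence, then bind everything into one data frame.
stacked <- do.call(rbind, Map(tag_one, T3, TEST))
head(stacked, 3)
```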
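Here is a different approach: using stringr::str_locate_all(), I build a multi-level list with one entry per sequence, each holding one logical vector per encoded letter: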
library(dplyr)
library(stringr)
TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
coder <- function(string) {
  lapply(c("A","G","C","U"), function(x, y) {
    # start with all FALSE, then flag the positions where letter x occurs in y
    tmp <- rep(FALSE, str_length(y))
    tmp[str_locate_all(y, x)[[1]][,1]] <- TRUE
    tmp
  }, y = string) %>%
    setNames(c("A","G","C","U"))
}
dat <- lapply(TEST, coder) %>%
  setNames(TEST)
You can extract a specific letter from a sequence:
dat$ACGUC$G
[1] FALSE FALSE  TRUE FALSE FALSE
Or get a data frame:
dat$ACGUC %>%
  bind_rows()
# A tibble: 5 x 4
  A     G     C     U
  <lgl> <lgl> <lgl> <lgl>
1 TRUE  FALSE FALSE FALSE
2 FALSE FALSE TRUE  FALSE
3 FALSE TRUE  FALSE FALSE
4 FALSE FALSE FALSE TRUE
5 FALSE FALSE TRUE  FALSE
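The same nested structure can also be built without extra packages. The following is a sketch of a base R counterpart, not part of the original answer; `coder_base` and `dat_base` are hypothetical names.

```r
TEST <- c("ACGUC","ACUAU","UCGUA","CGUCG","UAGUG")
bases <- c("A", "G", "C", "U")

# Hypothetical base R counterpart of `coder`: a named list of logical vectors,
# one per base, flagging the positions where that base occurs.
coder_base <- function(string) {
  chars <- strsplit(string, "", fixed = TRUE)[[1]]
  lapply(setNames(nm = bases), function(b) chars == b)
}

dat_base <- setNames(lapply(TEST, coder_base), TEST)
dat_base$ACGUC$G               # FALSE FALSE TRUE FALSE FALSE
as.data.frame(dat_base$ACGUC)  # 5 x 4 logical data frame
```

Comparing the split characters directly with `==` avoids the pattern matching step, which is enough here because every code is a single character.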