How to extract values from a column into the dataframe by matching two other columns in R
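I have a matrix called mymat (its dim is about 446664 x 234). It has REF and ALT columns, each of which holds exactly one of the letters A, T, G, C. I want to replace the values in the columns whose names end in .GT with these letters. The rule is: a 0 should be replaced with the letter from the REF column, and a 1 with the letter from the ALT column. An NA should be replaced with "0 0" (zero, space, zero). Finally, I need to transpose all the .GT columns, as shown in the result below, where everything is separated by spaces.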
mymat<-structure(c("G", "A", "C", "A", "G", "A", "C", "T", "G", "A",
"1/1", "0/0", "0/0", "NA", "NA", "0,15", "8,0", "8,0", "NA",
"NA", "1/1", "0/1", "0/0", "NA", "NA", "0,35", "12,12", "15,0",
"NA", "NA"), .Dim = 5:6, .Dimnames = list(c("chrX:133511988:133511988:G:A:snp",
"chrX:133528116:133528116:A:C:snp", "chrX:133528186:133528186:C:T:snp",
"chrX:133560301:133560301:A:G:snp", "chrX:133561242:133561242:G:A:snp"
), c("REF", "ALT", "02688.GT", "02688.AD", "02689.GT", "02689.AD"
)))
Result
02688.GT A A A A C C 0 0 0 0
02689.GT A A A C C C 0 0 0 0
You could try:
library(dplyr)
library(stringi)
## convert to data.frame
data.frame(mymat, check.names = FALSE) %>%
  ## replace the values ("0", "1", "/", "NA") in all columns ending with ".GT" with
  ## the corresponding values in "REF" and "ALT" (" " for "/" and "0 0" for "NA")
  mutate_each(funs(stri_replace_all(., REF, fixed = "0")), ends_with(".GT")) %>%
  mutate_each(funs(stri_replace_all(., ALT, fixed = "1")), ends_with(".GT")) %>%
  mutate_each(funs(stri_replace_all(., " ", fixed = "/")), ends_with(".GT")) %>%
  mutate_each(funs(stri_replace_all(., "0 0", fixed = "NA")), ends_with(".GT")) %>%
  ## keep only the columns ending with ".GT"
  select(ends_with(".GT")) %>%
  ## transpose the results
  t()
This gives:
[,1] [,2] [,3] [,4] [,5]
02688.GT "A A" "A A" "C C" "0 0" "0 0"
02689.GT "A A" "A C" "C C" "0 0" "0 0"
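Note that mutate_each() and funs() are deprecated in current dplyr releases. As a hedge against that, here is a minimal base R sketch of the same four replacements. It is an illustration, not the answerer's code: replace_per_row is a hypothetical helper name, and the matrix is a cut-down copy of the question's data containing only the columns used here.

```r
# Cut-down copy of the question's matrix (REF, ALT and the two .GT columns)
mymat <- cbind(
  REF = c("G", "A", "C", "A", "G"),
  ALT = c("A", "C", "T", "G", "A"),
  `02688.GT` = c("1/1", "0/0", "0/0", "NA", "NA"),
  `02689.GT` = c("1/1", "0/1", "0/0", "NA", "NA")
)

# Hypothetical helper: fixed-string replacement where the replacement
# differs for every element of x (what stri_replace_all does above)
replace_per_row <- function(x, pattern, replacement) {
  mapply(gsub, pattern = pattern, replacement = replacement, x = x,
         MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)
}

gt_cols <- grep("\\.GT$", colnames(mymat), value = TRUE)
ref <- mymat[, "REF"]
alt <- mymat[, "ALT"]

decoded <- sapply(gt_cols, function(col) {
  x <- mymat[, col]
  x <- replace_per_row(x, "0", ref)     # 0 -> letter in REF
  x <- replace_per_row(x, "1", alt)     # 1 -> letter in ALT
  x <- gsub("/", " ", x, fixed = TRUE)  # "/" -> space
  gsub("NA", "0 0", x, fixed = TRUE)    # "NA" -> "0 0" (must come last)
})

# One row per .GT column, as in the output above
result <- t(decoded)
print(result)
```

The order of the steps matters: replacing "NA" with "0 0" before the first step would cause the new zeros to be swapped for REF letters.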
I am posting my own answer, but it is really slow, so it needs further optimization.
# NOTE: all.samples is not defined in this post; it is assumed to hold the sample IDs without the ".GT" suffix
# Concatenate the REF and ALT letters so that the genotype numbers can be used as an index into them
letters <- strsplit(paste(mymat[,"REF"],mymat[,"ALT"],sep=","),",")
# Select the .GT columns and transpose them: one row per sample, one column per variant
values <- t(mymat[,c(which(colnames(mymat)%in%lapply(all.samples,function(x)(paste(x,"GT",sep=".")))))])
nbval <- ncol(values) # total number of columns of values
# Prepare the two temporary vectors used below
chars <- vector("character",2)
ret <- vector("character",nbval)
# Loop over the rows (and transpose the result)
mydata<-t(sapply(rownames(values),
  function(x) {
    indexes <- strsplit(values[x,],"/") # list with pairs of indexes
    for(i in 1:nbval) { # loop over the columns
      for (j in 1:2) { # loop over the pair
        # '0' if the genotype is "NA", otherwise the letter at the matching index for this position
        chars[j] <- ifelse(indexes[i] == "NA", 0,letters[[i]][as.integer(indexes[[i]][j])+1])
      }
      ret[i] <- paste(chars[1],chars[2], sep=" ") # concatenate the two characters
    }
    return(ret) # result for this row
  }
))
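Much of the slowness probably comes from the nested R-level loops. A minimal vectorized sketch follows, assuming every genotype is biallelic and written with single digits as "a/b" (or as the string "NA"), as in the sample data. decode_gt is a hypothetical helper name, and the matrix is a cut-down copy of the question's data; no timing on the full 446664-row matrix is claimed here.

```r
# Cut-down copy of the question's matrix (REF, ALT and the two .GT columns)
mymat <- cbind(
  REF = c("G", "A", "C", "A", "G"),
  ALT = c("A", "C", "T", "G", "A"),
  `02688.GT` = c("1/1", "0/0", "0/0", "NA", "NA"),
  `02689.GT` = c("1/1", "0/1", "0/0", "NA", "NA")
)

# Hypothetical helper: decode a whole genotype column at once.
# Assumes the "a/b" layout, so the alleles sit at characters 1 and 3.
decode_gt <- function(gt, ref, alt) {
  a1 <- substr(gt, 1, 1)
  a2 <- substr(gt, 3, 3)
  first  <- ifelse(a1 == "0", ref, alt)   # 0 -> REF, otherwise ALT
  second <- ifelse(a2 == "0", ref, alt)
  # Missing genotypes (real NA or the string "NA") become "0 0"
  ifelse(is.na(gt) | gt == "NA", "0 0", paste(first, second))
}

gt_cols <- grep("\\.GT$", colnames(mymat), value = TRUE)

# One row per sample, one column per variant
mydata <- t(sapply(gt_cols, function(col) {
  decode_gt(mymat[, col], mymat[, "REF"], mymat[, "ALT"])
}))
print(mydata)
```

The only remaining loop runs over the .GT columns (a few hundred at most), while the work on the rows is done by vectorized calls.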
So this is only a partial answer, and I don't know how well it will handle > 200000 rows. But maybe someone smarter will figure out how to do it better.
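# Split the genotypes of the first .GT column (column 3) into single allele indexes
temp1 = strsplit(mymat[,3],"/")
reps = sapply(temp1,length)
# Repeat REF and ALT once per allele; the ZERO column covers missing genotypes
refalt = data.frame(REF = rep(mymat[,1],times=reps),ALT = rep(mymat[,2],times=reps),ZERO = "0 0")
GT1 = unlist(temp1)
GT1[GT1=="NA"] = "2" # send "NA" to the ZERO column
GT1 = as.numeric(GT1)+1 # 0/1/2 -> column 1/2/3
# Pick one cell per allele and join everything with single spaces
paste(refalt[cbind(seq_along(GT1),GT1)],collapse=" ")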
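It is incomplete, because it still needs to be wrapped in a function that can be passed to apply() or lapply(), and the variable name has to be captured at the start of each line.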
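One way the missing wrapper could look is sketched below. It is an assumption about what the answerer had in mind, not their code: gt_to_line is a hypothetical name, a character matrix replaces the data.frame lookup, and the matrix is a cut-down copy of the question's data.

```r
# Cut-down copy of the question's matrix (REF, ALT and the two .GT columns)
mymat <- cbind(
  REF = c("G", "A", "C", "A", "G"),
  ALT = c("A", "C", "T", "G", "A"),
  `02688.GT` = c("1/1", "0/0", "0/0", "NA", "NA"),
  `02689.GT` = c("1/1", "0/1", "0/0", "NA", "NA")
)

# Hypothetical wrapper around the lookup idea: one output string per column
gt_to_line <- function(gt, ref, alt) {
  parts <- strsplit(gt, "/", fixed = TRUE)
  reps <- lengths(parts)                    # 2 alleles, or 1 for "NA"
  lookup <- cbind(REF = rep(ref, times = reps),
                  ALT = rep(alt, times = reps),
                  ZERO = "0 0")
  idx <- unlist(parts)
  idx[idx == "NA"] <- "2"                   # send "NA" to the ZERO column
  idx <- as.integer(idx) + 1L               # 0/1/2 -> column 1/2/3
  paste(lookup[cbind(seq_along(idx), idx)], collapse = " ")
}

gt_cols <- grep("\\.GT$", colnames(mymat), value = TRUE)

# Prefix each decoded line with its column name
out_lines <- vapply(gt_cols, function(col) {
  paste(col, gt_to_line(mymat[, col], mymat[, "REF"], mymat[, "ALT"]))
}, character(1))

cat(out_lines, sep = "\n")
# 02688.GT A A A A C C 0 0 0 0
# 02689.GT A A A C C C 0 0 0 0
```

Selecting the columns by the ".GT" suffix rather than by position avoids hard-coding column 3, and writeLines(out_lines, path) would send the same lines to a file.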