Find and replace conditional column value by row for a large data frame in R
All, I have a large genomic data set on which I want to do a find-and-replace based on the "REF" and "ALT" column values of each row. The replacement rules (the same ones the for loop further down applies) are:
0 becomes -/-
1 becomes REF/REF
2 becomes ALT/ALT
3 becomes REF/ALT
Here is a small subset of example data:
Chr Pos REF ALT A B C D E
1 70 A G 1 0 1 1 1
1 80 T G 1 0 3 3 3
1 100 C T 1 0 1 1 1
2 20 G A 1 0 0 0 1
2 80 C T 1 0 0 0 2
The desired output is:
Chr Pos REF ALT A B C D E
1 70 A G A/A -/- A/A A/A A/A
1 80 T G T/T -/- T/G T/G T/G
1 100 C T C/C -/- C/C C/C C/C
2 20 G A G/G -/- -/- -/- G/G
2 80 C T C/C -/- -/- -/- T/T
Reproducible data frame:
df=data.frame(
Chr=c(1,1,1,2,2),
Pos=c(70,80,100,20,80),
REF=c("A","T","C","G","C"),
ALT=c("G","G","T","A","T"),
A=c(1,1,1,1,1),
B=c(0,0,0,0,0),
C=c(1,3,1,0,0),
D=c(1,3,1,0,0),
E=c(1,3,1,1,2)
)
I wrote a for loop for the task:
K=data.frame()
for (r in 1:nrow(df)){
k=df[r,]
ks=df[r,-1:-4]
ks[ks==0]="-/-"
ks[ks==1]=paste0(k$REF,"/",k$REF)
ks[ks==2]=paste0(k$ALT,"/",k$ALT)
ks[ks==3]=paste0(k$REF,"/",k$ALT)
ks=cbind(k[,1:4],ks)
K=rbind(K,ks)
}
That works, but I have about 150,000 rows and the row-by-row operation takes a very long time. Is there a faster way to process this?
Thank you very much for the help!
We may use case_when: loop across the columns 'A' to 'E' and create the format with sprintf based on the conditions in case_when.
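library(dplyr)

df %>%
  mutate(across(A:E, ~ case_when(
    . == 0 ~ '-/-',                      # 0 becomes -/-
    . == 1 ~ sprintf('%1$s/%1$s', REF),  # 1 becomes REF/REF
    . == 2 ~ sprintf('%1$s/%1$s', ALT),  # 2 becomes ALT/ALT
    TRUE   ~ sprintf('%s/%s', REF, ALT)  # every other value (3 in this data) becomes REF/ALT
  )))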
Output:
Chr Pos REF ALT A B C D E
1 1 70 A G A/A -/- A/A A/A A/A
2 1 80 T G T/T -/- T/G T/G T/G
3 1 100 C T C/C -/- C/C C/C C/C
4 2 20 G A G/G -/- -/- -/- G/G
5 2 80 C T C/C -/- -/- -/- T/T
NOTE: As this is done by column and not by row, it should be much faster than the OP's approach.
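For comparison, the same column-wise idea can be sketched in base R without dplyr. This is only an illustration, not part of the answer above: the names lookup, sample_cols and rows are made up here, and the sketch assumes the genotype columns contain only the codes 0, 1, 2 and 3 (any other code would raise a "subscript out of bounds" error instead of falling through to REF/ALT).

```r
# Example data from the question
df <- data.frame(
  Chr = c(1, 1, 1, 2, 2),
  Pos = c(70, 80, 100, 20, 80),
  REF = c("A", "T", "C", "G", "C"),
  ALT = c("G", "G", "T", "A", "T"),
  A = c(1, 1, 1, 1, 1),
  B = c(0, 0, 0, 0, 0),
  C = c(1, 3, 1, 0, 0),
  D = c(1, 3, 1, 0, 0),
  E = c(1, 3, 1, 1, 2)
)

# Hypothetical helper matrix: one row per data row, one column per code.
# Column k holds the replacement string for code k - 1.
lookup <- cbind(
  rep("-/-", nrow(df)),         # code 0
  paste0(df$REF, "/", df$REF),  # code 1
  paste0(df$ALT, "/", df$ALT),  # code 2
  paste0(df$REF, "/", df$ALT)   # code 3
)

sample_cols <- c("A", "B", "C", "D", "E")  # assumed genotype columns
rows <- seq_len(nrow(df))

# For each genotype column, pick lookup[row, code + 1] for all rows at once
# by indexing the matrix with a two-column (row, column) index matrix.
df[sample_cols] <- lapply(
  df[sample_cols],
  function(code) lookup[cbind(rows, code + 1)]
)

print(df)
```

The replacement strings are built once per row, and each genotype column is then filled with a single vectorized matrix lookup, so no data frame is grown row by row.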