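How to replace values of several columns based on another column in R within each row?
I am working with a data set (30000 x 500) and need to replace some values in several columns based on the data in other columns. The catch is that the reference values change from row to row. Here is a small sample of the data set: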
#Create a data frame
df <- data.frame(SNP = c("SNP1","SNP2","SNP3","SNP4","SNP5","SNP6","SNP7","SNP8","SNP9","SNP10"),
A_allele = c("C","G","C","G","C","C","A","T","G","C"),
B_allele = c("G","A","T","A","A","G","T","A","C","A"),
alleles = c("C/G","G/A","C/T","G/A","C/A","C/G","A/T","T/A","G/C","C/A"),
line_1 = sample(c("A","B"),10, replace = TRUE),
line_2 = sample(c("A","B"),10, replace = TRUE),
line_3 = sample(c("A","B"),10, replace = TRUE),
line_4 = sample(c("A","B"),10, replace = TRUE),
line_5 = sample(c("A","B"),10, replace = TRUE),
line_6 = sample(c("A","B"),10, replace = TRUE),
line_7 = sample(c("A","B"),10, replace = TRUE),
line_8 = sample(c("A","B"),10, replace = TRUE),
line_9 = sample(c("A","B"),10, replace = TRUE),
line_10 = sample(c("A","B"),10, replace = TRUE)
)
df
head(df)
SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 line_8 line_9 line_10
1 SNP1 C G C/G B A B A B B B B B A
2 SNP2 G A G/A A B A A A B B A B A
3 SNP3 C T C/T B B A B B B A A A A
4 SNP4 G A G/A A B B A B A B B B A
5 SNP5 C A C/A B A B B B A B A B B
6 SNP6 C G C/G B A B A B A B B B B
7 SNP7 A T A/T B A A B A A B A B A
8 SNP8 T A T/A A B A B A A B B A B
9 SNP9 G C G/C B A B B B B A B A B
10 SNP10 C A C/A B B B B B A A A A A
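For each row, the A_allele and B_allele columns serve as the reference values for replacing the A or B values in the 10 line columns (line_1 to line_10). Where the value is "A", use the value from the A_allele column; where it is "B", use the value from the B_allele column.
For the example, the output should look like this: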
SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 line_8 line_9 line_10
1 SNP1 C G C/G G C G C G G G G G C
2 SNP2 G A G/A G A G G G A A G A G
3 SNP3 C T C/T T T C T T T C C C C
4 SNP4 G A G/A G A A G A G A A A G
5 SNP5 C A C/A A C A A A C A C A A
6 SNP6 C G C/G G C G C G C G G G G
7 SNP7 A T A/T T A A T A A T A T A
8 SNP8 T A T/A T A T A T T A A T A
9 SNP9 G C G/C C G C C C C G C G C
10 SNP10 C A C/A A A A A A C C C C C
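To make the rule concrete, here is a minimal sketch on a two-row toy data frame. The values in `toy` and the helper name `recode_line` are made up for illustration and are not part of the original data; the sketch only shows the intended row-wise mapping, not an efficient solution.

```r
# Toy data (hypothetical values chosen for illustration, not the question's data)
toy <- data.frame(
  A_allele = c("C", "G"),
  B_allele = c("G", "A"),
  line_1   = c("B", "A"),
  line_2   = c("A", "B"),
  stringsAsFactors = FALSE
)

# Row-wise rule: "A" -> that row's A_allele, "B" -> that row's B_allele.
# ifelse() is vectorized, so element i of x is paired with element i of a and b.
recode_line <- function(x, a, b) ifelse(x == "A", a, b)

toy$line_1 <- recode_line(toy$line_1, toy$A_allele, toy$B_allele)
toy$line_2 <- recode_line(toy$line_2, toy$A_allele, toy$B_allele)
toy
```

In row 1 (A_allele = C, B_allele = G) the "B" in line_1 becomes "G" and the "A" in line_2 becomes "C"; in row 2 (G and A) the "A" becomes "G" and the "B" becomes "A".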
Since there are about 30,000 rows, I would like the code to be as efficient as possible.
Any suggestions?
You can do:
library(tidyverse)
df %>% mutate(across(starts_with("line"), ~ifelse(. == "A", str_sub(alleles, 1, 1), str_sub(alleles, 3, 3))))
#output with df generated with set.seed(2021)
SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 line_8 line_9 line_10
1 SNP1 C G C/G C C G C C C G G C G
2 SNP2 G A G/A A A A A G G G G G G
3 SNP3 C T C/T T T C C T T T T T C
4 SNP4 G A G/A A G A A A G G A G A
5 SNP5 C A C/A C C C A C A A C C A
6 SNP6 C G C/G G C C C C C G C G G
7 SNP7 A T A/T T A T T T T T A T A
8 SNP8 T A T/A A T A T A A A T A T
9 SNP9 G C G/C C C C C C G G G C C
10 SNP10 C A C/A A C A C A C C C C A
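One caveat with the position-based approach above: it assumes every value in alleles has the form of one letter, a slash, and one letter. The sketch below uses base R `substr()` as a stand-in for `str_sub()` and compares it with splitting on "/". The third value, "AT/G", is a hypothetical multi-character allele added only to show where the two approaches would diverge.

```r
alleles <- c("C/G", "G/A", "AT/G")  # third value is hypothetical, for illustration

# Position-based extraction (characters 1 and 3, as in the answer above)
first_pos <- substr(alleles, 1, 1)
third_pos <- substr(alleles, 3, 3)

# Delimiter-based extraction: split on "/" and take the two pieces
parts   <- strsplit(alleles, "/", fixed = TRUE)
a_split <- vapply(parts, `[`, character(1), 1)
b_split <- vapply(parts, `[`, character(1), 2)

data.frame(alleles, first_pos, third_pos, a_split, b_split)
```

For single-letter alleles both approaches agree. If the data already has A_allele and B_allele columns, as in the question, using them directly avoids the parsing step altogether.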
As mentioned in the comments, if the column names do not follow a pattern:
Option 1: store them in a vector, say vars, and use it inside across.
set.seed(2021)
df <- data.frame(SNP = c("SNP1","SNP2","SNP3","SNP4","SNP5","SNP6","SNP7","SNP8","SNP9","SNP10"),
A_allele = c("C","G","C","G","C","C","A","T","G","C"),
B_allele = c("G","A","T","A","A","G","T","A","C","A"),
alleles = c("C/G","G/A","C/T","G/A","C/A","C/G","A/T","T/A","G/C","C/A"),
line_1 = sample(c("A","B"),10, replace = TRUE),
line_2 = sample(c("A","B"),10, replace = TRUE),
line_3 = sample(c("A","B"),10, replace = TRUE),
line_4 = sample(c("A","B"),10, replace = TRUE),
line_5 = sample(c("A","B"),10, replace = TRUE),
line_6 = sample(c("A","B"),10, replace = TRUE),
line_7 = sample(c("A","B"),10, replace = TRUE),
cat = sample(c("A","B"),10, replace = TRUE),
dog = sample(c("A","B"),10, replace = TRUE),
rabbit = sample(c("A","B"),10, replace = TRUE)
)
vars <- c("line_1", "line_2", "line_3", "line_4", "line_5", "line_6", "line_7", "cat", "dog", "rabbit")
# all_of(vars) avoids the tidyselect note "Using an external vector in selections is ambiguous"
# that passing bare `vars` triggers; see <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
df %>% mutate(across(.cols = all_of(vars), ~ifelse(. == "A", str_sub(alleles, 1, 1), str_sub(alleles, 3, 3))))
SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 cat dog rabbit
1 SNP1 C G C/G C C G C C C G G C G
2 SNP2 G A G/A A A A A G G G G G G
3 SNP3 C T C/T T T C C T T T T T C
4 SNP4 G A G/A A G A A A G G A G A
5 SNP5 C A C/A C C C A C A A C C A
6 SNP6 C G C/G G C C C C C G C G G
7 SNP7 A T A/T T A T T T T T A T A
8 SNP8 T A T/A A T A T A A A T A T
9 SNP9 G C G/C C C C C C G G G C C
10 SNP10 C A C/A A C A C A C C C C A
Option 2: you can also use column indices directly.
df %>% mutate(across(5:14, ~ifelse(. == "A", str_sub(alleles, 1, 1), str_sub(alleles, 3, 3))))
SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 cat dog rabbit
1 SNP1 C G C/G C C G C C C G G C G
2 SNP2 G A G/A A A A A G G G G G G
3 SNP3 C T C/T T T C C T T T T T C
4 SNP4 G A G/A A G A A A G G A G A
5 SNP5 C A C/A C C C A C A A C C A
6 SNP6 C G C/G G C C C C C G C G G
7 SNP7 A T A/T T A T T T T T A T A
8 SNP8 T A T/A A T A T A A A T A T
9 SNP9 G C G/C C C C C C G G G C C
10 SNP10 C A C/A A C A C A C C C C A
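Hard-coded positions such as 5:14 break silently if columns are added or reordered. As a hedged alternative, the sketch below derives the target columns by excluding the known identifier columns. The small `df` and the name `id_cols` are assumptions made for illustration, and the replacement itself uses base R rather than dplyr so the block has no package dependencies.

```r
# Small stand-in data frame (hypothetical values, same layout as the example)
df <- data.frame(
  SNP      = c("SNP1", "SNP2"),
  A_allele = c("C", "G"),
  B_allele = c("G", "A"),
  alleles  = c("C/G", "G/A"),
  line_1   = c("B", "A"),
  cat      = c("A", "B"),
  stringsAsFactors = FALSE
)

# Everything that is not an identifier/reference column is a genotype column
id_cols <- c("SNP", "A_allele", "B_allele", "alleles")
cols    <- setdiff(names(df), id_cols)
idx     <- match(cols, names(df))  # integer positions, if indices are needed

# Apply the same A/B rule to each selected column
df[cols] <- lapply(df[cols], function(x) ifelse(x == "A", df$A_allele, df$B_allele))
df
```

The character vector `cols` could also be passed to `across(all_of(cols), ...)` in the dplyr version.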
You can use across from dplyr together with ifelse.
library(dplyr)
df %>% mutate(across(starts_with('line'), ~ifelse(. == 'A', A_allele, B_allele)))
Or in base R with lapply:
cols <- grep('line', names(df))
df[cols] <- lapply(df[cols], function(x) ifelse(x == 'A', df$A_allele, df$B_allele))
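Since the question asks about efficiency on a 30000 x 500 data set, another option worth trying is to do the replacement on a character matrix in a single vectorized step instead of column by column. This is a sketch of a different technique from the answers above, shown on hypothetical toy data; whether it is actually faster on the real data has not been benchmarked here.

```r
# Toy data (hypothetical values, same layout as the example)
df <- data.frame(
  A_allele = c("C", "G", "C"),
  B_allele = c("G", "A", "T"),
  line_1   = c("B", "A", "B"),
  line_2   = c("A", "B", "B"),
  stringsAsFactors = FALSE
)
cols <- grep("^line", names(df))

# Reference result using the column-by-column lapply approach
expected <- df
expected[cols] <- lapply(df[cols], function(x) ifelse(x == "A", df$A_allele, df$B_allele))

# Matrix approach: row(m) gives the row index of every cell, so
# df$A_allele[r] and df$B_allele[r] line up cell by cell with m
m   <- as.matrix(df[cols])
r   <- row(m)
res <- ifelse(m == "A", df$A_allele[r], df$B_allele[r])

out <- df
out[cols] <- as.data.frame(res, stringsAsFactors = FALSE)
out
```

Both approaches should produce the same columns; comparing `out` with `expected` on a subset of the real data is a cheap sanity check before running either on the full data set.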