How to replace values of several columns based on another column in R, within each row?

I am working on a data set (30000 x 500) where I need to replace some values in columns based on data from another column. The problem is that the reference values change in each row. Here is a sub-example of the data set:

#Create a data frame
df <- data.frame(SNP = c("SNP1","SNP2","SNP3","SNP4","SNP5","SNP6","SNP7","SNP8","SNP9","SNP10"), 
                   A_allele = c("C","G","C","G","C","C","A","T","G","C"),
                   B_allele = c("G","A","T","A","A","G","T","A","C","A"),
                   alleles = c("C/G","G/A","C/T","G/A","C/A","C/G","A/T","T/A","G/C","C/A"),
                   line_1 = sample(c("A","B"),10, replace = TRUE),
                   line_2 = sample(c("A","B"),10, replace = TRUE),
                   line_3 = sample(c("A","B"),10, replace = TRUE),
                   line_4 = sample(c("A","B"),10, replace = TRUE),
                   line_5 = sample(c("A","B"),10, replace = TRUE),
                   line_6 = sample(c("A","B"),10, replace = TRUE),
                   line_7 = sample(c("A","B"),10, replace = TRUE),
                   line_8 = sample(c("A","B"),10, replace = TRUE),
                   line_9 = sample(c("A","B"),10, replace = TRUE),
                   line_10 = sample(c("A","B"),10, replace = TRUE)
                   )

df
     SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 line_8 line_9 line_10
1   SNP1        C        G     C/G      B      A      B      A      B      B      B      B      B       A
2   SNP2        G        A     G/A      A      B      A      A      A      B      B      A      B       A
3   SNP3        C        T     C/T      B      B      A      B      B      B      A      A      A       A
4   SNP4        G        A     G/A      A      B      B      A      B      A      B      B      B       A
5   SNP5        C        A     C/A      B      A      B      B      B      A      B      A      B       B
6   SNP6        C        G     C/G      B      A      B      A      B      A      B      B      B       B
7   SNP7        A        T     A/T      B      A      A      B      A      A      B      A      B       A
8   SNP8        T        A     T/A      A      B      A      B      A      A      B      B      A       B
9   SNP9        G        C     G/C      B      A      B      B      B      B      A      B      A       B
10 SNP10        C        A     C/A      B      B      B      B      B      A      A      A      A       A

For each row, the A_allele and B_allele columns serve as reference values for changing the A or B values in the 10 line columns. When there is an "A" value, use the value from column A_allele; when there is a "B" value, use the value from column B_allele.

In the example, this should work as follows:

  • Row 1: Change lines with A to C / Change lines with B to G
  • Row 2: Change lines with A to G / Change lines with B to A
  • Row 3: Change lines with A to C / Change lines with B to T
  • Row 10: The same idea.
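
The rule for a single row can be sketched in a few lines of R. This is only an illustration of the mapping described above, using the first four calls of row 1 from the example; the variable names `a_allele`, `b_allele`, `calls` and `decoded` are made up for this sketch.

```r
# Reference alleles for row 1 (SNP1) of the example
a_allele <- "C"
b_allele <- "G"

# First four line calls of row 1: line_1 .. line_4
calls <- c("B", "A", "B", "A")

# "A" takes the A_allele value, anything else takes the B_allele value
decoded <- ifelse(calls == "A", a_allele, b_allele)
decoded
# [1] "G" "C" "G" "C"
```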

The output should look something like this:

SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 line_8 line_9 line_10
1   SNP1    C   G   C/G G   C   G   C   G   G   G   G   G   C
2   SNP2    G   A   G/A G   A   G   G   G   A   A   G   A   G
3   SNP3    C   T   C/T T   T   C   T   T   T   C   C   C   C
4   SNP4    G   A   G/A G   A   A   G   A   G   A   A   A   G
5   SNP5    C   A   C/A A   C   A   A   A   C   A   C   A   A
6   SNP6    C   G   C/G G   C   G   C   G   C   G   G   G   G
7   SNP7    A   T   A/T T   A   A   T   A   A   T   A   T   A
8   SNP8    T   A   T/A T   A   T   A   T   T   A   A   T   A
9   SNP9    G   C   G/C C   G   C   C   C   C   G   C   G   C
10  SNP10   C   A   C/A A   A   A   A   A   C   C   C   C   C

As there are ~30000 rows, I would like the code to run efficiently, if possible.

Any suggestions?

You can do:

library(tidyverse)

df %>% mutate(across(starts_with("line"), ~ifelse(. == "A", str_sub(alleles, 1, 1), str_sub(alleles, 3, 3))))

#output with df generated with set.seed(2021)
     SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 line_8 line_9 line_10
1   SNP1        C        G     C/G      C      C      G      C      C      C      G      G      C       G
2   SNP2        G        A     G/A      A      A      A      A      G      G      G      G      G       G
3   SNP3        C        T     C/T      T      T      C      C      T      T      T      T      T       C
4   SNP4        G        A     G/A      A      G      A      A      A      G      G      A      G       A
5   SNP5        C        A     C/A      C      C      C      A      C      A      A      C      C       A
6   SNP6        C        G     C/G      G      C      C      C      C      C      G      C      G       G
7   SNP7        A        T     A/T      T      A      T      T      T      T      T      A      T       A
8   SNP8        T        A     T/A      A      T      A      T      A      A      A      T      A       T
9   SNP9        G        C     G/C      C      C      C      C      C      G      G      G      C       C
10 SNP10        C        A     C/A      A      C      A      C      A      C      C      C      C       A
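
Note that the code above reads both alleles out of the alleles string, so it assumes every entry has the form "X/Y" with single-character alleles that agree with A_allele and B_allele. A minimal sketch of a check for that assumption, using base `substr()` and a three-row stand-in data frame (the object names `first_ok`, `second_ok` and `alleles_consistent` are made up here):

```r
# Three-row stand-in with the same reference columns as the question
df <- data.frame(
  A_allele = c("C", "G", "C"),
  B_allele = c("G", "A", "T"),
  alleles  = c("C/G", "G/A", "C/T"),
  stringsAsFactors = FALSE
)

# Character 1 of alleles should equal A_allele, character 3 should equal B_allele
first_ok  <- substr(df$alleles, 1, 1) == df$A_allele
second_ok <- substr(df$alleles, 3, 3) == df$B_allele

alleles_consistent <- all(first_ok & second_ok)

# Row numbers that disagree (empty when everything is consistent)
which(!(first_ok & second_ok))
```

If the check fails for some rows, using A_allele and B_allele directly instead of parsing alleles avoids the problem.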

As stated in the comments, if the column names do not follow a pattern, you can (Option 1) store them in a vector, say vars, and use it inside across:

set.seed(2021)
df <- data.frame(SNP = c("SNP1","SNP2","SNP3","SNP4","SNP5","SNP6","SNP7","SNP8","SNP9","SNP10"), 
                 A_allele = c("C","G","C","G","C","C","A","T","G","C"),
                 B_allele = c("G","A","T","A","A","G","T","A","C","A"),
                 alleles = c("C/G","G/A","C/T","G/A","C/A","C/G","A/T","T/A","G/C","C/A"),
                 line_1 = sample(c("A","B"),10, replace = TRUE),
                 line_2 = sample(c("A","B"),10, replace = TRUE),
                 line_3 = sample(c("A","B"),10, replace = TRUE),
                 line_4 = sample(c("A","B"),10, replace = TRUE),
                 line_5 = sample(c("A","B"),10, replace = TRUE),
                 line_6 = sample(c("A","B"),10, replace = TRUE),
                 line_7 = sample(c("A","B"),10, replace = TRUE),
                 cat = sample(c("A","B"),10, replace = TRUE),
                 dog = sample(c("A","B"),10, replace = TRUE),
                 rabbit = sample(c("A","B"),10, replace = TRUE)
)

vars <- c("line_1", "line_2", "line_3", "line_4", "line_5", "line_6", "line_7", "cat", "dog", "rabbit")

# all_of() makes the external-vector selection explicit and avoids the
# tidyselect "external vector in selections is ambiguous" note
df %>% mutate(across(.cols = all_of(vars), ~ifelse(. == "A", str_sub(alleles, 1, 1), str_sub(alleles, 3, 3))))

     SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 cat dog rabbit
1   SNP1        C        G     C/G      C      C      G      C      C      C      G   G   C      G
2   SNP2        G        A     G/A      A      A      A      A      G      G      G   G   G      G
3   SNP3        C        T     C/T      T      T      C      C      T      T      T   T   T      C
4   SNP4        G        A     G/A      A      G      A      A      A      G      G   A   G      A
5   SNP5        C        A     C/A      C      C      C      A      C      A      A   C   C      A
6   SNP6        C        G     C/G      G      C      C      C      C      C      G   C   G      G
7   SNP7        A        T     A/T      T      A      T      T      T      T      T   A   T      A
8   SNP8        T        A     T/A      A      T      A      T      A      A      A   T   A      T
9   SNP9        G        C     G/C      C      C      C      C      C      G      G   G   C      C
10 SNP10        C        A     C/A      A      C      A      C      A      C      C   C   C      A

Option 2: you can also use column indexes directly:

df %>% mutate(across(5:14, ~ifelse(. == "A", str_sub(alleles, 1, 1), str_sub(alleles, 3, 3))))

     SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 cat dog rabbit
1   SNP1        C        G     C/G      C      C      G      C      C      C      G   G   C      G
2   SNP2        G        A     G/A      A      A      A      A      G      G      G   G   G      G
3   SNP3        C        T     C/T      T      T      C      C      T      T      T   T   T      C
4   SNP4        G        A     G/A      A      G      A      A      A      G      G   A   G      A
5   SNP5        C        A     C/A      C      C      C      A      C      A      A   C   C      A
6   SNP6        C        G     C/G      G      C      C      C      C      C      G   C   G      G
7   SNP7        A        T     A/T      T      A      T      T      T      T      T   A   T      A
8   SNP8        T        A     T/A      A      T      A      T      A      A      A   T   A      T
9   SNP9        G        C     G/C      C      C      C      C      C      G      G   G   C      C
10 SNP10        C        A     C/A      A      C      A      C      A      C      C   C   C      A
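
Numeric positions such as 5:14 break silently if columns are added or reordered. As a possible alternative, tidyselect also accepts a range of column names, assuming the target columns are contiguous. The sketch below uses a cut-down version of the data frame above and takes the replacement values from A_allele and B_allele directly; `decoded` is a name chosen for this example.

```r
library(dplyr)

set.seed(2021)
df <- data.frame(
  SNP      = c("SNP1", "SNP2", "SNP3", "SNP4"),
  A_allele = c("C", "G", "C", "G"),
  B_allele = c("G", "A", "T", "A"),
  alleles  = c("C/G", "G/A", "C/T", "G/A"),
  line_1   = sample(c("A", "B"), 4, replace = TRUE),
  cat      = sample(c("A", "B"), 4, replace = TRUE),
  dog      = sample(c("A", "B"), 4, replace = TRUE),
  rabbit   = sample(c("A", "B"), 4, replace = TRUE),
  stringsAsFactors = FALSE
)

# line_1:rabbit selects every column from line_1 through rabbit by name
decoded <- df %>%
  mutate(across(line_1:rabbit, ~ ifelse(.x == "A", A_allele, B_allele)))

decoded
```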

You can use across in dplyr along with ifelse.

library(dplyr)
df %>% mutate(across(starts_with('line'), ~ifelse(. == 'A', A_allele, B_allele)))

Or lapply in base R:

cols <- grep('line', names(df))
df[cols] <- lapply(df[cols], function(x) ifelse(x == 'A', df$A_allele, df$B_allele))
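
For a table of roughly 30000 x 500, another option that may be worth timing is to handle all line columns at once as a character matrix rather than column by column. This is a sketch, not a benchmarked recommendation. It assumes the line columns contain only "A" or "B" (no NA values), and the names `line_cols`, `calls`, `is_a` and `decoded` are made up for this example.

```r
# Small stand-in for the real data (values are illustrative only)
set.seed(2021)
df <- data.frame(
  SNP      = paste0("SNP", 1:5),
  A_allele = c("C", "G", "C", "G", "C"),
  B_allele = c("G", "A", "T", "A", "A"),
  line_1   = sample(c("A", "B"), 5, replace = TRUE),
  line_2   = sample(c("A", "B"), 5, replace = TRUE),
  line_3   = sample(c("A", "B"), 5, replace = TRUE),
  stringsAsFactors = FALSE
)

line_cols <- grep("^line_", names(df))  # positions of the line columns
calls <- as.matrix(df[line_cols])       # character matrix of "A"/"B" calls
is_a  <- calls == "A"                   # computed once, before any overwrite

# row(calls) gives the row index of every cell, so each cell
# looks up the reference allele of its own row
decoded <- calls
decoded[is_a]  <- df$A_allele[row(calls)[is_a]]
decoded[!is_a] <- df$B_allele[row(calls)[!is_a]]

df[line_cols] <- as.data.frame(decoded, stringsAsFactors = FALSE)
df
```

Computing `is_a` before any assignment matters because "A" is also a valid allele, so a decoded cell could otherwise be mistaken for an undecoded "A" call.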
