Recode a variable in R if the frequency of a category is below a defined value

Question

Here is an example dataset (d):

rs3 rs4 rs5 rs6
1   0   0   0
1   0   1   0
0   0   0   0
2   0   1   0
0   0   0   0
0   2   0   1
0   2   NA  1
0   2   2   1
NA  1   2   1

To check the frequency of each SNP genotype (0, 1, 2), we can use the table command:

table(d$rs3)

The output would be:

0 1 2 
5 2 1

We want to recode genotype 2 as 1 in every variable where the frequency of genotype 2 is < 3. The recoded output should be:

rs3 rs4 rs5 rs6
1   0   0   0
1   0   1   0
0   0   0   0
1   0   1   0
0   0   0   0
0   2   0   1
0   2   NA  1
0   2   1   1
NA  1   1   1

I have 70,000 SNPs that need to be checked and recoded. How can I do that in R with a for loop or some other method?

Answer 1

Here's another possible (vectorized) solution:

indx <- colSums(d == 2, na.rm = TRUE) < 3 # Select columns by condition
d[indx][d[indx] == 2] <- 1 # Insert 1 wherever the selected columns equal 2
d
#   rs3 rs4 rs5 rs6
# 1   1   0   0   0
# 2   1   0   1   0
# 3   0   0   0   0
# 4   1   0   1   0
# 5   0   0   0   0
# 6   0   2   0   1
# 7   0   2  NA   1
# 8   0   2   1   1
# 9  NA   1   1   1
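
To try this on other genotype codes or cutoffs, the same vectorized idea can be wrapped in a small function. The following is only a sketch: the helper name `recode_rare` and its parameters `level`, `to`, and `min_count` are hypothetical and not part of the original answer, and the data frame is rebuilt here from the table in the question so the block runs on its own.

```r
# Example data frame rebuilt from the table in the question
d <- data.frame(
  rs3 = c(1, 1, 0, 2, 0, 0, 0, 0, NA),
  rs4 = c(0, 0, 0, 0, 0, 2, 2, 2, 1),
  rs5 = c(0, 1, 0, 1, 0, 0, NA, 2, 2),
  rs6 = c(0, 0, 0, 0, 0, 1, 1, 1, 1)
)

# Hypothetical helper: recode `level` to `to` in every column where
# `level` occurs fewer than `min_count` times (NA values are ignored)
recode_rare <- function(df, level = 2, to = 1, min_count = 3) {
  # Columns in which `level` is rare
  rare <- colSums(df == level, na.rm = TRUE) < min_count
  if (!any(rare)) return(df)
  # Logical matrix of cells to change; NA comparisons count as "no match"
  hit <- df[rare] == level
  hit[is.na(hit)] <- FALSE
  df[rare][hit] <- to
  df
}

out <- recode_rare(d)
out
```

With the defaults, `out` should match the recoded table in the question: rs3 and rs5 are changed, while rs4 keeps its three 2s.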

Answer 2

We can try:

d[] <- lapply(d, function(x)
    if (sum(x == 2, na.rm = TRUE) < 3) replace(x, x == 2, 1) else x)
d
#   rs3 rs4 rs5 rs6
#1   1   0   0   0
#2   1   0   1   0
#3   0   0   0   0
#4   1   0   1   0
#5   0   0   0   0
#6   0   2   0   1
#7   0   2  NA   1
#8   0   2   1   1
#9  NA   1   1   1

Or the same approach can be used with dplyr:

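library(dplyr)
# across() replaces the deprecated mutate_each()/funs() (needs dplyr >= 1.0.0)
d %>%
    mutate(across(everything(), ~ if (sum(.x == 2, na.rm = TRUE) < 3)
                  replace(.x, .x == 2, 1) else .x))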
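
Since the question also asks about a for loop, the same per-column rule could be written as an explicit loop in base R. This is a minimal sketch under the assumption that every column holds numeric genotype codes; the variable names `snp`, `x`, and `is_two` are illustrative only, and the data frame is rebuilt from the question so the block is self-contained.

```r
# Example data frame rebuilt from the table in the question
d <- data.frame(
  rs3 = c(1, 1, 0, 2, 0, 0, 0, 0, NA),
  rs4 = c(0, 0, 0, 0, 0, 2, 2, 2, 1),
  rs5 = c(0, 1, 0, 1, 0, 0, NA, 2, 2),
  rs6 = c(0, 0, 0, 0, 0, 1, 1, 1, 1)
)

for (snp in names(d)) {
  x <- d[[snp]]
  # TRUE where the genotype is 2; NA entries are treated as FALSE
  is_two <- !is.na(x) & x == 2
  # Recode only when genotype 2 appears fewer than 3 times
  if (sum(is_two) < 3) {
    x[is_two] <- 1
    d[[snp]] <- x
  }
}
d
```

The loop visits one column at a time, so it does the same work as the lapply version above; the choice between them is mostly a matter of style.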
