
Using R to recode variables when a categorical value's frequency is below a defined threshold

Here is an example dataset (d):

rs3 rs4 rs5 rs6
1   0   0   0
1   0   1   0
0   0   0   0
2   0   1   0
0   0   0   0
0   2   0   1
0   2   NA  1
0   2   2   1
NA  1   2   1
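
To make the example reproducible, the table above can be entered as a data frame. This is only a sketch: the name `d` comes from the question, while storing the genotypes as numeric columns is an assumption.

```r
# Example data from the question; NA marks a missing genotype.
d <- data.frame(
  rs3 = c(1, 1, 0, 2, 0, 0, 0, 0, NA),
  rs4 = c(0, 0, 0, 0, 0, 2, 2, 2, 1),
  rs5 = c(0, 1, 0, 1, 0, 0, NA, 2, 2),
  rs6 = c(0, 0, 0, 0, 0, 1, 1, 1, 1)
)
str(d)
```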

To check the frequency of each SNP genotype (0, 1, 2), we can use the table command:

table(d$rs3)

The output would be:

0 1 2 
5 2 1
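
Checking one column at a time does not scale to many SNPs, so the same counts can be collected for every column at once. The following is a sketch that rebuilds the example data so it runs on its own; the names `counts` and `n2` are illustrative and not part of the question.

```r
# Example data from the question
d <- data.frame(
  rs3 = c(1, 1, 0, 2, 0, 0, 0, 0, NA),
  rs4 = c(0, 0, 0, 0, 0, 2, 2, 2, 1),
  rs5 = c(0, 1, 0, 1, 0, 0, NA, 2, 2),
  rs6 = c(0, 0, 0, 0, 0, 1, 1, 1, 1)
)

# Tabulate every column; fixing the factor levels keeps genotypes
# with a zero count, so all columns yield the same three rows.
counts <- sapply(d, function(x) table(factor(x, levels = 0:2)))
counts

# Number of 2s per column, ignoring missing values
n2 <- colSums(d == 2, na.rm = TRUE)
n2
```

Here `n2` shows that genotype 2 occurs fewer than 3 times in rs3, rs5 and rs6, which are the columns the recoding should affect.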

We want to recode genotype 2 as 1 in every column where genotype 2 occurs fewer than 3 times. The recoded output should be:

rs3 rs4 rs5 rs6
1   0   0   0
1   0   1   0
0   0   0   0
1   0   1   0
0   0   0   0
0   2   0   1
0   2   NA  1
0   2   1   1
NA  1   1   1

I have 70,000 SNPs that need to be checked and recoded. How can I do that in R, with a for loop or some other method?

Here's another possible (vectorized) solution:

indx <- colSums(d == 2, na.rm = TRUE) < 3  # Select columns where 2 occurs fewer than 3 times
d[indx][d[indx] == 2] <- 1                 # In those columns, replace every 2 with 1
d
#   rs3 rs4 rs5 rs6
# 1   1   0   0   0
# 2   1   0   1   0
# 3   0   0   0   0
# 4   1   0   1   0
# 5   0   0   0   0
# 6   0   2   0   1
# 7   0   2  NA   1
# 8   0   2   1   1
# 9  NA   1   1   1
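
With tens of thousands of columns, it may be worth trying the same idea on a matrix rather than a data frame. The sketch below assumes every column is numeric so that `as.matrix()` does not change the values; the names `m`, `rare` and `sub` are illustrative, and no timing comparison is claimed here.

```r
# Example data from the question
d <- data.frame(
  rs3 = c(1, 1, 0, 2, 0, 0, 0, 0, NA),
  rs4 = c(0, 0, 0, 0, 0, 2, 2, 2, 1),
  rs5 = c(0, 1, 0, 1, 0, 0, NA, 2, 2),
  rs6 = c(0, 0, 0, 0, 0, 1, 1, 1, 1)
)

m <- as.matrix(d)

# Columns in which genotype 2 occurs fewer than 3 times
rare <- colSums(m == 2, na.rm = TRUE) < 3

# Recode only inside those columns; which() skips NA entries
sub <- m[, rare, drop = FALSE]
sub[which(sub == 2)] <- 1
m[, rare] <- sub
m
```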

We can try:

d[] <- lapply(d, function(x)
    if (sum(x == 2, na.rm = TRUE) < 3) replace(x, x == 2, 1) else x)
d
#   rs3 rs4 rs5 rs6
#1   1   0   0   0
#2   1   0   1   0
#3   0   0   0   0
#4   1   0   1   0
#5   0   0   0   0
#6   0   2   0   1
#7   0   2  NA   1
#8   0   2   1   1
#9  NA   1   1   1
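
Since the question asks about a for loop, the same per-column logic can also be written explicitly. This is a sketch rather than a recommended approach; the function name `recode_rare` and its arguments `value`, `to` and `min_count` are hypothetical and were chosen here for illustration.

```r
# Recode `value` to `to` in every column where `value` occurs
# fewer than `min_count` times. Missing values are left untouched.
recode_rare <- function(d, value = 2, to = 1, min_count = 3) {
  for (j in seq_along(d)) {
    x <- d[[j]]
    hit <- which(x == value)      # positions of `value`; which() drops NA
    if (length(hit) < min_count) {
      x[hit] <- to
      d[[j]] <- x
    }
  }
  d
}

# Example data from the question
d <- data.frame(
  rs3 = c(1, 1, 0, 2, 0, 0, 0, 0, NA),
  rs4 = c(0, 0, 0, 0, 0, 2, 2, 2, 1),
  rs5 = c(0, 1, 0, 1, 0, 0, NA, 2, 2),
  rs6 = c(0, 0, 0, 0, 0, 1, 1, 1, 1)
)
res <- recode_rare(d)
res
```

Exposing the threshold as an argument makes it easy to try cut-offs other than 3 without editing the loop body.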

Or the same approach can be used with dplyr:

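library(dplyr)
# The original answer used mutate_each(funs(...)), which is deprecated
# in current dplyr; across() (dplyr >= 1.0.0) is the replacement.
d %>%
    mutate(across(everything(),
                  ~ if (sum(.x == 2, na.rm = TRUE) < 3)
                      replace(.x, .x == 2, 1) else .x))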
