How to remove columns with more than 90% values as '0' in R
I had categorical variables, which I converted to dummy variables, and ended up with more than 2381 variables. I won't need that many variables for analysis (say, regression or correlation). I want to remove a column if more than 90% of its values are '0'. Also, is there a good metric for removing columns other than 90% of the values being '0'? Help!
This will give you a data.frame without the columns where more than 90% of the elements are 0:
df[sapply(df, function(x) mean(x == 0) <= 0.9)]
Or, more elegantly, as markus suggests:
df[colMeans(df == 0) <= 0.9]
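If the data contain missing values, colMeans(df == 0) returns NA for the affected columns and the subsetting fails. A minimal sketch of a reusable variant follows, assuming all columns are numeric; the helper name drop_mostly_zero, its threshold argument, and the toy data are hypothetical and not part of the answers here.

```r
# Hypothetical helper: drop columns whose share of zeros exceeds `threshold`.
# NAs are ignored when computing the share of zeros.
drop_mostly_zero <- function(data, threshold = 0.9) {
  zero_frac <- colMeans(data == 0, na.rm = TRUE)
  # A column that is entirely NA yields NaN; such columns are kept here
  keep <- is.na(zero_frac) | zero_frac <= threshold
  data[, keep, drop = FALSE]
}

# Toy data (made up for illustration)
toy <- data.frame(u = rep(0, 10),                         # 100% zeros
                  v = c(0, NA, 1, 0, 1, 1, 0, 1, 1, 1),   # 3 zeros out of 9 non-NA values
                  w = c(rep(0, 9), 1))                    # exactly 90% zeros

names(drop_mostly_zero(toy))        # "v" "w"
names(drop_mostly_zero(toy, 0.5))   # "v"
```

Using drop = FALSE keeps the result a data.frame even when only one column survives.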
This is easily done with colSums:
Example data:
df <- data.frame(x = c(rep(0, 9), 1),
y = c(rep(0,9), 1),
z = c(rep(0, 8), 1, 1))
> df
x y z
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 1
10 1 1 1
df[, colSums(df == 0)/nrow(df) < .9, drop = FALSE]
z
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 1
10 1
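Note that columns x and y contain exactly 90% zeros, so whether they are kept depends on the comparison operator. The sketch below reuses the example data to show the per-column fraction of zeros and the difference between a strict and a non-strict threshold; the variable name zero_frac is introduced here for illustration only.

```r
# Same example data as above
df <- data.frame(x = c(rep(0, 9), 1),
                 y = c(rep(0, 9), 1),
                 z = c(rep(0, 8), 1, 1))

# Fraction of zeros in each column
zero_frac <- colMeans(df == 0)
zero_frac
#   x   y   z
# 0.9 0.9 0.8

# Strict comparison: drops columns with 90% or more zeros
names(df)[zero_frac < 0.9]    # "z"

# Non-strict comparison: drops only columns with more than 90% zeros
names(df)[zero_frac <= 0.9]   # "x" "y" "z"
```

Since the question asks to remove columns with more than 90% zeros, the non-strict form matches its wording; the strict form is slightly more aggressive.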
Regarding the question about a useful metric: this depends heavily on what you want to analyze. Even a column with more than 90% 0 values may be useful for a regression model. I would look at the content of the variable, or use stepwise exclusion based on AIC or BIC to measure the relevance of your variables.
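As a rough illustration of the stepwise idea, the sketch below runs backward elimination with base R's step() on simulated data. The data, variable names, and effect sizes are invented for the example; in particular, the dummy rare is mostly zeros yet is given a real effect on y, which is the situation described above.

```r
set.seed(42)   # simulated data, purely for illustration
n <- 200
d <- data.frame(rare   = rbinom(n, 1, 0.05),   # dummy with roughly 95% zeros
                common = rbinom(n, 1, 0.5),
                noise  = rbinom(n, 1, 0.5))    # unrelated to y
d$y <- 3 * d$rare + 1 * d$common + rnorm(n)

# Full linear model with all dummies as predictors
full <- lm(y ~ ., data = d)

# Backward elimination by AIC (step() uses the penalty k = 2 by default)
fit_aic <- step(full, direction = "backward", trace = 0)

# Same procedure with the BIC penalty k = log(n)
fit_bic <- step(full, direction = "backward", k = log(n), trace = 0)

# Inspect which predictors each criterion retained
formula(fit_aic)
formula(fit_bic)
```

With thousands of dummy variables a full stepwise search can be slow, so this is better read as a way to check a shortlist of candidates than as a recipe for all 2381 columns at once.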
Hi, I wrote some code with the dplyr package. Here is an example of how you can get rid of columns that contain more than 90% zeros:
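library(dplyr)  # across() requires dplyr >= 1.0.0
# Simulated 0/1 columns with different shares of zeros
df <- data.frame(colA = sample(c(0, 1), 100, replace = TRUE, prob = c(0.8, 0.2)),
                 colB = sample(c(0, 1), 100, replace = TRUE, prob = c(0.99, 0.01)),
                 colC = sample(c(0, 1), 100, replace = TRUE, prob = c(0.5, 0.5)),
                 colD = sample(c(0, 1), 100, replace = TRUE, prob = c(0, 1)),
                 colE = rep(0, 100))
# Count the zeros in each column (replaces the deprecated funs() call)
zero_count <- df %>% summarise(across(everything(), ~ sum(.x == 0)))
# Keep columns in which at most 90% of the values are zero
col_filter <- unlist(zero_count) <= 0.9 * nrow(df)
df_filter <- df[, col_filter, drop = FALSE]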
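The same filter can also be written in a single step with select() and where(). This is a sketch that assumes dplyr 1.0.0 or later; the deterministic toy data below are made up so that the result does not depend on random sampling.

```r
library(dplyr)

# Deterministic toy data (hypothetical, for illustration)
df <- data.frame(colA = rep(c(0, 1), times = c(80, 20)),   # 80% zeros
                 colB = rep(c(0, 1), times = c(99, 1)),    # 99% zeros
                 colE = rep(0, 100))                       # 100% zeros

# Keep only columns whose fraction of zeros is at most 0.9
df_filter <- df %>% select(where(~ mean(.x == 0) <= 0.9))

names(df_filter)   # "colA"
```

Because select() always returns a data.frame, no drop = FALSE is needed when only one column remains.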