How to remove columns in R where more than 90% of the values are '0'

Question

I had categorical variables, which I converted to dummy variables, and ended up with over 2381 variables. I won't need that many variables for analysis (say, regression or correlation). I want to remove a column if more than 90% of its values are '0'. Also, is there a better criterion for removing columns than 90% of the values being '0'? Help!

Answer 1

This will give you a data.frame without the columns where more than 90% of the elements are 0:

df[sapply(df, function(x) mean(x == 0) <= 0.9)]

Or, more elegantly, as markus suggests:

df[colMeans(df == 0) <= 0.9]
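Both one-liners rely on the same idea: `df == 0` yields a logical matrix, and the mean of a logical vector is the proportion of TRUE values, so `colMeans(df == 0)` is the share of zeros in each column. Here is a minimal sketch on a made-up data frame (the name `toy` and its values are assumptions for illustration). The `na.rm = TRUE` argument is an optional addition that only matters if the columns contain missing values.

```r
# Hypothetical toy data: a is 95% zeros, b is 50% zeros, c is 80% zeros
toy <- data.frame(a = c(rep(0, 19), 1),
                  b = rep(c(0, 1), 10),
                  c = c(rep(0, 16), 1, 1, 1, 1))

# Proportion of zeros in each column (NAs, if any, are ignored)
zero_share <- colMeans(toy == 0, na.rm = TRUE)

# Keep columns with at most 90% zeros; drop = FALSE keeps a data.frame
# even when only one column survives
kept <- toy[, zero_share <= 0.9, drop = FALSE]
names(kept)
```

Only column `a` exceeds the 90% threshold here, so it is the one that gets dropped.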

Answer 2

This is easily done with colSums:

Example data:

df <- data.frame(x = c(rep(0, 9), 1),
                 y = c(rep(0,9), 1),
                 z = c(rep(0, 8), 1, 1))

> df
   x y z
1  0 0 0
2  0 0 0
3  0 0 0
4  0 0 0
5  0 0 0
6  0 0 0
7  0 0 0
8  0 0 0
9  0 0 1
10 1 1 1

df[, colSums(df == 0)/nrow(df) < .9, drop = FALSE]
   z
1  0
2  0
3  0
4  0
5  0
6  0
7  0
8  0
9  1
10 1

Regarding the question about a useful metric: this depends heavily on what you want to analyze. Even a column with more than 90% zeros may be useful in a regression model. I would look at the content of the variable, or use stepwise exclusion based on AIC or BIC to measure the relevance of your variables.
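As a rough sketch of the stepwise idea, base R's `step()` can run backward elimination on a fitted `lm`. Its default penalty `k = 2` corresponds to AIC, and `k = log(n)` gives BIC. The simulated data, the seed, and the names `sim`, `d1` to `d3`, `fit_aic` and `fit_bic` are all assumptions made up for this illustration. The sparse dummy `d1` is built to drive the response, to show that a mostly-zero column can still carry signal.

```r
set.seed(1)  # arbitrary seed so the simulated data is reproducible

# Hypothetical data: three 0/1 dummy predictors, two of them about 95% zeros
n <- 200
sim <- data.frame(d1 = rbinom(n, 1, 0.05),
                  d2 = rbinom(n, 1, 0.05),
                  d3 = rbinom(n, 1, 0.30))
sim$y <- 2 * sim$d1 + rnorm(n)  # by construction, only d1 affects y

full <- lm(y ~ ., data = sim)

# Backward elimination by AIC (default penalty k = 2)
fit_aic <- step(full, direction = "backward", trace = 0)

# Same procedure with the BIC penalty, k = log(n)
fit_bic <- step(full, direction = "backward", trace = 0, k = log(n))

formula(fit_aic)
formula(fit_bic)
```

With 2381 candidate columns a full stepwise search may be slow, so this is best read as an illustration of the mechanics rather than a recipe for the full data set.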

Answer 3

Hi, I wrote some code with the dplyr package. Here is an example of how you can get rid of columns that contain more than 90% zeros:

library(dplyr)

# Example data: each column has a different expected share of zeros
df <- data.frame(colA=sample(c(0,1), 100, replace=TRUE, prob=c(0.8,0.2)),
                 colB=sample(c(0,1), 100, replace=TRUE, prob=c(0.99,0.01)),
                 colC=sample(c(0,1), 100, replace=TRUE, prob=c(0.5,0.5)),
                 colD=sample(c(0,1), 100, replace=TRUE, prob=c(0,1)),
                 colE=rep(0, 100))

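# Helper: TRUE where a value is zero
fct <- function(x) x == 0

# Count the zeros in each column (passing fct directly avoids the deprecated funs())
zero_count <- df %>% mutate_all(fct) %>% summarise_all(sum)

# Keep columns in which at most 90% of the values are zero
col_filter <- unlist(zero_count) <= 0.9 * nrow(df)

df_filter <- df[, col_filter, drop = FALSE]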
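The same filter can also be written as a single `select()` call. This is a sketch that assumes dplyr 1.0.0 or later, where the `where()` selection helper is available. The data frame `toy` and its columns `p` and `q` are made up for illustration.

```r
library(dplyr)

# Hypothetical example data: p is 95% zeros, q is 50% zeros
toy <- data.frame(p = c(rep(0, 19), 1),
                  q = rep(c(0, 1), 10))

# Keep only the columns whose share of zeros is at most 90%
toy_filtered <- toy %>% select(where(~ mean(.x == 0) <= 0.9))
names(toy_filtered)
```

This version does not need the intermediate zero count, and `select()` always returns a data frame.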

3 answers

Answer 1
1 2018-12-18 08:24:11

Answer 2
0 2018-12-18 08:18:07

Answer 3
0 2018-12-18 08:41:01