Remove columns that satisfy a condition from a big data.frame in R

Question

I have a big data.frame: 100,000 observations of 700 variables.

Most of the variables are 0 in every observation, and I would like to remove those variables/columns.

I tried the following:

# negate the test so that the all-zero columns are dropped rather than kept
data <- data[!apply(data, 2, function(x){all(x == 0)})]

But the apply call took a long time to finish.

I also tried a while loop, in case the problem was processing all the data at once.

i <- 1
while (i <= ncol(data)) {
  if (all(data[i] == 0)) {
    data[i] <- NULL
  } else {
    i <- i+1
  }
}

But I kept having the same problem: it took a long time.

So,

Why does that operation take THAT long? Even though the data.frame is big, the operation is pretty simple.

And, above all,

Is there any way to do this faster?

Answer 1

Your question is confusing. I assume you want to remove variables, i.e., columns. You can use any, which automatically coerces the values to type logical. The usual warnings regarding comparison of floating point numbers apply. If you want to play it safe, you'll need to test whether the doubles are smaller than some precision value, which will be slower, but getting it right is often more important.

DF <- data.frame(x = 1:3, y = 1:3/10, z = 0)
DF[] <- lapply(DF, function(x) if (any(x)) x else NULL)
#Warning messages:
#1: In any(x) : coercing argument of type 'double' to logical
#2: In any(x) : coercing argument of type 'double' to logical
DF
#  x   y
#1 1 0.1
#2 2 0.2
#3 3 0.3

set.seed(42)
DF2 <- as.data.frame(matrix(sample(0:1, 700*1e5, TRUE, prob = c(0.999999, 0.000001)), ncol = 700))

system.time(DF2[] <- lapply(DF2, function(x) if (any(x)) x else NULL))
#user  system elapsed 
#0.10    0.02    0.11
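A variant of the same idea, as a minimal sketch rather than the answer's own code: build a logical mask with vapply and subset the data frame once. Comparing with `x != 0` avoids the coercion warnings. The name `keep` is illustrative, and the sketch assumes numeric columns without missing values.

```r
# Small illustrative data frame: z is all zero
DF <- data.frame(x = 1:3, y = 1:3 / 10, z = 0)

# TRUE for every column that has at least one non-zero value
keep <- vapply(DF, function(x) any(x != 0), logical(1))

# List-style indexing returns a data.frame even if only one column remains
DF <- DF[keep]
names(DF)
```

Because `keep` is computed before anything is removed, the data frame is modified only once.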

Safer option:

set.seed(42)
DF2 <- as.data.frame(matrix(sample(0:1, 700*1e5, TRUE, prob = c(0.999999, 0.000001)), ncol = 700))

system.time(DF2[] <- lapply(DF2, function(x) if (any(x > 1e-16)) x else NULL))
#user  system elapsed 
#0.34    0.11    0.45
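Note that `x > 1e-16` only detects positive values, so a column holding only negative numbers would be dropped as well. A hedged sketch of a symmetric tolerance test using abs() follows. The helper name `drop_near_zero_cols` and the default `tol` are assumptions chosen for illustration; a suitable tolerance depends on the scale of the data, and the sketch assumes all columns are numeric.

```r
# Hypothetical helper: drop columns whose values are all within tol of zero
drop_near_zero_cols <- function(df, tol = 1e-16) {
  keep <- vapply(
    df,
    function(x) any(abs(x) > tol, na.rm = TRUE),  # NA values are ignored
    logical(1)
  )
  df[keep]
}

# a has a negative entry, b is below the tolerance, c is exactly zero
DF3 <- data.frame(a = c(0, -2, 0), b = c(0, 1e-20, 0), c = c(0, 0, 0))
names(drop_near_zero_cols(DF3))
```

Here only column `a` survives, because its negative entry exceeds the tolerance in absolute value.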

Answer 2

Using a vectorized operation like colSums speeds up the operation on my machine:

> set.seed(123)
> df = data.frame(matrix(sample(0:1,100000*700,replace = T,prob = c(0.9999999,0.0000001)), ncol = 700))
> system.time(df1 <- df[apply(df, 2, function(x){all(x == 0)})])
user  system elapsed 
1.386   0.821   2.225 
> system.time(df2 <- df[,which(colSums(df)==0)])
user  system elapsed 
0.243   0.082   0.326 
> identical(df1, df2)
[1] TRUE
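
Both expressions above select the all-zero columns, which is the opposite of what the question asks for; to drop them, the condition has to be negated. In addition, `colSums(df) == 0` is only a reliable test when values cannot cancel out (for example, -1 and 1 in the same column sum to zero). A minimal sketch that counts non-zero entries instead, on a small made-up data frame with no missing values:

```r
# Illustrative data: b is all zero, c sums to zero but is not all zero
df <- data.frame(a = c(1, 0, 0), b = c(0, 0, 0), c = c(-1, 1, 0))

# Count the non-zero entries per column and keep columns with at least one
keep <- colSums(df != 0) > 0

# drop = FALSE keeps a data.frame even when a single column is selected
df_nonzero <- df[, keep, drop = FALSE]
names(df_nonzero)
```

With 0/1 data as in the benchmark, `colSums(df) != 0` gives the same mask; the `df != 0` form is the more defensive choice for general numeric data.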


Answer 1
1 accepted 2017-03-30 10:50:59

Answer 2
1 2017-03-30 10:55:10
