
Removing columns that satisfy a condition from a big data.frame in R

I have a big data.frame: 100,000 observations of 700 variables.

Most of the variables are 0 in every observation, and I would like to remove those variables/columns.

I tried the following:

# Keep only the columns that are not all zero
data <- data[!apply(data, 2, function(x) all(x == 0))]

But the apply call took a long time to finish.

I also tried a while loop, in case the problem was working with all the data at once.

i <- 1
while (i <= ncol(data)) {
  if (all(data[i] == 0)) {
    data[i] <- NULL
  } else {
    i <- i+1
  }
}

But I had the same problem: it took a long time.

So:

  • Why does that operation take THAT long? Even though the data.frame is big, the operation is pretty simple.

and, above all:

  • Is there any way to do this faster?

Your question is confusing. I assume you want to remove variables, i.e., columns. You can use any with automatic coercion of values to type logical. The usual warnings regarding comparison of floating point numbers apply. If you want to play it safe, you'll need to test whether the doubles are smaller than some precision value, which will be slower, but getting it right is often more important.

DF <- data.frame(x = 1:3, y = 1:3/10, z = 0)
DF[] <- lapply(DF, function(x) if (any(x)) x else NULL)
#Warning messages:
#1: In any(x) : coercing argument of type 'double' to logical
#2: In any(x) : coercing argument of type 'double' to logical
DF
#  x   y
#1 1 0.1
#2 2 0.2
#3 3 0.3

set.seed(42)
DF2 <- as.data.frame(matrix(sample(0:1, 700*1e5, TRUE, prob = c(0.999999, 0.000001)), ncol = 700))

system.time(DF2[] <- lapply(DF2, function(x) if (any(x)) x else NULL))
#user  system elapsed 
#0.10    0.02    0.11 
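
A variation on the same idea, as a minimal sketch: build a logical mask with vapply and subset the columns once. The helper name drop_zero_cols is hypothetical (it is not part of the answer above), and comparing with x != 0 sidesteps the coercion warning. The sketch assumes numeric columns without NA values.

```r
# Hypothetical helper: drop every column whose values are all exactly zero.
# Assumes numeric columns with no NA values.
drop_zero_cols <- function(df) {
  # One TRUE/FALSE per column: does the column contain any non-zero value?
  keep <- vapply(df, function(x) any(x != 0), logical(1))
  # List-style indexing keeps the result a data.frame even if one column is left
  df[keep]
}

DF <- data.frame(x = 1:3, y = 1:3 / 10, z = 0)
res <- drop_zero_cols(DF)
names(res)
# [1] "x" "y"
```

Subsetting with a logical mask never assigns NULL into the data.frame, so the column removal happens in a single indexing step.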

Safer option:

set.seed(42)
DF2 <- as.data.frame(matrix(sample(0:1, 700*1e5, TRUE, prob = c(0.999999, 0.000001)), ncol = 700))

system.time(DF2[] <- lapply(DF2, function(x) if (any(x > 1e-16)) x else NULL))
#user  system elapsed 
#0.34    0.11    0.45 
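
Note that x > 1e-16 only catches positive values. If the columns can hold negative numbers, a tolerance check on the absolute value may be closer to what is wanted. A minimal sketch, where the helper name drop_near_zero_cols and the default tol are assumptions for illustration rather than part of the answer:

```r
# Hypothetical helper: treat values with |x| <= tol as zero.
# `tol` is an assumed default; pick it to suit the scale of your data.
drop_near_zero_cols <- function(df, tol = 1e-16) {
  # TRUE for columns holding at least one value outside the tolerance band
  keep <- vapply(df, function(x) any(abs(x) > tol), logical(1))
  df[keep]
}

# Made-up example: `a` has a real negative value, `b` only numerical noise
DF3 <- data.frame(a = c(0, -2, 0), b = c(0, 1e-20, 0), c = 0)
res3 <- drop_near_zero_cols(DF3)
names(res3)
# [1] "a"
```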

Using a vectorized operation like colSums speeds up the operation on my machine:

> set.seed(123)
> df = data.frame(matrix(sample(0:1,100000*700,replace = T,prob = c(0.9999999,0.0000001)), ncol = 700))
> system.time(df1 <- df[apply(df, 2, function(x){all(x == 0)})])
user  system elapsed 
1.386   0.821   2.225 
> system.time(df2 <- df[,which(colSums(df)==0)])
user  system elapsed 
0.243   0.082   0.326 
> identical(df1, df2)
[1] TRUE
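
As written, colSums(df) == 0 selects the all-zero columns, so both df1 and df2 above contain exactly the columns the question wants to drop. The following is a hedged sketch of the inverted selection. It counts non-zero entries instead of summing values, so positive and negative values cannot cancel out, and it uses drop = FALSE so the result stays a data.frame when only one column survives. The small data frame is made up for illustration.

```r
# Small illustrative data frame (made up for this sketch)
df <- data.frame(a = c(0, 0, 0), b = c(1, -1, 0), c = c(0, 2, 0))

# colSums(df) would report 0 for column b because 1 and -1 cancel;
# counting the non-zero entries per column avoids that.
nonzero_count <- colSums(df != 0)

# Keep columns with at least one non-zero entry; drop = FALSE keeps a data.frame
kept <- df[, nonzero_count > 0, drop = FALSE]
names(kept)
# [1] "b" "c"
```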

