Removing columns that satisfy a condition from a big data.frame in R
I have a big data.frame: 100,000 observations of 700 variables.
Most of the variables are actually 0 in all observations, and I would like to remove those variables/columns.
I tried the following:
data <- data[!apply(data, 2, function(x) all(x == 0))]
But the apply call took a long time to complete.
I tried a while loop, in case the problem was working with all the data at once.
i <- 1
while (i <= ncol(data)) {
  if (all(data[i] == 0)) {
    data[i] <- NULL
  } else {
    i <- i + 1
  }
}
But I kept having the same problem: it took a long time.
So,
and, above all
Your question is confusing. I assume you want to remove variables, i.e., columns. You can use any, which automatically coerces the values to type logical. The usual warnings regarding comparison of floating point numbers apply. If you want to play it safe, you'll need to test whether the doubles are smaller than some precision value, which will be slower, but getting it right is often more important.
DF <- data.frame(x = 1:3, y = 1:3/10, z = 0)
DF[] <- lapply(DF, function(x) if (any(x)) x else NULL)
#Warning messages:
#1: In any(x) : coercing argument of type 'double' to logical
#2: In any(x) : coercing argument of type 'double' to logical
DF
#  x   y
#1 1 0.1
#2 2 0.2
#3 3 0.3
set.seed(42)
DF2 <- as.data.frame(matrix(sample(0:1, 700*1e5, TRUE, prob = c(0.999999, 0.000001)), ncol = 700))
system.time(DF2[] <- lapply(DF2, function(x) if (any(x)) x else NULL))
#user system elapsed
#0.10 0.02 0.11
Safer option:
set.seed(42)
DF2 <- as.data.frame(matrix(sample(0:1, 700*1e5, TRUE, prob = c(0.999999, 0.000001)), ncol = 700))
system.time(DF2[] <- lapply(DF2, function(x) if (any(x > 1e-16)) x else NULL))
#user system elapsed
#0.34 0.11 0.45
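The tolerance test above only looks at positive values, so a column holding only negative numbers would be treated as all zero. A minimal sketch of a sign-agnostic variant follows; the helper name `drop_zero_cols`, the default `tol`, and the example data frame `DF3` are illustrative assumptions and not part of the original answer. It also assumes every column is numeric.

```r
# Hypothetical helper: keep columns with at least one entry whose
# magnitude exceeds `tol` (an assumed tolerance; tune it to your data).
# na.rm = TRUE means NA entries are ignored, so an all-NA column is dropped.
drop_zero_cols <- function(df, tol = 1e-16) {
  keep <- vapply(df, function(x) any(abs(x) > tol, na.rm = TRUE), logical(1))
  df[keep]
}

DF3 <- data.frame(x = c(-1, -2, -3), y = c(0, 0.5, 0), z = 0)
names(drop_zero_cols(DF3))
```

Using `vapply` to build a logical mask and subsetting once avoids relying on assigning NULL list elements back into the data frame.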
Using a vectorized operation like colSums speeds up the operation on my machine:
> set.seed(123)
> df = data.frame(matrix(sample(0:1,100000*700,replace = T,prob = c(0.9999999,0.0000001)), ncol = 700))
> system.time(df1 <- df[apply(df, 2, function(x){all(x == 0)})])
user system elapsed
1.386 0.821 2.225
> system.time(df2 <- df[,which(colSums(df)==0)])
user system elapsed
0.243 0.082 0.326
> identical(df1, df2)
[1] TRUE
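Note that `colSums(df) == 0` selects the all-zero columns rather than dropping them, and that a column sum can also be zero when positive and negative entries cancel. A sketch that drops the all-zero columns by counting non-zero entries instead, assuming all columns are numeric (the data frame `d` and the names `nonzero` and `d_kept` are illustrative):

```r
d <- data.frame(a = c(0, 0, 0), b = c(1, -1, 0), c = c(0, 2, 0))

# `d != 0` is a logical matrix; its column sums count non-zero entries,
# so column b is kept even though its values sum to zero.
nonzero <- colSums(d != 0) > 0

# drop = FALSE keeps a data frame even if only one column survives.
d_kept <- d[, nonzero, drop = FALSE]
names(d_kept)
```

This stays vectorized like the colSums approach above, at the cost of building one logical matrix the size of the data.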