Check if all values in a data frame meet a condition (the condition is a vector)

I have a data frame that looks like this:

muestra[1:10,2:5]
##       X0 X1 X2 X3
## 21129  0  0  0  0
## 34632  0  0  0  0
## 30612  0  0  0  0
## 10687  0  0  1  2
## 44815  0  0  0  1
## 40552  0  0  0  1
## 15311  0  0  0  0
## 33960  0  0  0  0
## 24073  0  0  0  0
## 13077  0  0  0  0

I'm comparing the rows against a particular vector of values:

muestra[1:10,2:5] == c(0,0,0,0)
##         X0   X1    X2    X3
## 21129 TRUE TRUE  TRUE  TRUE
## 34632 TRUE TRUE  TRUE  TRUE
## 30612 TRUE TRUE  TRUE  TRUE
## 10687 TRUE TRUE FALSE FALSE
## 44815 TRUE TRUE  TRUE FALSE
## 40552 TRUE TRUE  TRUE FALSE
## 15311 TRUE TRUE  TRUE  TRUE
## 33960 TRUE TRUE  TRUE  TRUE
## 24073 TRUE TRUE  TRUE  TRUE
## 13077 TRUE TRUE  TRUE  TRUE

The values in the comparison vector might change; i.e. it could be c(0,0,1,0), c(1,2,1,2), and so on.

I'd like to check whether each full row meets the condition. Is there a function that returns something like this:

some_function(muestra[1:10,2:5], c(0,0,0,0))
##        some_function(muestra[1:10,2:5], c(0,0,0,0))
## 21129                                       TRUE
## 34632                                       TRUE
## 30612                                       TRUE
## 10687                                      FALSE
## 44815                                      FALSE
## 40552                                      FALSE
## 15311                                       TRUE
## 33960                                       TRUE
## 24073                                       TRUE
## 13077                                       TRUE

You are looking for all(). Apply all() to each row.

Let's consider a more general target vector, say y <- c(0,0,1,0); then we could do:

x <- muestra[1:10,2:5]
apply(x == rep(y, each = nrow(x)), 1, all)
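The rep(y, each = nrow(x)) step matters because R recycles a vector down the columns of a data frame or matrix, not across its rows. A minimal sketch, using a made-up three-row data frame `toy` (hypothetical, not the muestra data), shows how a plain `== y` can misalign the values:

```r
# Toy data, made up for illustration; every row equals c(0, 1, 0, 1)
toy <- data.frame(X0 = c(0, 0, 0), X1 = c(1, 1, 1),
                  X2 = c(0, 0, 0), X3 = c(1, 1, 1))
y <- c(0, 1, 0, 1)

# Plain comparison: y is recycled down each column, so values get misaligned
naive <- apply(toy == y, 1, all)

# Repeating each element nrow(toy) times lines y[j] up with column j
aligned <- apply(toy == rep(y, each = nrow(toy)), 1, all)

print(naive)    # second row is reported as a mismatch
print(aligned)  # every row matches, as expected
```

The comparison in the question happens to give the right answer only because every element of c(0,0,0,0) is the same.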

apply is inefficient because it is not vectorized: it calls all() once per row. For this job I would choose rowSums():

rowSums(x == rep(y, each = nrow(x))) == ncol(x)
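To get something shaped like the some_function() asked for in the question, the rowSums() expression could be wrapped in a small helper. This is only a sketch: the name `rows_equal`, the length check, and the hand-typed `muestra_sub` data frame are assumptions made for illustration.

```r
# Hypothetical helper; name and input check are assumptions, not from the answer
rows_equal <- function(x, y) {
  stopifnot(length(y) == ncol(x))
  # Expand y so that element j is compared with every entry of column j
  hits <- x == rep(y, each = nrow(x))
  # A row matches when all ncol(x) comparisons are TRUE
  rowSums(hits) == ncol(x)
}

# The ten rows printed in the question, retyped by hand
muestra_sub <- data.frame(
  X0 = rep(0, 10),
  X1 = rep(0, 10),
  X2 = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0),
  X3 = c(0, 0, 0, 2, 1, 1, 0, 0, 0, 0),
  row.names = c("21129", "34632", "30612", "10687", "44815",
                "40552", "15311", "33960", "24073", "13077")
)

res <- rows_equal(muestra_sub, c(0, 0, 0, 0))
print(res)
```

Rows 10687, 44815 and 40552 come back FALSE, matching the output requested in the question. Note that a row containing NA yields NA rather than FALSE with this approach.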

I'm happy to run a benchmark, too. This is the first time I've learned that there is a function col. But it seems that using rep is somewhat faster (median of about 325 ms versus 462 ms below):

set.seed(123)
x <- matrix(sample(1e7), ncol = 10)
y <- sample(10)

library(microbenchmark)
microbenchmark("  ZL_apply:" = apply(x == rep(y, each = nrow(x)), 1, all),
               "ZL_rowSums:" = rowSums(x == rep(y, each = nrow(x))) == ncol(x),
               "        DA:" = rowSums(x == y[col(x)]) == ncol(x))

Unit: milliseconds
        expr       min        lq      mean    median        uq       max neval
   ZL_apply: 3278.6738 3312.5376 3349.2760 3347.4750 3378.5720 3506.4211   100
 ZL_rowSums:  314.2683  318.1528  331.2623  324.5413  336.5447  427.5261   100
         DA:  422.7039  432.3683  461.4871  461.8067  476.1305  624.4142   100
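The timings are only comparable if the three expressions return the same answer. A quick sanity check on a small made-up 0/1 matrix (the data and the variable names `by_apply`, `by_rowsums`, `by_col` are assumptions for illustration) might look like this:

```r
set.seed(1)
# Small made-up matrix of 0/1 values so that some rows can actually match
x <- matrix(sample(0:1, 40, replace = TRUE), ncol = 4)
y <- c(0, 1, 0, 1)
x[3, ] <- y  # force at least one matching row

by_apply   <- apply(x == rep(y, each = nrow(x)), 1, all)
by_rowsums <- rowSums(x == rep(y, each = nrow(x))) == ncol(x)
by_col     <- rowSums(x == y[col(x)]) == ncol(x)

# All three approaches should agree element by element
print(identical(by_apply, by_rowsums))
print(identical(by_rowsums, by_col))
```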

Pardon me for not liking by-row operations. I would combine col with rowSums instead:

df <- muestra[1:10, 2:5]  # the subset shown in the question
rowSums(df == c(0,0,0,0)[col(df)]) == ncol(df)
# 21129 34632 30612 10687 44815 40552 15311 33960 24073 13077 
#  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE 
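For readers who have not met col() before, here is a small sketch of what the indexing step does. The matrix `m` and the names `idx` and `expanded` are made up for illustration.

```r
m <- matrix(0, nrow = 3, ncol = 4)
y <- c(0, 0, 1, 0)

# col(m) has the same shape as m; every entry holds its own column number
idx <- col(m)
print(idx)

# Indexing y by that matrix yields a vector in column-major order,
# the same values that rep(y, each = nrow(m)) produces
expanded <- y[idx]
print(expanded)
print(identical(expanded, rep(y, each = nrow(m))))
```

So y[col(x)] and rep(y, each = nrow(x)) are two ways of building the same expanded vector; they differ only in how it is constructed.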

A benchmark:

set.seed(123)
df <- as.data.frame(matrix(sample(1e7), ncol = 10)) 
vec <- sample(10)

library(microbenchmark)
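# Note: df == vec recycles vec down the columns, so the "ZL" expression is
# timed as written here but does not compare each row against vec.
microbenchmark("ZL: " = apply(df == vec, 1, all),
               "DA: " = rowSums(df == vec[col(df)]) == ncol(df))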

# Unit: milliseconds
# expr      min        lq      mean    median        uq      max neval cld
# ZL:  2262.580 2386.5286 2421.7244 2420.6767 2454.1483 2592.888   100   b
# DA:   786.121  807.1531  836.7408  827.1577  849.9955 1038.139   100  a 
