Removing rows that have all zero values within one group in R
I have a data set similar to the one below:
d <- data.frame(A=c(11,11,11,11,21,21,111,111,111,44,44,44),
B=c(0,1,0,0,0,0,1,0,0,0,0,0),
C=c(3,2,1,3,4,2,1,2,3,12,22,31))
d
A B C
1 11 0 3
2 11 1 2
3 11 0 1
4 11 0 3
5 21 0 4
6 21 0 2
7 111 1 1
8 111 0 2
9 111 0 3
10 44 0 12
11 44 0 22
12 44 0 31
I want to remove every group of rows sharing the same A in which all values of B are zero. For example, when A=11 there is a row with B=1 (the 2nd row), so that group is kept. By contrast, for A=21 all B's equal zero, so I want to remove all rows with A=21. For A=44 all B's are zero again, so I want to remove all rows where A=44.
Finally, I need to get this data frame:
new_d
A B C
1 11 0 3
2 11 1 2
3 11 0 1
4 11 0 3
5 111 1 12
6 111 0 22
7 111 0 31
P.S. Don't worry about column C; I've added it just to show that there are more than 2 columns in the data set.
You can use ave and logical subsetting like this:
d[!!ave(d$B, d$A, FUN=function(i) !all(i == 0)),]
A B C
1 11 0 3
2 11 1 2
3 11 0 1
4 11 0 3
7 111 1 1
8 111 0 2
9 111 0 3
Here, !all(i == 0) returns TRUE when the vector contains a non-zero element. ave performs this check on each group and returns a vector the same size as the initial vector, and !! converts it into a logical vector. This conversion is necessary because ave returns a vector of the same type as the initial vector, which here is numeric. A more explicit alternative to !! is as.logical:
d[as.logical(ave(d$B, d$A, FUN=function(i) !all(i == 0))),]
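To see why the conversion matters, it can help to look at what ave returns before it is used as an index. The following is a minimal sketch on a cut-down version of the data; the names flag, kept and wrong are only illustrative.

```r
# Small example: group 11 has a non-zero B, group 21 does not
d <- data.frame(A = c(11, 11, 21, 21), B = c(0, 1, 0, 0))

# ave() writes the per-group result back into a copy of d$B,
# so the logical TRUE/FALSE values are coerced to numeric 1/0
flag <- ave(d$B, d$A, FUN = function(i) !all(i == 0))
print(flag)
print(class(flag))

# Converted to logical, the vector acts as a row filter
kept <- d[as.logical(flag), ]

# Left as numeric, it is treated as row positions instead:
# 1 selects the first row (twice here) and 0 selects nothing
wrong <- d[flag, ]
```

Without !! or as.logical, the numeric vector would be read as row positions rather than as a keep/drop mask, so the subset would silently return the wrong rows.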
Or use a simple dplyr operation (by the way, I believe your expected output is off):
library(dplyr)
d %>% group_by(A) %>% filter(sum(B) >= 1)
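The sum(B) >= 1 test works for this data because B only holds 0 and 1. If B could contain other values, a filter that states the condition directly may be safer. The sketch below assumes dplyr is installed; the negative value in B is made up purely to show the difference, and by_sum and by_any are illustrative names.

```r
library(dplyr)

# Hypothetical data: group 111 has non-zero B values that sum to zero
d2 <- data.frame(A = c(11, 11, 21, 21, 111, 111),
                 B = c(0, 1, 0, 0, -1, 1))

# Filter from the answer above: drops group 111 because its sum is 0
by_sum <- d2 %>% group_by(A) %>% filter(sum(B) >= 1) %>% ungroup()

# Variant: keep a group if any of its B values is non-zero
by_any <- d2 %>% group_by(A) %>% filter(any(B != 0)) %>% ungroup()

print(by_sum)
print(by_any)
```

On the original 0/1 data both filters should keep the same groups; they only diverge when values can cancel out or are fractional.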
How about a base R solution:
d[d$A %in% d$A[d$B!=0], ]
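The one-liner above packs two steps into a single expression. Spelled out on the data from the question, it might look like the sketch below; keep_groups and keep_rows are illustrative names, and resetting the row names is optional.

```r
d <- data.frame(A = c(11, 11, 11, 11, 21, 21, 111, 111, 111, 44, 44, 44),
                B = c(0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0),
                C = c(3, 2, 1, 3, 4, 2, 1, 2, 3, 12, 22, 31))

# Step 1: collect the values of A that have at least one non-zero B
keep_groups <- unique(d$A[d$B != 0])

# Step 2: flag every row whose A belongs to one of those groups
keep_rows <- d$A %in% keep_groups

# Subset, then renumber the rows from 1 (optional)
new_d <- d[keep_rows, ]
rownames(new_d) <- NULL
print(new_d)
```

Because the membership test is vectorised over the whole column, no per-group function call is needed, which is a plausible reason for the timings reported next.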
It's also pretty fast:
library(microbenchmark)
library(dplyr)
set.seed(33) ## making a larger example
A <- do.call(c, lapply(sample(10000, 2000), function(x) rep(x, sample(100, 1))))
B <- sample(c(0,1), length(A), replace = TRUE, prob = c(18/19, 1/19))
C <- sample(10^5, length(A), replace = TRUE)
df <- data.frame(A, B, C)
superBase <- function(d) {d[d$A %in% d$A[d$B!=0], ]}
aveStat <- function(d) {d[!!ave(d$B, d$A, FUN=function(i) !all(i == 0)),]}
dplyrSol <- function(d) {d %>% group_by(A) %>% filter(sum(B) >= 1)}
microbenchmark(superBase(df), aveStat(df), dplyrSol(df))
Unit: milliseconds
expr min lq mean median uq max neval cld
superBase(df) 21.44030 23.81434 30.00466 26.67157 27.32492 167.1614 100 a
aveStat(df) 34.23338 39.03278 49.12483 40.29534 42.96865 204.0808 100 b
dplyrSol(df) 63.52571 65.32626 71.64950 67.20563 69.43784 215.5980 100 c
It gives the same results:
identical(superBase(df), aveStat(df))
[1] TRUE