How to use or/and in dplyr to subset a data.frame

I would like to subset a data.frame with a combination of or/and conditions. This is my code using base R:

df <- expand.grid(list(A = seq(1, 5), B = seq(1, 5), C = seq(1, 5)))
df$value <- seq(1, nrow(df))

df[(df$A == 1 & df$B == 3) |
    (df$A == 3 & df$B == 2),]

How can I write the same subset with the filter function from the dplyr package? Thanks for any suggestions.

dplyr solution:

Load the library:

library(dplyr)

Filter with the same condition as above:

df %>% filter(A == 1 & B == 3 | A == 3 & B == 2)
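One way to check that the dplyr call selects the same rows as the base R version is to compare the two results directly. The sketch below assumes the `df` built in the question. The extra parentheses are optional, because `&` binds more tightly than `|` in R, but they make the grouping easier to read. The names `base_rows` and `dplyr_rows` are only for illustration.

```r
library(dplyr)

# Same data as in the question
df <- expand.grid(list(A = seq(1, 5), B = seq(1, 5), C = seq(1, 5)))
df$value <- seq(1, nrow(df))

# Base R subsetting
base_rows <- df[(df$A == 1 & df$B == 3) | (df$A == 3 & df$B == 2), ]

# dplyr, with the grouping written out explicitly
dplyr_rows <- df %>% filter((A == 1 & B == 3) | (A == 3 & B == 2))

# Compare on the value column, which uniquely identifies each row
identical(base_rows$value, dplyr_rows$value)
```

The comparison uses the `value` column rather than the whole data frame because the two approaches may differ in row names even when they select the same rows.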

You could use subset() and [ as well. Here are some different methods and their respective benchmarks on a larger data set.

df <- expand.grid(A = 1:100, B = 1:100, C = 1:100)
df$value <- 1:nrow(df)

library(dplyr); library(microbenchmark)
f1 <- function() subset(df, A == 1 & B == 3 | A == 3 & B == 2)
f2 <- function() filter(df, A == 1 & B == 3 | A == 3 & B == 2)
f3 <- function() df[with(df, A == 1 & B == 3 | A == 3 & B == 2), ]
f4 <- function() df[(df$A == 1 & df$B == 3) | (df$A == 3 & df$B == 2),]

microbenchmark(subset = f1(), filter = f2(), with = f3(), "$" = f4())
# Unit: milliseconds
#    expr      min       lq     mean   median       uq      max neval
#  subset 47.42671 49.99802 75.95385 92.24430 96.05960 141.2964   100
#  filter 36.94019 38.77325 60.22831 42.64112 84.35896 155.0145   100
#    with 38.90918 44.36299 71.29214 86.39629 88.89008 134.7670   100
#       $ 40.22723 44.08606 71.32186 86.71372 89.59275 133.1132   100
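Before reading much into the timings, it may be worth confirming that the four functions really return the same rows. The following is a minimal sketch that reuses the definitions of `f1` to `f4` from the benchmark; the `results` list is a helper introduced here for illustration.

```r
library(dplyr)

df <- expand.grid(A = 1:100, B = 1:100, C = 1:100)
df$value <- 1:nrow(df)

f1 <- function() subset(df, A == 1 & B == 3 | A == 3 & B == 2)
f2 <- function() filter(df, A == 1 & B == 3 | A == 3 & B == 2)
f3 <- function() df[with(df, A == 1 & B == 3 | A == 3 & B == 2), ]
f4 <- function() df[(df$A == 1 & df$B == 3) | (df$A == 3 & df$B == 2), ]

# Collect the value column from each method; it uniquely identifies a row
results <- lapply(list(subset = f1, filter = f2, with = f3, "$" = f4),
                  function(f) f()$value)

# One logical per method: does it match the first method's result?
sapply(results, identical, results[[1]])
```

This check only holds for data without missing values in `A` and `B`, which is the case for a grid built with `expand.grid`.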

Interesting. I was trying to see the difference in terms of the resulting dataset, and I couldn't find an explanation for why the good old "[" operator behaved differently:

# Subset for year = 2013
sub <- brfss2013 %>% filter(iyear == "2013")
dim(sub)
#[1] 486088    330
sum(is.na(sub$iyear))   # number of NA values in iyear
#[1] 0

sub2 <- filter(brfss2013, iyear == "2013")
dim(sub2)
#[1] 486088    330
sum(is.na(sub2$iyear))
#[1] 0

sub3 <- brfss2013[brfss2013$iyear == "2013", ]
dim(sub3)
#[1] 486093    330
sum(is.na(sub3$iyear))
#[1] 5

sub4 <- subset(brfss2013, iyear == "2013")
dim(sub4)
#[1] 486088    330
sum(is.na(sub4$iyear))
#[1] 0
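
The five extra rows are consistent with how each tool treats missing values in the condition. When `iyear` is `NA`, the comparison `iyear == "2013"` evaluates to `NA`. Base R's `[` returns an all-`NA` row for every `NA` in a logical index, whereas `filter()` and `subset()` drop those rows. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the BRFSS data) to show the effect. That `brfss2013$iyear` contains exactly five `NA` values is an inference from the output above, not something verified here.

```r
library(dplyr)

# Hypothetical toy data with one missing year
toy <- data.frame(iyear = c("2013", "2014", NA, "2013"),
                  x = 1:4,
                  stringsAsFactors = FALSE)

# `[` keeps a row of NAs for every NA in the logical index
by_bracket <- toy[toy$iyear == "2013", ]
nrow(by_bracket)              # 3: two matches plus one all-NA row
sum(is.na(by_bracket$iyear))  # 1

# filter() and subset() treat NA as "no match"
nrow(filter(toy, iyear == "2013"))  # 2
nrow(subset(toy, iyear == "2013"))  # 2

# Making `[` agree: convert the logical index to positions with which()
by_which <- toy[which(toy$iyear == "2013"), ]
nrow(by_which)  # 2
```

Wrapping the condition in `which()` is one common way to get `[` to match the other two, since `which()` returns only the positions where the condition is `TRUE`.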
