简体   繁体   中英

subsetting with dplyr or other packages in R

So we have this basic data

A <- c(1,1,1,2,2,2,3,3,3)
B <- c(1,0,0,1,0,0,1,0,0)
C <- c(2,2,3,4,3,3,3,4,3)
Data <- data.frame(A,B,C)

And we now want to filter it such that

We look at each case where C=3 and B=1 and find it A value using this A value, we want to search for C=4 and B=0 and delete it

So graphically,

在此输入图像描述

we want to remove the red row and keep the green one

I ideally want to use dplyr but am open to other alternatives

Base R, I think this is what you need:

# Column A values to consider for filter
C3_B1 <- Data[Data$C==3 & Data$B==1,"A"]

# Filter out rows where C==4 and B==0
Data[ !(Data$A %in% C3_B1 &
          Data$C==4 &
          Data$B==0),]

# Output - row 8 is removed
#   A B C
# 1 1 1 2
# 2 1 0 2
# 3 1 0 3
# 4 2 1 4
# 5 2 0 3
# 6 2 0 3
# 7 3 1 3
# 9 3 0 3

Using sqldf package:

sqldf(c("DELETE FROM Data 
          WHERE A IN (SELECT A FROM Data 
                       WHERE  C = 3 AND B = 1) 
                AND C = 4 
                AND B = 0",
        "SELECT * FROM Data "))

With dplyr

Data %>% filter(!(A==select(filter(Data, C==3 & B==1), A)[,1] & C==4 & B ==0))

Its a two step process in essence.

The 1st one is:

select(filter(Data, C==3 & B==1), A)

Filter Data where C==3 and B==1 then select only column A . lets name this temp .

The 2nd step is:

Data %>% filter(!(A==temp[,1] & C==4 & B ==0))

a simple filtering.

Here's another data.table approach. First, read it to a keyed data.table:

require(data.table)
DT <- data.table(Data,key="C,B,A")

Identify the A value the OP doesn't like

myA <- DT[J(3,1)]$A

Exclude

DT[!J(4,0,myA)]

This will also work if myA is a vector.

You mention other alternatives, so here is data.table

library(data.table)
 d=as.data.table(Data)
 d[!((A==d[B==1 & C==3,A])&(C==4&B==0))]

But not sure how this will work on a more complicated case with more rows to delete.

Another solution, probably a bit slower than some of the above. But might be a bit more transparent to the reader. Also easier to throw into a function if you want to run different values of B and C.

a_values <- Data %>% 
  filter(B == 1, C == 3) %>% 
  select(A) %>% 
  unique()

Data <- Data %>% 
  filter(!(A %in% a_values & B == 0 & C == 4))

Edit: forgot to negate!

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM