
R data.table: Filter for rows by condition in multiple variables

I have a filtering problem with the following data.table and hope that someone can help me with it. I am not sure whether there is an easy way to do this. This is my problem:

A   B     C  Area
aa  M+H   1  127427
aa  M+H   2  204051.5
aa  M+Na  1  6855539.48777
aa  M+Na  2  6469689
bb  M+H   1  15330650
bb  M+H   2  214221
bb  M+H   3  11357158
bb  M+K   1  2140221
bb  M+K   2  61715568

For each AB group (aa M+H, aa M+Na, bb M+H, bb M+K), every row with C > 1 should be filtered out if its Area is higher than the Area of the row with the same AB combination and C = 1 (each ABC combination exists only once in the table). After that step, the following rows should be left:

A   B     C  Area
aa  M+H   1  127427
aa  M+Na  1  6855539.48777
aa  M+Na  2  6469689
bb  M+H   1  15330650
bb  M+H   2  214221
bb  M+H   3  11357158
bb  M+K   1  2140221

After that, I would like to filter out every row that has a higher Area than the row in the same AC group (aa 1, aa 2, bb 1, bb 2) whose B value is "M+H". So this should be left:

A   B     C  Area
aa  M+H   1  127427
aa  M+Na  2  6469689
bb  M+H   1  15330650
bb  M+H   2  214221
bb  M+H   3  11357158
bb  M+K   1  2140221

Finally, I want to get rid of all AB groups (aa M+H, aa M+Na, bb M+H, bb M+K) that no longer have a row with C = 1. So only these rows should remain:

A   B     C  Area
aa  M+H   1  127427
bb  M+H   1  15330650
bb  M+H   2  214221
bb  M+H   3  11357158
bb  M+K   1  2140221

I was trying to get this done with data.table, but if dplyr is better suited I would also be happy with a solution using it. Thank you very much for your time and effort!

Yasel

Welcome to SO!

Following your instructions, I arrive at a different result than yours, but you might be able to adapt this to your needs:

library(data.table)

DT <- data.table(stringsAsFactors=FALSE,
                 A = c("aa", "aa", "aa", "aa", "bb", "bb", "bb", "bb", "bb"),
                 B = c("M+H", "M+H", "M+Na", "M+Na", "M+H", "M+H", "M+H", "M+K",
                       "M+K"),
                 C = c(1L, 2L, 1L, 2L, 1L, 2L, 3L, 1L, 2L),
                 Area = c(127427, 204051.5, 6855539.48777, 6469689, 15330650, 214221,
                          11357158, 2140221, 61715568)
)

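# Step 1: join every row to the C == 1 row of its A/B group (i.Area is that
# row's Area) and keep rows with a smaller Area, plus the C == 1 rows themselves
DT <- DT[DT[C==1], on=.(A, B)][i.Area-Area > 0 | C==1]
DT[, c("i.C", "i.Area") := NULL]

# Step 2: join every row to the "M+H" row of its A/C group (i.Area is the
# "M+H" Area). As written, this keeps rows whose Area is at least the "M+H" Area
# and drops rows without an "M+H" partner; use >= 0 to keep the smaller ones instead
DT <- DT[DT[B=="M+H"], on=.(A, C)][i.Area-Area <= 0]
DT[, c("i.B", "i.Area") := NULL]

# Step 3: keep only A/B groups that still contain a C == 1 row
DT <- DT[DT[C==1], on=.(A, B)]
DT[, c("i.C", "i.Area") := NULL]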
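If the goal is to reproduce the tables from the question exactly, a grouped variant may be easier to adapt than the joins. The following is only a sketch under a few assumptions: the function name `filter_question_rules`, the helper column `keep`, and the table name `DT2` are made up for illustration, "higher" is read as strictly greater, and a row whose group has no reference row (no C == 1 row in rule 1, no "M+H" row in rule 2) is kept at that stage.

```r
library(data.table)

# Same sample data as in the question
DT2 <- data.table(
  A = c("aa", "aa", "aa", "aa", "bb", "bb", "bb", "bb", "bb"),
  B = c("M+H", "M+H", "M+Na", "M+Na", "M+H", "M+H", "M+H", "M+K", "M+K"),
  C = c(1L, 2L, 1L, 2L, 1L, 2L, 3L, 1L, 2L),
  Area = c(127427, 204051.5, 6855539.48777, 6469689, 15330650, 214221,
           11357158, 2140221, 61715568)
)

# Hypothetical helper, not part of the original answer
filter_question_rules <- function(dt) {
  dt <- copy(dt)  # work on a copy so the input table is not modified by reference

  # Rule 1: per A/B group, drop rows whose Area exceeds the Area of the C == 1 row
  dt[, keep := {
    ref <- Area[C == 1L][1L]            # NA if the group has no C == 1 row
    C == 1L | is.na(ref) | Area <= ref
  }, by = .(A, B)]
  dt <- dt[keep == TRUE]

  # Rule 2: per A/C group, drop rows whose Area exceeds the Area of the "M+H" row
  dt[, keep := {
    ref <- Area[B == "M+H"][1L]         # NA if the group has no "M+H" row
    B == "M+H" | is.na(ref) | Area <= ref
  }, by = .(A, C)]
  dt <- dt[keep == TRUE]

  # Rule 3: drop A/B groups that no longer contain a C == 1 row
  dt[, keep := any(C == 1L), by = .(A, B)]
  dt <- dt[keep == TRUE]

  dt[, keep := NULL]
  dt[]
}

res <- filter_question_rules(DT2)
print(res)
```

With the sample data this should leave the five rows listed at the end of the question. Whether rows without a reference row should be kept or dropped in rule 2 is a design choice; the question's intermediate table keeps them.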

This isn't the most glamorous solution, but some variation of it might get you there:

library(data.table)

A <- c(rep("aa",4),rep("bb",5))
B <- c(rep("M+H",2),rep("M+Na",2),rep("M+H",3),rep("M+K",2))
C <- c(1,2,1,2,1,2,3,1,2)
Area <- c(127427,204051.5,6855539.48777,6469689,15330650,214221,11357158,2140221,61715568)
# data.table() keeps C and Area numeric; cbind() would coerce every column to character
DT <- data.table(A, B, C, Area)

# Keep the C == 1 row of each A/B group and any row with a smaller Area than that row
DT[, ABFilter := fifelse(C == 1 | Area < Area[C == 1][1],
                         "Keep", "Discard", na = "Discard"), by = .(A, B)]
DT <- DT[ABFilter=="Keep",]
DT[, ABFilter := NULL]
DT

# Keep "M+H" rows and any row with a smaller Area than the "M+H" row of the same
# A/C group; rows without an "M+H" partner are discarded
DT[, ACFilter := fifelse(B == "M+H" | Area < Area[B == "M+H"][1],
                         "Keep", "Discard", na = "Discard"), by = .(A, C)]
DT <- DT[ACFilter=="Keep",]
DT[, ACFilter := NULL]
DT

# Keep only the A/B groups that still contain a row with C == 1
DT[, ABCFilter := fifelse(any(C == 1), "Keep", "Discard"), by = .(A, B)]
DT <- DT[ABCFilter=="Keep",]
DT[, ABCFilter := NULL]
setorder(DT, A, B, C)
DT

I'm not so clear on the rules you're using either. It looks like the row with Area = 11357158 should be retained because it is smaller than the corresponding row with C = 1, and the row with Area = 6855539.48777 should be retained because it is greater than the corresponding row with B = M+H:

    A    B C          Area
1: aa  M+H 1        127427
2: aa M+Na 1 6855539.48777
3: aa M+Na 2       6469689
4: bb  M+H 1      15330650
5: bb  M+H 2        214221
6: bb  M+H 3      11357158
7: bb  M+K 1       2140221
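Since the question also asks about dplyr, the same three rules can be written as grouped `filter()` calls. This is a rough sketch rather than a tested answer from the thread: the names `df` and `result` are illustrative, "higher" is read as strictly greater, and rows whose group has no reference row are assumed to stay in at that stage.

```r
library(dplyr)

# Same sample data as in the question
df <- data.frame(
  A = c("aa", "aa", "aa", "aa", "bb", "bb", "bb", "bb", "bb"),
  B = c("M+H", "M+H", "M+Na", "M+Na", "M+H", "M+H", "M+H", "M+K", "M+K"),
  C = c(1L, 2L, 1L, 2L, 1L, 2L, 3L, 1L, 2L),
  Area = c(127427, 204051.5, 6855539.48777, 6469689, 15330650, 214221,
           11357158, 2140221, 61715568),
  stringsAsFactors = FALSE
)

result <- df %>%
  # Rule 1: per A/B group, drop rows whose Area exceeds the Area of the C == 1 row
  group_by(A, B) %>%
  filter(C == 1L | !any(C == 1L) | Area <= Area[C == 1L][1]) %>%
  # Rule 2: per A/C group, drop rows whose Area exceeds the Area of the "M+H" row
  group_by(A, C) %>%
  filter(B == "M+H" | !any(B == "M+H") | Area <= Area[B == "M+H"][1]) %>%
  # Rule 3: drop A/B groups that no longer contain a C == 1 row
  group_by(A, B) %>%
  filter(any(C == 1L)) %>%
  ungroup()

print(result)
```

Each `group_by()` call replaces the previous grouping, so the three rules are applied one after another on the already filtered rows.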
