简体   繁体   中英

Flagging groups in which all members fulfill a certain requirement in R

Suppose the data below:

GroupId <-          c(1,1,1,1,2,2,2,3,3)
IndId <-            c(1,1,2,2,3,4,4,5,5)
IndGroupProperty <- c(1,2,1,2,3,3,4,5,6)
PropertyType <-     c(1,2,1,2,2,2,1,2,2)

df <- data.frame(GroupId, IndId, IndGroupProperty, PropertyType)
df

These are multi-level data, where each group GroupId consists of one or multiple individuals IndId having access to one or more properties IndGroupProperty , which are unique to their respective group (ie property 1 belongs to group 1 and no other group). These properties each belong to a type PropertyType .

The task is to flag each row with a dummy variable where there is at least one type-1 property belonging to each individual in the group.

For our sample data, this simply is:

ValidGroup <-       c(1,1,1,1,0,0,0,0,0)
df <- data.frame(df, ValidGroup)
df

The first four rows are flagged with a 1, because each individual (1, 2) of group (1) has access to a type-1 property (1). The three subsequent rows belong to group (2), in which only individual (4) has access to a type-1 property (4). Thus these are not flagged (0). The last two rows also receives no flag. Group (3) consists only of a single individual (5) with access to two type-2 properties (5, 6).

I have looked into several commands: levels seems to lack group support; getGroups in the nlme package does not like the input of my real data; I guess that there might be something useful in doBy , but summaryBy does not seem to take levels as a function.

Solution EDIT: dplyr solution by Henrik wrapped into a function:

foobar <- function(object, group, ind, type){
groupvar <- deparse(substitute(group)) 
indvar <- deparse(substitute(ind)) 
typevar <- deparse(substitute(type)) 
eval(substitute(
object[, c(groupvar, indvar, typevar)] %.%
  group_by(group, ind) %.%
  mutate(type1 = any(type == 1))  %.%
  group_by(group, add = FALSE) %.%
  mutate(ValidGroup = all(type1) * 1) %.%
  select(-type1)
  ))
}

You could also try ave :

# for each individual within group, calculate number of 1s in PropertyType
v1 <- with(df, ave(PropertyType, list(GroupId, IndId), FUN = function(x) sum(x == 1)))

# within each group, check if all v1 is 1.
# The boolean result is coerced to 1 and 0 by ave.  
df$ValidGroup <- ave(v1, df$GroupId, FUN = function(x) all(x == 1))

#   GroupId IndId IndGroupProperty PropertyType ValidGroup
# 1       1     1                1            1          1
# 2       1     1                2            2          1
# 3       1     2                1            1          1
# 4       1     2                2            2          1
# 5       2     3                3            2          0
# 6       2     4                3            2          0
# 7       2     4                4            1          0
# 8       3     5                5            2          0
# 9       3     5                6            2          0

Edit Added dplyr alternative and benchmark for data sets of different size: original data, and data that are 10 and 100 times larger than original.

First wrap up the alternatives in functions:

fun_ave <- function(df){
  v1 <- with(df, ave(PropertyType, list(GroupId, IndId), FUN = function(x) sum(x == 1)))
df$ValidGroup <- ave(v1, list(df$GroupId), FUN = function(x) all(x == 1))
df  
}

library(dplyr)
fun_dp <- function(df){
df %.%
  group_by(GroupId, IndId) %.%
  mutate(
    type1 = any(PropertyType == 1)) %.%
  group_by(GroupId, add = FALSE) %.%
  mutate(
    ValidGroup = all(type1) * 1) %.%
  select(-type1)
}


fun_by <- function(df){
  bar <- by(data=df,INDICES=df$GroupId,FUN=function(xx){
    foo <- table(xx$IndId,xx$PropertyType)
    if ( !("1" %in% colnames(foo)) ) {
      return(FALSE)   # no PropertyType=1 at all in this group
    } else {
      return(all(foo[,"1"]>0))    # return whether all IndId have an 1 entry
    }})
  cbind(df,ValidGroup = as.integer(bar[as.character(df$GroupId)]))
}

Benchmarks

Original data:

microbenchmark(
  fun_ave(df),
  fun_dp(df),
  fun_by(df))

# Unit: microseconds
#        expr      min        lq    median        uq       max neval
# fun_ave(df)  497.964  519.8215  538.8275  563.5355   651.535   100
#  fun_dp(df)  851.861  870.6765  931.1170  968.5590  1760.360   100
#  fun_by(df) 1343.743 1412.5455 1464.6225 1581.8915 12588.607   100

On a tiny data set ave is about twice as fast as dplyr and more than 2.5 times faster than by .

Generate some larger data; 10 times the number of groups and individuals

GroupId <- sample(1:30, 100, replace = TRUE)
IndId <- sample(1:50, 100, replace = TRUE)
PropertyType <- sample(1:2, 100, replace = TRUE)
df2 <- data.frame(GroupId, IndId, PropertyType)

microbenchmark(
  fun_ave(df2),
  fun_dp(df2),
  fun_by(df2))
# Unit: milliseconds
#          expr      min       lq    median        uq       max neval
#  fun_ave(df2) 2.928865 3.185259  3.270978  3.435002  5.151457   100
#   fun_dp(df2) 1.079176 1.231226  1.273610  1.352866  2.717896   100
#   fun_by(df2) 9.464359 9.855317 10.137180 10.484994 12.445680   100

dplyr is three times faster than ave and nearly 10 times faster than by .

100 times the number of groups and individuals

GroupId <- sample(1:300, 1000, replace = TRUE)
IndId <- sample(1:500, 1000, replace = TRUE)
PropertyType <- sample(1:2, 1000, replace = TRUE)
df2 <- data.frame(GroupId, IndId, PropertyType)

microbenchmark(
  fun_ave(df2),
  fun_dp(df2),
  fun_by(df2))

# Unit: milliseconds
# expr        min         lq    median        uq      max neval
# fun_ave(df2) 337.889895 392.983915 413.37554 441.58179 549.5516   100
#  fun_dp(df2)   3.253872   3.477195   3.58173   3.73378  75.8730   100
#  fun_by(df2)  92.248791 102.122733 104.09577 109.99285 186.6829   100

ave is really loosing ground now. dplyr is nearly 30 times faster than by , and more than 100 times faster than ave .

Try this:

bar <- by(data=df,INDICES=df$GroupId,FUN=function(xx){
    foo <- table(xx$IndId,xx$PropertyType)
    if ( !("1" %in% colnames(foo)) ) {
        return(FALSE)   # no PropertyType=1 at all in this group
    } else {
        return(all(foo[,"1"]>0))    # return whether all IndId have an 1 entry
    }})
cbind(df,bar[as.character(df$GroupId)])

The key is using by() to apply a function by a grouping variable, here your df$GroupId . The function to apply is an anonymous function. For each chunk (defined by the grouping variable), it creates a table of the IndId and PropertyType entries. It then looks whether "1" appears at all in the PropertyType - if not, it returns FALSE , if yes, it looks whether every IndId has at least one "1" entry (ie, whether all entries in the "1" column of the table are >0).

We store the result of the by() call in a structure bar , which is named according to the levels in the grouping variable. This in turn allows us to roll the result back out to the original data.frame . Note how I am using as.character() here to make sure the integers are interpreted as entry names , not entry numbers . Bad Things often happen when things have names that can be interpreted as numbers.

If you really want a 0-1 result instead of TRUE-FALSE , just add an as.numeric() .


EDIT. Let's turn this into a function.

foobar <- function(object, group, ind, type) {
    bar <- by(data=object,INDICES=object[,group],FUN=function(xx){
        foo <- table(xx[,ind],xx[,type])
        if ( !("1" %in% colnames(foo)) ) {
            return(FALSE)   # no PropertyType=1 at all in this group
        } else {
            return(all(foo[,"1"]>0))    # return whether all IndId have an 1 entry
        }})
    cbind(object,bar[as.character(object[,group])])
}

foobar(df,"GroupId","IndId","PropertyType")

This still requires that the target be exactly "1", but of course this could also be included in the function definition as a parameter. Just be sure to keep column names and variables that contain column names straight.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM