Suppose the data below:
GroupId <- c(1,1,1,1,2,2,2,3,3)
IndId <- c(1,1,2,2,3,4,4,5,5)
IndGroupProperty <- c(1,2,1,2,3,3,4,5,6)
PropertyType <- c(1,2,1,2,2,2,1,2,2)
df <- data.frame(GroupId, IndId, IndGroupProperty, PropertyType)
df
These are multi-level data, where each group GroupId
consists of one or multiple individuals IndId
having access to one or more properties IndGroupProperty
, which are unique to their respective group (ie property 1 belongs to group 1 and no other group). These properties each belong to a type PropertyType
.
The task is to flag each row with a dummy variable where there is at least one type-1 property belonging to each individual in the group.
For our sample data, this simply is:
ValidGroup <- c(1,1,1,1,0,0,0,0,0)
df <- data.frame(df, ValidGroup)
df
The first four rows are flagged with a 1, because each individual (1, 2) of group (1) has access to a type-1 property (1). The three subsequent rows belong to group (2), in which only individual (4) has access to a type-1 property (4). Thus these are not flagged (0). The last two rows also receives no flag. Group (3) consists only of a single individual (5) with access to two type-2 properties (5, 6).
I have looked into several commands: levels
seems to lack group support; getGroups
in the nlme
package does not like the input of my real data; I guess that there might be something useful in doBy
, but summaryBy
does not seem to take levels
as a function.
Solution EDIT: dplyr
solution by Henrik wrapped into a function:
foobar <- function(object, group, ind, type){
groupvar <- deparse(substitute(group))
indvar <- deparse(substitute(ind))
typevar <- deparse(substitute(type))
eval(substitute(
object[, c(groupvar, indvar, typevar)] %.%
group_by(group, ind) %.%
mutate(type1 = any(type == 1)) %.%
group_by(group, add = FALSE) %.%
mutate(ValidGroup = all(type1) * 1) %.%
select(-type1)
))
}
You could also try ave
:
# for each individual within group, calculate number of 1s in PropertyType
v1 <- with(df, ave(PropertyType, list(GroupId, IndId), FUN = function(x) sum(x == 1)))
# within each group, check if all v1 is 1.
# The boolean result is coerced to 1 and 0 by ave.
df$ValidGroup <- ave(v1, df$GroupId, FUN = function(x) all(x == 1))
# GroupId IndId IndGroupProperty PropertyType ValidGroup
# 1 1 1 1 1 1
# 2 1 1 2 2 1
# 3 1 2 1 1 1
# 4 1 2 2 2 1
# 5 2 3 3 2 0
# 6 2 4 3 2 0
# 7 2 4 4 1 0
# 8 3 5 5 2 0
# 9 3 5 6 2 0
Edit Added dplyr
alternative and benchmark for data sets of different size: original data, and data that are 10 and 100 times larger than original.
First wrap up the alternatives in functions:
fun_ave <- function(df){
v1 <- with(df, ave(PropertyType, list(GroupId, IndId), FUN = function(x) sum(x == 1)))
df$ValidGroup <- ave(v1, list(df$GroupId), FUN = function(x) all(x == 1))
df
}
library(dplyr)
fun_dp <- function(df){
df %.%
group_by(GroupId, IndId) %.%
mutate(
type1 = any(PropertyType == 1)) %.%
group_by(GroupId, add = FALSE) %.%
mutate(
ValidGroup = all(type1) * 1) %.%
select(-type1)
}
fun_by <- function(df){
bar <- by(data=df,INDICES=df$GroupId,FUN=function(xx){
foo <- table(xx$IndId,xx$PropertyType)
if ( !("1" %in% colnames(foo)) ) {
return(FALSE) # no PropertyType=1 at all in this group
} else {
return(all(foo[,"1"]>0)) # return whether all IndId have an 1 entry
}})
cbind(df,ValidGroup = as.integer(bar[as.character(df$GroupId)]))
}
Benchmarks
Original data:
microbenchmark(
fun_ave(df),
fun_dp(df),
fun_by(df))
# Unit: microseconds
# expr min lq median uq max neval
# fun_ave(df) 497.964 519.8215 538.8275 563.5355 651.535 100
# fun_dp(df) 851.861 870.6765 931.1170 968.5590 1760.360 100
# fun_by(df) 1343.743 1412.5455 1464.6225 1581.8915 12588.607 100
On a tiny data set ave
is about twice as fast as dplyr
and more than 2.5 times faster than by
.
Generate some larger data; 10 times the number of groups and individuals
GroupId <- sample(1:30, 100, replace = TRUE)
IndId <- sample(1:50, 100, replace = TRUE)
PropertyType <- sample(1:2, 100, replace = TRUE)
df2 <- data.frame(GroupId, IndId, PropertyType)
microbenchmark(
fun_ave(df2),
fun_dp(df2),
fun_by(df2))
# Unit: milliseconds
# expr min lq median uq max neval
# fun_ave(df2) 2.928865 3.185259 3.270978 3.435002 5.151457 100
# fun_dp(df2) 1.079176 1.231226 1.273610 1.352866 2.717896 100
# fun_by(df2) 9.464359 9.855317 10.137180 10.484994 12.445680 100
dplyr
is three times faster than ave
and nearly 10 times faster than by
.
100 times the number of groups and individuals
GroupId <- sample(1:300, 1000, replace = TRUE)
IndId <- sample(1:500, 1000, replace = TRUE)
PropertyType <- sample(1:2, 1000, replace = TRUE)
df2 <- data.frame(GroupId, IndId, PropertyType)
microbenchmark(
fun_ave(df2),
fun_dp(df2),
fun_by(df2))
# Unit: milliseconds
# expr min lq median uq max neval
# fun_ave(df2) 337.889895 392.983915 413.37554 441.58179 549.5516 100
# fun_dp(df2) 3.253872 3.477195 3.58173 3.73378 75.8730 100
# fun_by(df2) 92.248791 102.122733 104.09577 109.99285 186.6829 100
ave
is really loosing ground now. dplyr
is nearly 30 times faster than by
, and more than 100 times faster than ave
.
Try this:
bar <- by(data=df,INDICES=df$GroupId,FUN=function(xx){
foo <- table(xx$IndId,xx$PropertyType)
if ( !("1" %in% colnames(foo)) ) {
return(FALSE) # no PropertyType=1 at all in this group
} else {
return(all(foo[,"1"]>0)) # return whether all IndId have an 1 entry
}})
cbind(df,bar[as.character(df$GroupId)])
The key is using by()
to apply a function by a grouping variable, here your df$GroupId
. The function to apply is an anonymous function. For each chunk (defined by the grouping variable), it creates a table
of the IndId
and PropertyType
entries. It then looks whether "1" appears at all in the PropertyType
- if not, it returns FALSE
, if yes, it looks whether every IndId
has at least one "1" entry (ie, whether all entries in the "1" column of the table
are >0).
We store the result of the by()
call in a structure bar
, which is named according to the levels in the grouping variable. This in turn allows us to roll the result back out to the original data.frame
. Note how I am using as.character()
here to make sure the integers are interpreted as entry names , not entry numbers . Bad Things often happen when things have names that can be interpreted as numbers.
If you really want a 0-1 result instead of TRUE-FALSE
, just add an as.numeric()
.
EDIT. Let's turn this into a function.
foobar <- function(object, group, ind, type) {
bar <- by(data=object,INDICES=object[,group],FUN=function(xx){
foo <- table(xx[,ind],xx[,type])
if ( !("1" %in% colnames(foo)) ) {
return(FALSE) # no PropertyType=1 at all in this group
} else {
return(all(foo[,"1"]>0)) # return whether all IndId have an 1 entry
}})
cbind(object,bar[as.character(object[,group])])
}
foobar(df,"GroupId","IndId","PropertyType")
This still requires that the target be exactly "1", but of course this could also be included in the function definition as a parameter. Just be sure to keep column names and variables that contain column names straight.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.